Multilingual Word-by-word alignment. Methodology and some preliminary outcomes towards the construction of multilingual Lexicon within the "Traduzione del Talmud Babilonese" project

Angelo Mario Del Grosso
2019

Abstract

Textual scholars have long exploited multilingual resources in their daily work to better understand the primary sources they investigate. Bitexts, i.e., parallel texts, are useful in a number of cross-linguistic and comparative processing tasks. This talk presents the workflow adopted in the research activities conducted on the Italian translation of the Babylonian Talmud. More specifically, I will illustrate the ongoing work towards the construction of a multilingual Hebrew/Aramaic/Italian terminological resource by means of stochastic generative approaches to word-by-word text alignment. The related literature discusses many techniques for this task. The alignment tool I developed is grounded on generative models (i.e., IBM and HMM models), a family of unsupervised machine learning algorithms, and uses them to calculate the probability of linking two words in a multilingual term pair. From a technical standpoint, the adopted models are based on an alignment function and on an unsupervised training procedure that estimates the unknown probability distributions. Other machine learning approaches to word alignment rely instead on discriminative techniques, which are based on a target function and on a supervised learning process that exploits a labeled training data set. The implemented models have been widely adopted in the literary domain, as they can profitably handle interpretative bitexts, also modeling deletion, insertion, and transposition phenomena, without requiring an existing labeled data set. The workflow I will present encompasses four distinct phases:
1) The encoding of the parallel text, which has been carried out according to the latest TEI recommendations. In particular, the linking-target approach described in Module 16 of the Guidelines was used.
2) The semi-automatic extraction of the Italian terms, which has been carried out by means of linguistic analysis technologies available at the Institute of Computational Linguistics (ILC-CNR). These tools include a stochastic component for terminology extraction.
3) The addition of Hebrew/Aramaic terms to the extracted Italian ones via word-by-word alignment, which automatically processes the three main ancient languages appearing in the Talmud, namely Mishnaic Hebrew, Biblical Hebrew, and Babylonian Aramaic.
4) Finally, the revision of the obtained results through a purpose-built web application. This final step is devoted to building a ground truth and/or a gold training set that allows us to perform a complete validation of the alignment outcomes.
So far, 219,000 tokens have been analyzed, extracted from the four tractates of the Babylonian Talmud translated to date.
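The unsupervised estimation behind the generative alignment models mentioned in the abstract can be illustrated with IBM Model 1, the simplest member of the IBM family. The following is only a minimal sketch under stated assumptions, not the alignment tool described in the talk: the function names (`train_ibm_model1`, `align`), the `<NULL>` token convention, and the three-sentence toy bitext of transliterated Hebrew and Italian words are all invented for illustration and are not taken from the Talmud corpus.

```python
from collections import defaultdict

NULL = "<NULL>"  # hypothetical empty source word, lets target words stay unaligned


def train_ibm_model1(bitext, iterations=10):
    """Estimate t(target_word | source_word) with EM.

    bitext: list of (source_tokens, target_tokens) sentence pairs.
    Returns a dict-like table keyed by (target_word, source_word).
    """
    pairs = [([NULL] + list(src), list(tgt)) for src, tgt in bitext]
    target_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # uniform start: no labeled data needed

    for _ in range(iterations):
        count = defaultdict(float)  # expected counts of (target, source) links
        total = defaultdict(float)  # expected counts per source word
        # E-step: distribute each target word over the source words of its sentence
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalise expected counts into probabilities
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


def align(src, tgt, t):
    """Link every target word to its most probable source word (or NULL)."""
    candidates = [NULL] + list(src)
    return [(f, max(candidates, key=lambda e: t[(f, e)])) for f in tgt]


# Invented toy bitext (source: transliterated Hebrew, target: Italian).
bitext = [
    (["sefer", "gadol"], ["libro", "grande"]),
    (["bayit", "gadol"], ["casa", "grande"]),
    (["sefer"], ["libro"]),
]
table = train_ibm_model1(bitext)
print(align(["sefer", "gadol"], ["libro", "grande"], table))
```

Model 1 ignores word order, so transpositions cost nothing, while the `<NULL>` source word gives insertions a place to attach. The HMM and higher IBM models add position-dependent terms on top of the same expectation-maximisation loop.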
Istituto di linguistica computazionale "Antonio Zampolli" - ILC
bilingual word alignment
translation

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/406308