CNR Institutional Research Information System

In the last decade, the demand for readily accessible corpora has touched all areas of natural language processing, including coreference resolution. However, it is one of the least considered sub-fields in recent developments. Moreover, almost all existing resources are only available for the English language. To overcome this lack, this work proposes a methodology to create a corpus for coreference resolution in Italian exploiting knowledge of annotated resources in other languages. Starting from OntonNotes, the methodology translates and refines English utterances to obtain utterances respecting Italian grammar, dealing with language-specific phenomena and preserving coreference and mentions. A quantitative and qualitative evaluation is performed to assess the well-formedness of generated utterances, considering readability, grammaticality, and acceptability indexes. The results have confirmed the effectiveness of the methodology in generating a good dataset for coreference resolution starting from an existing one. The goodness of the dataset is also assessed by training a coreference resolution model based on BERT language model, achieving the promising results. Even if the methodology has been tailored for English and Italian languages, it has a general basis easily extendable to other languages, adapting a small number of language-dependent rules to generalize most of the linguistic phenomena of the language under examination.

A multi-level methodology for the automated translation of a coreference resolution dataset: an application to the Italian language

Aniello Minutolo^Primo;Raffaele Guarasci;Emanuele Damiano;Giuseppe De Pietro;Hamido Fujita;Massimo Esposito^Ultimo

2022

Abstract

In the last decade, the demand for readily accessible corpora has touched all areas of natural language processing, including coreference resolution. However, it is one of the least considered sub-fields in recent developments. Moreover, almost all existing resources are only available for the English language. To overcome this lack, this work proposes a methodology to create a corpus for coreference resolution in Italian exploiting knowledge of annotated resources in other languages. Starting from OntonNotes, the methodology translates and refines English utterances to obtain utterances respecting Italian grammar, dealing with language-specific phenomena and preserving coreference and mentions. A quantitative and qualitative evaluation is performed to assess the well-formedness of generated utterances, considering readability, grammaticality, and acceptability indexes. The results have confirmed the effectiveness of the methodology in generating a good dataset for coreference resolution starting from an existing one. The goodness of the dataset is also assessed by training a coreference resolution model based on BERT language model, achieving the promising results. Even if the methodology has been tailored for English and Italian languages, it has a general basis easily extendable to other languages, adapting a small number of language-dependent rules to generalize most of the linguistic phenomena of the language under examination.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2022
			
	Strutture organizzative
	
				Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
			
	Parole chiave
	
				Coreference resolution
Corpus creation
Automated translation
Cross-language
Natural language processing
Linguistic phenomena
			
	Appare nelle tipologie:
	
				01.01 Articolo in rivista

File in questo prodotto:

File	Dimensione	Formato
prod_471302-doc_191359.pdf solo utenti autorizzati Descrizione: A multi-level methodology for the automated translation of a coreference resolution dataset - an application to the Italian language Tipologia: Versione Editoriale (PDF) Licenza: Creative commons Dimensione 2.53 MB Formato Adobe PDF Visualizza/Apri Richiedi una copia	2.53 MB	Adobe PDF	Visualizza/Apri Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/417077

Citazioni

ND

13

8

social impact