Towards the Creation of a Diachronic Corpus for Italian: a Case Study on the GDLI Quotations

Manuel Favaro; Elisa Guadagnini; Eva Sassolini; Marco Biffi; Simonetta Montemagni
2022

Abstract

This paper describes experiments on a corpus derived from an authoritative historical Italian dictionary, the Grande dizionario della lingua italiana (‘Great Dictionary of the Italian Language’, GDLI for short). Thanks to the digitization and structuring of this dictionary, we have been able to set up the first nucleus of a diachronic annotated corpus that selects, according to specific criteria and distinguishing between prose and poetry, some of the quotations that illustrate the different definitions and sub-definitions within the entries. The GDLI contains a huge collection of quotations covering the entire history of the Italian language, from the Middle Ages to the present day. The corpus was enriched with linguistic annotation and used to train and evaluate NLP models for POS tagging and lemmatization, with promising results.
DC field Value Language
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Manuel Favaro en
dc.authority.people Elisa Guadagnini en
dc.authority.people Eva Sassolini en
dc.authority.people Marco Biffi en
dc.authority.people Simonetta Montemagni en
dc.authority.project DUS.AD017.115 / CNR4C - Regione Toscana en
dc.collection.id.s 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d *
dc.collection.name 04.01 Contribution in conference proceedings *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.area Not assigned *
dc.contributor.area Not assigned *
dc.contributor.area Not assigned *
dc.contributor.area Not assigned *
dc.date.accessioned 2025/02/25 17:54:40 -
dc.date.available 2025/02/25 17:54:40 -
dc.date.firstsubmission 2025/02/05 23:26:53 *
dc.date.issued 2022 -
dc.date.submission 2025/02/25 17:54:05 *
dc.description.allpeople Favaro, Manuel; Guadagnini, Elisa; Sassolini, Eva; Biffi, Marco; Montemagni, Simonetta -
dc.description.allpeopleoriginal Manuel Favaro, Elisa Guadagnini, Eva Sassolini, Marco Biffi, Simonetta Montemagni en
dc.description.fulltext open en
dc.description.international no en
dc.description.numberofauthors 5 -
dc.identifier.isbn 979-10-95546-78-8 en
dc.identifier.source manual *
dc.identifier.uri https://hdl.handle.net/20.500.14243/533922 -
dc.identifier.url http://www.lrec-conf.org/proceedings/lrec2022/workshops/LT4HALA/pdf/2022.lt4hala2022-1.13.pdf en
dc.language.iso eng en
dc.publisher.country FRA en
dc.publisher.name European Language Resources Association (ELRA) en
dc.publisher.place Paris en
dc.relation.alleditors Rachele Sprugnoli, Marco Passarotti en
dc.relation.conferencedate 20-25/06/2022 en
dc.relation.conferencename 2nd Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2022) en
dc.relation.conferenceplace Marseille en
dc.relation.firstpage 94 en
dc.relation.ispartofbook Proceedings of the 2nd Workshop on Language Technologies for Historical and Ancient Languages en
dc.relation.lastpage 100 en
dc.relation.numberofpages 7 en
dc.relation.projectAcronym - en
dc.relation.projectAwardNumber - en
dc.relation.projectAwardTitle DUS.AD017.115 / CNR4C - Regione Toscana en
dc.relation.projectFunderName - en
dc.relation.projectFundingStream - en
dc.subject.keywordseng Diachronic Corpus, Adaptation of Annotation Tools, Historical Dictionaries -
dc.subject.singlekeyword Diachronic Corpus *
dc.subject.singlekeyword Adaptation of Annotation Tools *
dc.subject.singlekeyword Historical Dictionaries *
dc.title Towards the Creation of a Diachronic Corpus for Italian: a Case Study on the GDLI Quotations en
dc.type.circulation International en
dc.type.driver info:eu-repo/semantics/conferenceObject -
dc.type.full 04 Conference contribution::04.01 Contribution in conference proceedings it
dc.type.impactfactor no en
dc.type.invited contribution en
dc.type.miur 273 -
dc.type.referee Anonymous experts en
iris.mediafilter.data 2025/04/03 04:08:53 *
iris.orcid.lastModifiedDate 2025/02/25 17:54:40 *
iris.orcid.lastModifiedMillisecond 1740502480086 *
iris.sitodocente.maxattempts 1 -
Appears in types: 04.01 Contribution in conference proceedings
Files in this item:
File: 2022.lt4hala-1.13.pdf
Access: open access
Type: Publisher's version (PDF)
License: Creative Commons
Size: 373.04 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/533922