Low- vs High-level Lemmatization for Historical Languages. A Case study on Italian

Chiara Alzetta; Simonetta Montemagni
2025

Abstract

Lemmatization remains a foundational yet challenging task in the processing of historical Italian texts, due to the complex interplay of orthographic, morphological, and diatopic variation. A crucial, yet often overlooked, aspect is the degree of normalization applied during lemmatization. A conservative approach preserves attested historical forms, ensuring greater linguistic fidelity but increasing data sparsity. Conversely, an abstract normalization strategy aligns historical variants with standardized contemporary lemmas, improving generalization but potentially introducing inaccurate mappings. In this paper, we present a comparative evaluation of conservative and normalized lemmatization strategies for historical Italian. To our knowledge, this is the first study to explicitly assess the impact of lemmatization strategies in the context of historical languages, particularly those that are morphologically rich. Our results indicate that high-level normalization offers a promising trade-off between precision and generalization.
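
The contrast between the two strategies can be made concrete with a small sketch. The token list below is a hypothetical toy sample of Italian variants chosen for illustration; it is not taken from the paper's data, and the function name `lemma_inventory` is an assumption. Counting distinct lemmas under each strategy shows why conservative lemmatization tends to increase sparsity while normalization merges variants under one contemporary lemma.

```python
from collections import Counter

# Hypothetical toy annotations (NOT from the paper's corpus):
# (token, conservative lemma, normalized contemporary lemma)
TOY_ANNOTATIONS = [
    ("core", "core", "cuore"),
    ("cori", "core", "cuore"),
    ("cuore", "cuore", "cuore"),
    ("pietade", "pietade", "pietà"),
    ("pietate", "pietate", "pietà"),
    ("pietà", "pietà", "pietà"),
    ("virtute", "virtute", "virtù"),
    ("virtù", "virtù", "virtù"),
]


def lemma_inventory(annotations, strategy):
    """Count how many tokens fall under each lemma for a given strategy."""
    column = {"conservative": 1, "normalized": 2}[strategy]
    return Counter(row[column] for row in annotations)


conservative = lemma_inventory(TOY_ANNOTATIONS, "conservative")
normalized = lemma_inventory(TOY_ANNOTATIONS, "normalized")

# More distinct lemmas means fewer training examples per lemma (sparsity).
print(len(conservative), "conservative lemmas;", len(normalized), "normalized lemmas")
```

In this toy sample the conservative inventory keeps every attested variant as its own lemma, whereas the normalized inventory groups them; the cost, as the abstract notes, is that a normalized mapping may be inaccurate for some historical forms.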
DC Field Value Language
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Chiara Alzetta en
dc.authority.people Simonetta Montemagni en
dc.authority.project Progetto PE PNRR "Cultural Heritage Innovation for Next-Gen Sustainable Society" en
dc.collection.id.s 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d *
dc.collection.name 04.01 Contribution in conference proceedings *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.date.firstsubmission 2026/03/04 19:04:20 *
dc.date.issued 2025 -
dc.date.submission 2026/03/04 19:04:20 *
dc.description.allpeople Alzetta, Chiara; Montemagni, Simonetta -
dc.description.allpeopleoriginal Chiara Alzetta; Simonetta Montemagni en
dc.description.fulltext none en
dc.description.numberofauthors 2 -
dc.identifier.isbn 979-12-243-0587-3 en
dc.identifier.source manual *
dc.identifier.uri https://hdl.handle.net/20.500.14243/571223 -
dc.identifier.url https://aclanthology.org/2025.clicit-1.4.pdf en
dc.language.iso eng en
dc.publisher.name CEUR Workshop Proceedings en
dc.relation.conferencedate 24-26 September 2025 en
dc.relation.conferencename Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025) en
dc.relation.conferenceplace Cagliari en
dc.relation.ispartofbook Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025) en
dc.relation.medium ELECTRONIC en
dc.relation.numberofpages 10 en
dc.relation.projectAcronym CHANGES en
dc.relation.projectAwardNumber - en
dc.relation.projectAwardTitle Progetto PE PNRR "Cultural Heritage Innovation for Next-Gen Sustainable Society" en
dc.relation.projectFunderName MUR en
dc.relation.projectFundingStream - en
dc.subject.keywordseng Data-driven Lemmatization, Historical Italian, Universal Dependencies, Normalization -
dc.subject.singlekeyword Data-driven Lemmatization *
dc.subject.singlekeyword Historical Italian *
dc.subject.singlekeyword Universal Dependencies *
dc.subject.singlekeyword Normalization *
dc.title Low- vs High-level Lemmatization for Historical Languages. A Case study on Italian en
dc.type.circulation National en
dc.type.driver info:eu-repo/semantics/conferenceObject -
dc.type.full 04 Contributo in convegno::04.01 Contributo in Atti di convegno it
dc.type.impactfactor yes en
dc.type.invited contribution en
dc.type.miur 273 -
dc.type.referee Anonymous experts en
iris.orcid.lastModifiedDate 2026/03/04 19:04:20 *
iris.orcid.lastModifiedMillisecond 1772647460339 *
iris.sitodocente.maxattempts 1 -
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/571223