Low- vs High-level Lemmatization for Historical Languages. A Case study on Italian
Chiara Alzetta; Simonetta Montemagni
2025
Abstract
Lemmatization remains a foundational yet challenging task in the processing of historical Italian texts, due to the complex interplay of orthographic, morphological, and diatopic variation. A crucial, yet often overlooked, aspect is the degree of normalization applied during lemmatization. A conservative approach preserves attested historical forms, ensuring greater linguistic fidelity but increasing data sparsity. Conversely, an abstract normalization strategy aligns historical variants with standardized contemporary lemmas, improving generalization but potentially introducing inaccurate mappings. In this paper, we present a comparative evaluation of conservative and normalized lemmatization strategies for historical Italian. To our knowledge, this is the first study to explicitly assess the impact of lemmatization strategies in the context of historical languages, particularly those that are morphologically rich. Our results indicate that high-level normalization offers a promising trade-off between precision and generalization.

| DC Field | Value | Language |
|---|---|---|
| dc.authority.orgunit | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | en |
| dc.authority.people | Chiara Alzetta | en |
| dc.authority.people | Simonetta Montemagni | en |
| dc.authority.project | Progetto PE PNRR "Cultural Heritage Innovation for Next-Gen Sustainable Society" | en |
| dc.collection.id.s | 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d | * |
| dc.collection.name | 04.01 Contributo in Atti di convegno | * |
| dc.contributor.appartenenza | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | * |
| dc.contributor.appartenenza.mi | 918 | * |
| dc.contributor.area | Non assegn | * |
| dc.contributor.area | Non assegn | * |
| dc.date.firstsubmission | 2026/03/04 19:04:20 | * |
| dc.date.issued | 2025 | - |
| dc.date.submission | 2026/03/04 19:04:20 | * |
| dc.description.abstracteng | Lemmatization remains a foundational yet challenging task in the processing of historical Italian texts, due to the complex interplay of orthographic, morphological, and diatopic variation. A crucial, yet often overlooked, aspect is the degree of normalization applied during lemmatization. A conservative approach preserves attested historical forms, ensuring greater linguistic fidelity but increasing data sparsity. Conversely, an abstract normalization strategy aligns historical variants with standardized contemporary lemmas, improving generalization but potentially introducing inaccurate mappings. In this paper, we present a comparative evaluation of conservative and normalized lemmatization strategies for historical Italian. To our knowledge, this is the first study to explicitly assess the impact of lemmatization strategies in the context of historical languages, particularly those that are morphologically rich. Our results indicate that high-level normalization offers a promising trade-off between precision and generalization. | - |
| dc.description.allpeople | Alzetta, Chiara; Montemagni, Simonetta | - |
| dc.description.allpeopleoriginal | Chiara Alzetta; Simonetta Montemagni | en |
| dc.description.fulltext | none | en |
| dc.description.numberofauthors | 2 | - |
| dc.identifier.isbn | 979-12-243-0587-3 | en |
| dc.identifier.source | manual | * |
| dc.identifier.uri | https://hdl.handle.net/20.500.14243/571223 | - |
| dc.identifier.url | https://aclanthology.org/2025.clicit-1.4.pdf | en |
| dc.language.iso | eng | en |
| dc.publisher.name | CEUR Workshop Proceedings | en |
| dc.relation.conferencedate | 24-26 September 2025 | en |
| dc.relation.conferencename | Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025) | en |
| dc.relation.conferenceplace | Cagliari | en |
| dc.relation.ispartofbook | Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025) | en |
| dc.relation.medium | ELECTRONIC | en |
| dc.relation.numberofpages | 10 | en |
| dc.relation.projectAcronym | CHANGES | en |
| dc.relation.projectAwardNumber | - | en |
| dc.relation.projectAwardTitle | Progetto PE PNRR "Cultural Heritage Innovation for Next-Gen Sustainable Society" | en |
| dc.relation.projectFunderName | MUR | en |
| dc.relation.projectFundingStream | - | en |
| dc.subject.keywordseng | Data-driven Lemmatization, Historical Italian, Universal Dependencies, Normalization | - |
| dc.subject.singlekeyword | Data-driven Lemmatization | * |
| dc.subject.singlekeyword | Historical Italian | * |
| dc.subject.singlekeyword | Universal Dependencies | * |
| dc.subject.singlekeyword | Normalization | * |
| dc.title | Low- vs High-level Lemmatization for Historical Languages. A Case study on Italian | en |
| dc.type.circulation | National | en |
| dc.type.driver | info:eu-repo/semantics/conferenceObject | - |
| dc.type.full | 04 Contributo in convegno::04.01 Contributo in Atti di convegno | it |
| dc.type.impactfactor | yes | en |
| dc.type.invited | contribution | en |
| dc.type.miur | 273 | - |
| dc.type.referee | Anonymous experts | en |
| iris.orcid.lastModifiedDate | 2026/03/04 19:04:20 | * |
| iris.orcid.lastModifiedMillisecond | 1772647460339 | * |
| iris.sitodocente.maxattempts | 1 | - |