Optimizing LLMs for Italian: reducing token fertility and enhancing efficiency through vocabulary adaptation
Puccetti G.; Miaschi A.; Dell'Orletta F.; Esuli A.
2025
Abstract
The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7B-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multi-choice and generative tasks.

| DC field | Value | Language |
|---|---|---|
| dc.authority.orgunit | Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI | en |
| dc.authority.people | Moroni L. | en |
| dc.authority.people | Puccetti G. | en |
| dc.authority.people | Huguet Cabot P. -L. | en |
| dc.authority.people | Bejgu A. S. | en |
| dc.authority.people | Barba E. | en |
| dc.authority.people | Miaschi A. | en |
| dc.authority.people | Dell'Orletta F. | en |
| dc.authority.people | Esuli A. | en |
| dc.authority.people | Navigli R. | en |
| dc.collection.id.s | 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d | * |
| dc.collection.name | 04.01 Contributo in Atti di convegno | * |
| dc.contributor.appartenenza | Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI | * |
| dc.contributor.appartenenza | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | * |
| dc.contributor.appartenenza.mi | 918 | * |
| dc.contributor.appartenenza.mi | 973 | * |
| dc.contributor.area | Non assegn | * |
| dc.contributor.area | Non assegn | * |
| dc.contributor.area | Non assegn | * |
| dc.contributor.area | Non assegn | * |
| dc.date.accessioned | 2025/09/01 21:41:16 | - |
| dc.date.available | 2025/09/01 21:41:16 | - |
| dc.date.firstsubmission | 2025/08/26 16:58:47 | * |
| dc.date.issued | 2025 | - |
| dc.date.submission | 2025/08/26 17:12:17 | * |
| dc.description.abstracteng | The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7B-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multi-choice and generative tasks. | - |
| dc.description.allpeople | Moroni, L.; Puccetti, G.; Huguet Cabot, P. -L.; Bejgu, A. S.; Barba, E.; Miaschi, A.; Dell'Orletta, F.; Esuli, A.; Navigli, R. | - |
| dc.description.allpeopleoriginal | Moroni L.; Puccetti G.; Huguet Cabot P.-L.; Bejgu A.S.; Barba E.; Miaschi A.; Dell'Orletta F.; Esuli A.; Navigli R. | en |
| dc.description.fulltext | open | en |
| dc.description.numberofauthors | 9 | - |
| dc.identifier.doi | 10.18653/v1/2025.findings-naacl.371 | en |
| dc.identifier.isbn | 979-8-89176-195-7 | en |
| dc.identifier.scopus | 2-s2.0-105028681852 | - |
| dc.identifier.source | bibtex | * |
| dc.identifier.uri | https://hdl.handle.net/20.500.14243/552066 | - |
| dc.identifier.url | https://aclanthology.org/2025.findings-naacl.371/ | en |
| dc.language.iso | eng | en |
| dc.publisher.name | Association for Computational Linguistics | en |
| dc.relation.allauthors | Chiruzzo Luis, Ritter Alan, Wang Lu (eds.) | en |
| dc.relation.conferencedate | 29/04–04/05/2025 | en |
| dc.relation.conferencename | NAACL 2025 - Annual Conference of the Nations of the Americas Chapter. Findings of the Association for Computational Linguistics | en |
| dc.relation.conferenceplace | Albuquerque, New Mexico | en |
| dc.relation.firstpage | 6646 | en |
| dc.relation.ispartofbook | NAACL 2025 Findings proceedings | en |
| dc.relation.lastpage | 6660 | en |
| dc.relation.medium | ELECTRONIC | en |
| dc.relation.numberofpages | 15 | en |
| dc.subject.keywordseng | Large Language Models, Italian LLM, Vocabulary Adaptation | - |
| dc.subject.singlekeyword | Large Language Models | * |
| dc.subject.singlekeyword | Italian LLM | * |
| dc.subject.singlekeyword | Vocabulary Adaptation | * |
| dc.title | Optimizing LLMs for Italian: reducing token fertility and enhancing efficiency through vocabulary adaptation | en |
| dc.type.driver | info:eu-repo/semantics/conferenceObject | - |
| dc.type.full | 04 Contributo in convegno::04.01 Contributo in Atti di convegno | it |
| dc.type.miur | 273 | - |
| iris.mediafilter.data | 2025/09/02 04:01:00 | * |
| iris.orcid.lastModifiedDate | 2026/04/20 14:58:40 | * |
| iris.orcid.lastModifiedMillisecond | 1776689920253 | * |
| iris.scopus.extIssued | 2025 | - |
| iris.scopus.extTitle | Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation | - |
| iris.scopus.ideLinkStatusDate | 2026/04/20 14:58:40 | * |
| iris.scopus.ideLinkStatusMillisecond | 1776689920286 | * |
| iris.sitodocente.maxattempts | 1 | - |
| iris.unpaywall.bestoaversion | publishedVersion | * |
| iris.unpaywall.doi | 10.18653/v1/2025.findings-naacl.371 | * |
| iris.unpaywall.isoa | true | * |
| iris.unpaywall.journalisindoaj | false | * |
| iris.unpaywall.landingpage | https://doi.org/10.18653/v1/2025.findings-naacl.371 | * |
| iris.unpaywall.license | cc-by | * |
| iris.unpaywall.metadataCallLastModified | 28/04/2026 05:03:33 | - |
| iris.unpaywall.metadataCallLastModifiedMillisecond | 1777345413219 | - |
| iris.unpaywall.oastatus | gold | * |
| iris.unpaywall.pdfurl | https://aclanthology.org/2025.findings-naacl.371.pdf | * |
| scopus.category | 1712 | * |
| scopus.category | 1710 | * |
| scopus.category | 1708 | * |
| scopus.category | 1705 | * |
| scopus.contributor.affiliation | Sapienza University of Rome | - |
| scopus.contributor.affiliation | ISTI-CNR | - |
| scopus.contributor.affiliation | Sapienza University of Rome | - |
| scopus.contributor.affiliation | Babelscape | - |
| scopus.contributor.affiliation | Sapienza University of Rome | - |
| scopus.contributor.affiliation | ILC-CNR | - |
| scopus.contributor.affiliation | ILC-CNR | - |
| scopus.contributor.affiliation | ISTI-CNR | - |
| scopus.contributor.affiliation | Sapienza University of Rome | - |
| scopus.contributor.afid | 60032350 | - |
| scopus.contributor.afid | 60085207 | - |
| scopus.contributor.afid | 60032350 | - |
| scopus.contributor.afid | 60355563 | - |
| scopus.contributor.afid | 60032350 | - |
| scopus.contributor.afid | 100821753 | - |
| scopus.contributor.afid | 100821753 | - |
| scopus.contributor.afid | 60085207 | - |
| scopus.contributor.afid | 60032350 | - |
| scopus.contributor.auid | 59505682200 | - |
| scopus.contributor.auid | 57220748419 | - |
| scopus.contributor.auid | 57222466385 | - |
| scopus.contributor.auid | 58789690500 | - |
| scopus.contributor.auid | 57220543663 | - |
| scopus.contributor.auid | 57211678681 | - |
| scopus.contributor.auid | 57540567000 | - |
| scopus.contributor.auid | 15044356100 | - |
| scopus.contributor.auid | 6507102454 | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.name | Luca | - |
| scopus.contributor.name | Giovanni | - |
| scopus.contributor.name | Pere-Lluis Huguet | - |
| scopus.contributor.name | Andrei Stefan | - |
| scopus.contributor.name | Edoardo | - |
| scopus.contributor.name | Alessio | - |
| scopus.contributor.name | Felice | - |
| scopus.contributor.name | Andrea | - |
| scopus.contributor.name | Roberto | - |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.surname | Moroni | - |
| scopus.contributor.surname | Puccetti | - |
| scopus.contributor.surname | Cabot | - |
| scopus.contributor.surname | Bejgu | - |
| scopus.contributor.surname | Barba | - |
| scopus.contributor.surname | Miaschi | - |
| scopus.contributor.surname | Dell’Orletta | - |
| scopus.contributor.surname | Esuli | - |
| scopus.contributor.surname | Navigli | - |
| scopus.date.issued | 2025 | * |
| scopus.description.abstracteng | The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7B-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multi-choice and generative tasks. | * |
| scopus.description.allpeopleoriginal | Moroni L.; Puccetti G.; Cabot P.-L.H.; Bejgu A.S.; Barba E.; Miaschi A.; Dell'Orletta F.; Esuli A.; Navigli R. | * |
| scopus.differences | scopus.publisher.name | * |
| scopus.differences | scopus.relation.lastpage | * |
| scopus.differences | scopus.relation.conferencedate | * |
| scopus.differences | scopus.relation.firstpage | * |
| scopus.differences | scopus.description.allpeopleoriginal | * |
| scopus.differences | scopus.description.abstracteng | * |
| scopus.differences | scopus.relation.conferencename | * |
| scopus.differences | scopus.identifier.isbn | * |
| scopus.differences | scopus.relation.conferenceplace | * |
| scopus.document.type | cp | * |
| scopus.document.types | cp | * |
| scopus.funding.funders | 100031060 - European High Performance Computing Joint Undertaking; 501100000780 - European Commission; 501100003407 - Ministero dell’Istruzione, dell’Università e della Ricerca; 501100003407 - Ministero dell’Istruzione, dell’Università e della Ricerca; | * |
| scopus.funding.ids | HP10CY9V7K; 2022EPTPJ9; CUP B53C22001770006; PNRR - PRIN 2022; | * |
| scopus.identifier.doi | 10.18653/v1/2025.findings-naacl.371 | * |
| scopus.identifier.isbn | 9798891761957 | * |
| scopus.identifier.pui | 650048217 | * |
| scopus.identifier.scopus | 2-s2.0-105028681852 | * |
| scopus.journal.sourceid | 21101390114 | * |
| scopus.language.iso | eng | * |
| scopus.publisher.name | Association for Computational Linguistics (ACL) | * |
| scopus.relation.conferencedate | 2025 | * |
| scopus.relation.conferencename | 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, NAACL 2025 | * |
| scopus.relation.conferenceplace | usa | * |
| scopus.relation.firstpage | 6661 | * |
| scopus.relation.lastpage | 6675 | * |
| scopus.title | Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation | * |
| scopus.titleeng | Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation | * |
| Appears in types: | 04.01 Contributo in Atti di convegno | |

| File | Size | Format |
|---|---|---|
| 2025.findings-naacl.371.pdf (open access). Description: Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation. Type: Publisher's version (PDF). License: Creative Commons | 905 kB | Adobe PDF |
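
The abstract of this record describes tokenizer efficiency in terms of token "fertility". As a rough illustration only, fertility is commonly computed as the average number of tokens a tokenizer emits per word. The sketch below assumes whitespace-separated words and uses a toy character-chunk tokenizer as a stand-in; the names `token_fertility`, `chunk_tokenizer`, and `relative_reduction` are hypothetical, and neither the tokenizers nor the numbers correspond to the paper's models, corpus, or results.

```python
from typing import Callable, Iterable, List


def token_fertility(texts: Iterable[str],
                    tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word.

    Assumption: words are split on whitespace; the paper's exact word
    segmentation is not stated in this record.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("cannot compute fertility on an empty corpus")
    return total_tokens / total_words


def chunk_tokenizer(chunk_size: int) -> Callable[[str], List[str]]:
    """Toy stand-in tokenizer: cuts each word into fixed-size character chunks.

    A larger chunk size loosely mimics a vocabulary that covers longer
    pieces of the target language. This is NOT a real subword tokenizer.
    """
    def tokenize(text: str) -> List[str]:
        tokens: List[str] = []
        for word in text.split():
            tokens.extend(word[i:i + chunk_size]
                          for i in range(0, len(word), chunk_size))
        return tokens
    return tokenize


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which fertility drops when moving from `before` to `after`."""
    return (before - after) / before


# Illustrative toy corpus, not data from the paper.
corpus = ["il modello linguistico"]

original = token_fertility(corpus, chunk_tokenizer(3))  # short pieces
adapted = token_fertility(corpus, chunk_tokenizer(6))   # longer pieces

print(f"original fertility: {original:.2f}")
print(f"adapted fertility:  {adapted:.2f}")
print(f"relative reduction: {relative_reduction(original, adapted):.1%}")
```

To measure a real model, one would pass that model's own tokenizer function as `tokenize` and use a representative target-language corpus; lower fertility means fewer tokens per word, which is what makes encoding and inference cheaper.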
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


