Optimizing LLMs for Italian: reducing token fertility and enhancing efficiency through vocabulary adaptation

Puccetti G.; Miaschi A.; Dell'Orletta F.; Esuli A.
2025

Abstract

The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7B-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multi-choice and generative tasks.
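
As an illustration of the "fertility" measure mentioned in the abstract, the sketch below computes token fertility as the average number of tokens per whitespace-separated word, which is the usual definition and is assumed here rather than taken from the paper. The `chunk_tokenizer` helper is a hypothetical stand-in for a real subword tokenizer, and the sample sentence is invented for the example; none of this reproduces the authors' code or results.

```python
from typing import Callable, List


def token_fertility(texts: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_words = sum(len(text.split()) for text in texts)
    n_tokens = sum(len(tokenize(text)) for text in texts)
    if n_words == 0:
        raise ValueError("input contains no words")
    return n_tokens / n_words


def chunk_tokenizer(size: int) -> Callable[[str], List[str]]:
    """Hypothetical toy tokenizer: cut every word into pieces of at most `size` characters."""
    def tokenize(text: str) -> List[str]:
        return [word[i:i + size] for word in text.split() for i in range(0, len(word), size)]
    return tokenize


# Invented sample text; a real measurement would use a held-out Italian corpus.
sample = ["la casa è molto bella"]

# Longer pieces stand in for a vocabulary better fitted to the language.
fertility_fine = token_fertility(sample, chunk_tokenizer(2))
fertility_coarse = token_fertility(sample, chunk_tokenizer(4))

# Relative reduction in fertility when moving from the fine to the coarse tokenizer.
relative_reduction = 1.0 - fertility_coarse / fertility_fine
print(f"fine={fertility_fine:.2f} coarse={fertility_coarse:.2f} reduction={relative_reduction:.0%}")
```

To measure a real model, `tokenize` could be replaced by any function that returns that model's token list for a string; lower fertility means fewer tokens per word, and therefore shorter sequences to process at inference time.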
DC Field Value Language
dc.authority.orgunit Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI en
dc.authority.people Moroni L. en
dc.authority.people Puccetti G. en
dc.authority.people Huguet Cabot P. -L. en
dc.authority.people Bejgu A. S. en
dc.authority.people Barba E. en
dc.authority.people Miaschi A. en
dc.authority.people Dell'Orletta F. en
dc.authority.people Esuli A. en
dc.authority.people Navigli R. en
dc.collection.id.s 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d *
dc.collection.name 04.01 Contributo in Atti di convegno *
dc.contributor.appartenenza Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.appartenenza.mi 973 *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.date.accessioned 2025/09/01 21:41:16 -
dc.date.available 2025/09/01 21:41:16 -
dc.date.firstsubmission 2025/08/26 16:58:47 *
dc.date.issued 2025 -
dc.date.submission 2025/08/26 17:12:17 *
dc.description.abstracteng The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7B-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multi-choice and generative tasks. -
dc.description.allpeople Moroni, L.; Puccetti, G.; Huguet Cabot, P. -L.; Bejgu, A. S.; Barba, E.; Miaschi, A.; Dell'Orletta, F.; Esuli, A.; Navigli, R. -
dc.description.allpeopleoriginal Moroni L.; Puccetti G.; Huguet Cabot P.-L.; Bejgu A.S.; Barba E.; Miaschi A.; Dell'Orletta F.; Esuli A.; Navigli R. en
dc.description.fulltext open en
dc.description.numberofauthors 9 -
dc.identifier.doi 10.18653/v1/2025.findings-naacl.371 en
dc.identifier.isbn 979-8-89176-195-7 en
dc.identifier.scopus 2-s2.0-105028681852 -
dc.identifier.source bibtex *
dc.identifier.uri https://hdl.handle.net/20.500.14243/552066 -
dc.identifier.url https://aclanthology.org/2025.findings-naacl.371/ en
dc.language.iso eng en
dc.publisher.name Association for Computational Linguistics en
dc.relation.allauthors Chiruzzo Luis, Ritter Alan, Wang Lu (eds.) en
dc.relation.conferencedate 29/04–04/05/2025 en
dc.relation.conferencename NAACL 2025 - Annual Conference of the Nations of the Americas Chapter. Findings of the Association for Computational Linguistics en
dc.relation.conferenceplace Albuquerque, New Mexico en
dc.relation.firstpage 6646 en
dc.relation.ispartofbook NAACL 2025 Findings proceedings en
dc.relation.lastpage 6660 en
dc.relation.medium ELECTRONIC en
dc.relation.numberofpages 15 en
dc.subject.keywordseng Large Language Models, Italia LLM, Vocabulary Adaptation -
dc.subject.singlekeyword Large Language Models *
dc.subject.singlekeyword Italia LLM *
dc.subject.singlekeyword Vocabulary Adaptation *
dc.title Optimizing LLMs for Italian: reducing token fertility and enhancing efficiency through vocabulary adaptation en
dc.type.driver info:eu-repo/semantics/conferenceObject -
dc.type.full 04 Contributo in convegno::04.01 Contributo in Atti di convegno it
dc.type.miur 273 -
iris.mediafilter.data 2025/09/02 04:01:00 *
iris.orcid.lastModifiedDate 2026/04/20 14:58:40 *
iris.orcid.lastModifiedMillisecond 1776689920253 *
iris.scopus.extIssued 2025 -
iris.scopus.extTitle Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation -
iris.scopus.ideLinkStatusDate 2026/04/20 14:58:40 *
iris.scopus.ideLinkStatusMillisecond 1776689920286 *
iris.sitodocente.maxattempts 1 -
iris.unpaywall.bestoaversion publishedVersion *
iris.unpaywall.doi 10.18653/v1/2025.findings-naacl.371 *
iris.unpaywall.isoa true *
iris.unpaywall.journalisindoaj false *
iris.unpaywall.landingpage https://doi.org/10.18653/v1/2025.findings-naacl.371 *
iris.unpaywall.license cc-by *
iris.unpaywall.metadataCallLastModified 28/04/2026 05:03:33 -
iris.unpaywall.metadataCallLastModifiedMillisecond 1777345413219 -
iris.unpaywall.oastatus gold *
iris.unpaywall.pdfurl https://aclanthology.org/2025.findings-naacl.371.pdf *
scopus.category 1712 *
scopus.category 1710 *
scopus.category 1708 *
scopus.category 1705 *
scopus.contributor.affiliation Sapienza University of Rome -
scopus.contributor.affiliation ISTI-CNR -
scopus.contributor.affiliation Sapienza University of Rome -
scopus.contributor.affiliation Babelscape -
scopus.contributor.affiliation Sapienza University of Rome -
scopus.contributor.affiliation ILC-CNR -
scopus.contributor.affiliation ILC-CNR -
scopus.contributor.affiliation ISTI-CNR -
scopus.contributor.affiliation Sapienza University of Rome -
scopus.contributor.afid 60032350 -
scopus.contributor.afid 60085207 -
scopus.contributor.afid 60032350 -
scopus.contributor.afid 60355563 -
scopus.contributor.afid 60032350 -
scopus.contributor.afid 100821753 -
scopus.contributor.afid 100821753 -
scopus.contributor.afid 60085207 -
scopus.contributor.afid 60032350 -
scopus.contributor.auid 59505682200 -
scopus.contributor.auid 57220748419 -
scopus.contributor.auid 57222466385 -
scopus.contributor.auid 58789690500 -
scopus.contributor.auid 57220543663 -
scopus.contributor.auid 57211678681 -
scopus.contributor.auid 57540567000 -
scopus.contributor.auid 15044356100 -
scopus.contributor.auid 6507102454 -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.name Luca -
scopus.contributor.name Giovanni -
scopus.contributor.name Pere-Lluis Huguet -
scopus.contributor.name Andrei Stefan -
scopus.contributor.name Edoardo -
scopus.contributor.name Alessio -
scopus.contributor.name Felice -
scopus.contributor.name Andrea -
scopus.contributor.name Roberto -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.surname Moroni -
scopus.contributor.surname Puccetti -
scopus.contributor.surname Cabot -
scopus.contributor.surname Bejgu -
scopus.contributor.surname Barba -
scopus.contributor.surname Miaschi -
scopus.contributor.surname Dell’Orletta -
scopus.contributor.surname Esuli -
scopus.contributor.surname Navigli -
scopus.date.issued 2025 *
scopus.description.abstracteng The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7B-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multi-choice and generative tasks. *
scopus.description.allpeopleoriginal Moroni L.; Puccetti G.; Cabot P.-L.H.; Bejgu A.S.; Barba E.; Miaschi A.; Dell'Orletta F.; Esuli A.; Navigli R. *
scopus.differences scopus.publisher.name *
scopus.differences scopus.relation.lastpage *
scopus.differences scopus.relation.conferencedate *
scopus.differences scopus.relation.firstpage *
scopus.differences scopus.description.allpeopleoriginal *
scopus.differences scopus.description.abstracteng *
scopus.differences scopus.relation.conferencename *
scopus.differences scopus.identifier.isbn *
scopus.differences scopus.relation.conferenceplace *
scopus.document.type cp *
scopus.document.types cp *
scopus.funding.funders 100031060 - European High Performance Computing Joint Undertaking; 501100000780 - European Commission; 501100003407 - Ministero dell’Istruzione, dell’Università e della Ricerca; 501100003407 - Ministero dell’Istruzione, dell’Università e della Ricerca; *
scopus.funding.ids HP10CY9V7K; 2022EPTPJ9; CUP B53C22001770006; PNRR - PRIN 2022; *
scopus.identifier.doi 10.18653/v1/2025.findings-naacl.371 *
scopus.identifier.isbn 9798891761957 *
scopus.identifier.pui 650048217 *
scopus.identifier.scopus 2-s2.0-105028681852 *
scopus.journal.sourceid 21101390114 *
scopus.language.iso eng *
scopus.publisher.name Association for Computational Linguistics (ACL) *
scopus.relation.conferencedate 2025 *
scopus.relation.conferencename 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, NAACL 2025 *
scopus.relation.conferenceplace usa *
scopus.relation.firstpage 6661 *
scopus.relation.lastpage 6675 *
scopus.title Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation *
scopus.titleeng Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation *
Appears in types: 04.01 Contributo in Atti di convegno
Files in this item:
File: 2025.findings-naacl.371.pdf (open access)
Description: Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation
Type: Publisher's version (PDF)
License: Creative Commons
Size: 905 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/552066
Citations
  • PMC: ND
  • Scopus: 4
  • ISI: ND