CNR Institutional Research Information System

Optimizing LLMs for Italian: reducing token fertility and enhancing efficiency through vocabulary adaptation

Moroni L.;Puccetti G.;Huguet Cabot P. -L.;Bejgu A. S.;Barba E.;Miaschi A.;Dell'Orletta F.;Esuli A.;Navigli R.

2025

Abstract

The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7B-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multi-choice and generative tasks.
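
As a rough illustration of the "fertility" measure mentioned in the abstract, the sketch below assumes fertility is the average number of tokens produced per whitespace-delimited word. The chunk_tokenizer helper is a hypothetical stand-in for a real subword tokenizer, not the tokenizers used in the paper, and tokenizing each word in isolation is a simplification.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word (assumed definition)."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        words = text.split()
        n_words += len(words)
        # Simplification: each word is tokenized on its own.
        n_tokens += sum(len(tokenize(word)) for word in words)
    if n_words == 0:
        raise ValueError("no words to measure")
    return n_tokens / n_words


def chunk_tokenizer(size: int) -> Callable[[str], List[str]]:
    """Hypothetical toy tokenizer: splits a word into fixed-size character chunks."""
    def tokenize(word: str) -> List[str]:
        return [word[i:i + size] for i in range(0, len(word), size)]
    return tokenize


sample = ["la tokenizzazione efficiente riduce il costo"]

# A coarser vocabulary (longer chunks) yields fewer tokens per word.
coarse = fertility(sample, chunk_tokenizer(4))
fine = fertility(sample, chunk_tokenizer(2))
relative_reduction = 1 - coarse / fine

print(f"fine={fine:.2f} coarse={coarse:.2f} reduction={relative_reduction:.0%}")
```

Lower fertility means fewer tokens for the same text, which is why a vocabulary better matched to Italian can shorten sequences and speed up inference.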

DC field	Value	Language
dc.authority.orgunit	Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI	en
dc.authority.people	Moroni L.	en
dc.authority.people	Puccetti G.	en
dc.authority.people	Huguet Cabot P. -L.	en
dc.authority.people	Bejgu A. S.	en
dc.authority.people	Barba E.	en
dc.authority.people	Miaschi A.	en
dc.authority.people	Dell'Orletta F.	en
dc.authority.people	Esuli A.	en
dc.authority.people	Navigli R.	en
dc.collection.id.s	71c7200a-7c5f-4e83-8d57-d3d2ba88f40d	*
dc.collection.name	04.01 Contributo in Atti di convegno	*
dc.contributor.appartenenza	Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI	*
dc.contributor.appartenenza	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	*
dc.contributor.appartenenza.mi	918	*
dc.contributor.appartenenza.mi	973	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.date.accessioned	2025/09/01 21:41:16	-
dc.date.available	2025/09/01 21:41:16	-
dc.date.firstsubmission	2025/08/26 16:58:47	*
dc.date.issued	2025	-
dc.date.submission	2025/08/26 17:12:17	*
dc.description.abstracteng	The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7B-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multi-choice and generative tasks.	-
dc.description.allpeople	Moroni, L.; Puccetti, G.; Huguet Cabot, P. -L.; Bejgu, A. S.; Barba, E.; Miaschi, A.; Dell'Orletta, F.; Esuli, A.; Navigli, R.	-
dc.description.allpeopleoriginal	Moroni L.; Puccetti G.; Huguet Cabot P.-L.; Bejgu A.S.; Barba E.; Miaschi A.; Dell'Orletta F.; Esuli A.; Navigli R.	en
dc.description.fulltext	open	en
dc.description.numberofauthors	9	-
dc.identifier.doi	10.18653/v1/2025.findings-naacl.371	en
dc.identifier.isbn	979-8-89176-195-7	en
dc.identifier.scopus	2-s2.0-105028681852	-
dc.identifier.source	bibtex	*
dc.identifier.uri	https://hdl.handle.net/20.500.14243/552066	-
dc.identifier.url	https://aclanthology.org/2025.findings-naacl.371/	en
dc.language.iso	eng	en
dc.publisher.name	Association for Computational Linguistics	en
dc.relation.allauthors	Chiruzzo Luis, Ritter Alan, Wang Lu (eds.)	en
dc.relation.conferencedate	29/04–04/05/2025	en
dc.relation.conferencename	NAACL 2025 - Annual Conference of the Nations of the Americas Chapter. Findings of the Association for Computational Linguistics	en
dc.relation.conferenceplace	Albuquerque, New Mexico	en
dc.relation.firstpage	6646	en
dc.relation.ispartofbook	NAACL 2025 Findings proceedings	en
dc.relation.lastpage	6660	en
dc.relation.medium	ELETTRONICO	en
dc.relation.numberofpages	15	en
dc.subject.keywordseng	Large Language Models, Italian LLM, Vocabulary Adaptation	-
dc.subject.singlekeyword	Large Language Models	*
dc.subject.singlekeyword	Italian LLM	*
dc.subject.singlekeyword	Vocabulary Adaptation	*
dc.title	Optimizing LLMs for Italian: reducing token fertility and enhancing efficiency through vocabulary adaptation	en
dc.type.driver	info:eu-repo/semantics/conferenceObject	-
dc.type.full	04 Contributo in convegno::04.01 Contributo in Atti di convegno	it
dc.type.miur	273	-
iris.mediafilter.data	2025/09/02 04:01:00	*
iris.orcid.lastModifiedDate	2026/04/20 14:58:40	*
iris.orcid.lastModifiedMillisecond	1776689920253	*
iris.scopus.extIssued	2025	-
iris.scopus.extTitle	Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation	-
iris.scopus.ideLinkStatusDate	2026/04/20 14:58:40	*
iris.scopus.ideLinkStatusMillisecond	1776689920286	*
iris.sitodocente.maxattempts	1	-
iris.unpaywall.bestoaversion	publishedVersion	*
iris.unpaywall.doi	10.18653/v1/2025.findings-naacl.371	*
iris.unpaywall.isoa	true	*
iris.unpaywall.journalisindoaj	false	*
iris.unpaywall.landingpage	https://doi.org/10.18653/v1/2025.findings-naacl.371	*
iris.unpaywall.license	cc-by	*
iris.unpaywall.metadataCallLastModified	28/04/2026 05:03:33	-
iris.unpaywall.metadataCallLastModifiedMillisecond	1777345413219	-
iris.unpaywall.oastatus	gold	*
iris.unpaywall.pdfurl	https://aclanthology.org/2025.findings-naacl.371.pdf	*
scopus.category	1712	*
scopus.category	1710	*
scopus.category	1708	*
scopus.category	1705	*
scopus.contributor.affiliation	Sapienza University of Rome	-
scopus.contributor.affiliation	ISTI-CNR	-
scopus.contributor.affiliation	Sapienza University of Rome	-
scopus.contributor.affiliation	Babelscape	-
scopus.contributor.affiliation	Sapienza University of Rome	-
scopus.contributor.affiliation	ILC-CNR	-
scopus.contributor.affiliation	ILC-CNR	-
scopus.contributor.affiliation	ISTI-CNR	-
scopus.contributor.affiliation	Sapienza University of Rome	-
scopus.contributor.afid	60032350	-
scopus.contributor.afid	60085207	-
scopus.contributor.afid	60032350	-
scopus.contributor.afid	60355563	-
scopus.contributor.afid	60032350	-
scopus.contributor.afid	100821753	-
scopus.contributor.afid	100821753	-
scopus.contributor.afid	60085207	-
scopus.contributor.afid	60032350	-
scopus.contributor.auid	59505682200	-
scopus.contributor.auid	57220748419	-
scopus.contributor.auid	57222466385	-
scopus.contributor.auid	58789690500	-
scopus.contributor.auid	57220543663	-
scopus.contributor.auid	57211678681	-
scopus.contributor.auid	57540567000	-
scopus.contributor.auid	15044356100	-
scopus.contributor.auid	6507102454	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.name	Luca	-
scopus.contributor.name	Giovanni	-
scopus.contributor.name	Pere-Lluis Huguet	-
scopus.contributor.name	Andrei Stefan	-
scopus.contributor.name	Edoardo	-
scopus.contributor.name	Alessio	-
scopus.contributor.name	Felice	-
scopus.contributor.name	Andrea	-
scopus.contributor.name	Roberto	-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.surname	Moroni	-
scopus.contributor.surname	Puccetti	-
scopus.contributor.surname	Cabot	-
scopus.contributor.surname	Bejgu	-
scopus.contributor.surname	Barba	-
scopus.contributor.surname	Miaschi	-
scopus.contributor.surname	Dell’Orletta	-
scopus.contributor.surname	Esuli	-
scopus.contributor.surname	Navigli	-
scopus.date.issued	2025	*
scopus.description.abstracteng	The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7B-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multi-choice and generative tasks.	*
scopus.description.allpeopleoriginal	Moroni L.; Puccetti G.; Cabot P.-L.H.; Bejgu A.S.; Barba E.; Miaschi A.; Dell'Orletta F.; Esuli A.; Navigli R.	*
scopus.differences	scopus.publisher.name	*
scopus.differences	scopus.relation.lastpage	*
scopus.differences	scopus.relation.conferencedate	*
scopus.differences	scopus.relation.firstpage	*
scopus.differences	scopus.description.allpeopleoriginal	*
scopus.differences	scopus.description.abstracteng	*
scopus.differences	scopus.relation.conferencename	*
scopus.differences	scopus.identifier.isbn	*
scopus.differences	scopus.relation.conferenceplace	*
scopus.document.type	cp	*
scopus.document.types	cp	*
scopus.funding.funders	100031060 - European High Performance Computing Joint Undertaking; 501100000780 - European Commission; 501100003407 - Ministero dell’Istruzione, dell’Università e della Ricerca; 501100003407 - Ministero dell’Istruzione, dell’Università e della Ricerca;	*
scopus.funding.ids	HP10CY9V7K; 2022EPTPJ9; CUP B53C22001770006; PNRR - PRIN 2022;	*
scopus.identifier.doi	10.18653/v1/2025.findings-naacl.371	*
scopus.identifier.isbn	9798891761957	*
scopus.identifier.pui	650048217	*
scopus.identifier.scopus	2-s2.0-105028681852	*
scopus.journal.sourceid	21101390114	*
scopus.language.iso	eng	*
scopus.publisher.name	Association for Computational Linguistics (ACL)	*
scopus.relation.conferencedate	2025	*
scopus.relation.conferencename	2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, NAACL 2025	*
scopus.relation.conferenceplace	usa	*
scopus.relation.firstpage	6661	*
scopus.relation.lastpage	6675	*
scopus.title	Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation	*
scopus.titleeng	Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation	*
Appears in types:	04.01 Contributo in Atti di convegno

Files in this item:

File	Size	Format
2025.findings-naacl.371.pdf (open access; description: Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation; type: publisher's version (PDF); license: Creative Commons)	905 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/552066
