CNR Institutional Research Information System

Cross-lingual distillation for domain knowledge transfer with sentence transformers

Piperno R.;Bacco L.;Dell'Orletta F.;Merone M.;Pecchia L.

2025

Abstract

Recent advancements in Natural Language Processing (NLP) have substantially enhanced language understanding. However, non-English languages, especially in specialized and low-resource domains like biomedicine, remain largely underrepresented. Bridging this gap is essential for promoting inclusivity and expanding the global applicability of NLP technologies. This study presents a cross-lingual knowledge distillation framework that utilizes sentence transformers to improve domain-specific NLP capabilities in non-English languages. Specifically, the framework focuses on biomedical text classification tasks. By aligning sentence embeddings between a teacher model trained on English biomedical corpora and a multilingual student model, the proposed method effectively transfers both domain-specific and task-specific knowledge. This alignment allows the student model to efficiently process and adapt to biomedical texts in Spanish, French, and German, particularly in low-resource settings with limited tuning data. Extensive experiments with domain-adapted models like BioBERT and multilingual BERT with machine-translated text pairs demonstrate substantial performance improvements in downstream biomedical NLP tasks. The proposed framework proves highly effective in scenarios characterized by limited training data availability. The results highlight the scalability and effectiveness of this approach, facilitating the development of robust multilingual models tailored to the biomedical domain, thus advancing global accessibility and impact in biomedical NLP applications.
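As a rough illustration of the embedding alignment the abstract describes, the sketch below computes a distillation objective in which a student's embeddings of an English sentence and of its translation are both pulled toward the teacher's embedding of the English sentence. The abstract does not state the loss function; mean squared error is an assumption here, chosen because it is common in multilingual sentence-embedding distillation. The function names (`mse`, `alignment_loss`) and the toy vectors are hypothetical stand-ins for real model outputs.

```python
def mse(a, b):
    """Mean squared error between two equally shaped lists of vectors."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def alignment_loss(teacher_en, student_en, student_tgt):
    """Assumed objective: the student matches the teacher's English embedding
    both on the English sentence and on its (machine-)translated counterpart."""
    return mse(student_en, teacher_en) + mse(student_tgt, teacher_en)


# Toy 3-dimensional embeddings for two sentence pairs (made-up numbers).
teacher_en = [[0.2, 0.5, -0.1], [0.9, -0.3, 0.4]]
student_en = [[0.2, 0.5, -0.1], [0.9, -0.3, 0.4]]    # already aligned
student_tgt = [[0.0, 0.5, -0.1], [0.9, -0.3, 0.0]]   # translation side drifts

aligned = alignment_loss(teacher_en, teacher_en, teacher_en)
drifted = alignment_loss(teacher_en, student_en, student_tgt)
print(aligned, drifted)
```

In a real setup the vectors would come from a frozen English teacher and a trainable multilingual student, and this loss would be minimized over translated sentence pairs; the loss is zero only when both student views coincide with the teacher.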

Full record (DC)

DC field	Value	Language
dc.authority.ancejournal	KNOWLEDGE-BASED SYSTEMS	en
dc.authority.orgunit	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	en
dc.authority.people	Piperno R.	en
dc.authority.people	Bacco L.	en
dc.authority.people	Dell'Orletta F.	en
dc.authority.people	Merone M.	en
dc.authority.people	Pecchia L.	en
dc.collection.id.s	b3f88f24-048a-4e43-8ab1-6697b90e068e	*
dc.collection.name	01.01 Articolo in rivista	*
dc.contributor.appartenenza	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	*
dc.contributor.appartenenza.mi	918	*
dc.contributor.area	Non assegn	*
dc.date.accessioned	2026/03/03 15:03:33	-
dc.date.available	2026/03/03 15:03:33	-
dc.date.firstsubmission	2026/03/02 18:41:58	*
dc.date.issued	2025	-
dc.date.submission	2026/03/02 18:41:58	*
dc.description.abstracteng	Recent advancements in Natural Language Processing (NLP) have substantially enhanced language understanding. However, non-English languages, especially in specialized and low-resource domains like biomedicine, remain largely underrepresented. Bridging this gap is essential for promoting inclusivity and expanding the global applicability of NLP technologies. This study presents a cross-lingual knowledge distillation framework that utilizes sentence transformers to improve domain-specific NLP capabilities in non-English languages. Specifically, the framework focuses on biomedical text classification tasks. By aligning sentence embeddings between a teacher model trained on English biomedical corpora and a multilingual student model, the proposed method effectively transfers both domain-specific and task-specific knowledge. This alignment allows the student model to efficiently process and adapt to biomedical texts in Spanish, French, and German, particularly in low-resource settings with limited tuning data. Extensive experiments with domain-adapted models like BioBERT and multilingual BERT with machine-translated text pairs demonstrate substantial performance improvements in downstream biomedical NLP tasks. The proposed framework proves highly effective in scenarios characterized by limited training data availability. The results highlight the scalability and effectiveness of this approach, facilitating the development of robust multilingual models tailored to the biomedical domain, thus advancing global accessibility and impact in biomedical NLP applications.	-
dc.description.allpeople	Piperno, R.; Bacco, L.; Dell'Orletta, F.; Merone, M.; Pecchia, L.	-
dc.description.allpeopleoriginal	Piperno R.; Bacco L.; Dell'Orletta F.; Merone M.; Pecchia L.	en
dc.description.fulltext	open	en
dc.description.international	no	en
dc.description.numberofauthors	5	-
dc.identifier.doi	10.1016/j.knosys.2025.113079	en
dc.identifier.isi	WOS:001424671900001	-
dc.identifier.scopus	2-s2.0-85217037543	en
dc.identifier.source	scopus	*
dc.identifier.uri	https://hdl.handle.net/20.500.14243/570481	-
dc.language.iso	eng	en
dc.relation.volume	311	en
dc.subject.keywordseng	Biomedical domain	-
dc.subject.keywordseng	Cross-lingual learning	-
dc.subject.keywordseng	Domain adaptation	-
dc.subject.keywordseng	Knowledge distillation	-
dc.subject.keywordseng	Sentence transformers	-
dc.subject.singlekeyword	Biomedical domain	*
dc.subject.singlekeyword	Cross-lingual learning	*
dc.subject.singlekeyword	Domain adaptation	*
dc.subject.singlekeyword	Knowledge distillation	*
dc.subject.singlekeyword	Sentence transformers	*
dc.title	Cross-lingual distillation for domain knowledge transfer with sentence transformers	en
dc.type.driver	info:eu-repo/semantics/article	-
dc.type.full	01 Contributo su Rivista::01.01 Articolo in rivista	it
dc.type.miur	262	-
iris.isi.extIssued	2025	-
iris.isi.extTitle	Cross-lingual distillation for domain knowledge transfer with sentence transformers	-
iris.mediafilter.data	2026/03/04 02:52:21	*
iris.orcid.lastModifiedDate	2026/03/04 01:09:50	*
iris.orcid.lastModifiedMillisecond	1772582990656	*
iris.scopus.extIssued	2025	-
iris.scopus.extTitle	Cross-lingual distillation for domain knowledge transfer with sentence transformers	-
iris.sitodocente.maxattempts	1	-
iris.unpaywall.bestoahost	publisher	*
iris.unpaywall.bestoaversion	publishedVersion	*
iris.unpaywall.doi	10.1016/j.knosys.2025.113079	*
iris.unpaywall.hosttype	publisher	*
iris.unpaywall.isoa	true	*
iris.unpaywall.journalisindoaj	false	*
iris.unpaywall.landingpage	https://doi.org/10.1016/j.knosys.2025.113079	*
iris.unpaywall.license	cc-by	*
iris.unpaywall.metadataCallLastModified	04/03/2026 04:34:02	-
iris.unpaywall.metadataCallLastModifiedMillisecond	1772595242121	-
iris.unpaywall.oastatus	hybrid	*
isi.authority.ancejournal	KNOWLEDGE-BASED SYSTEMS###0950-7051	*
isi.category	EP	*
isi.contributor.affiliation	Consiglio Nazionale delle Ricerche (CNR)	-
isi.contributor.affiliation	Consiglio Nazionale delle Ricerche (CNR)	-
isi.contributor.affiliation	Consiglio Nazionale delle Ricerche (CNR)	-
isi.contributor.affiliation	University Campus Bio-Medico - Rome Italy	-
isi.contributor.affiliation	University Campus Bio-Medico - Rome Italy	-
isi.contributor.country	Italy	-
isi.contributor.country	Italy	-
isi.contributor.country	Italy	-
isi.contributor.country	Italy	-
isi.contributor.country	Italy	-
isi.contributor.name	Ruben	-
isi.contributor.name	Luca	-
isi.contributor.name	Felice	-
isi.contributor.name	Mario	-
isi.contributor.name	Leandro	-
isi.contributor.researcherId	MGL-9541-2025	-
isi.contributor.researcherId	AHA-7493-2022	-
isi.contributor.researcherId	AAX-1864-2020	-
isi.contributor.researcherId	AAA-8945-2019	-
isi.contributor.researcherId	AAF-7325-2019	-
isi.contributor.subaffiliation	Natl Res Council	-
isi.contributor.subaffiliation	Natl Res Council	-
isi.contributor.subaffiliation	Natl Res Council	-
isi.contributor.subaffiliation	Dept Engn	-
isi.contributor.subaffiliation	Dept Engn	-
isi.contributor.surname	Piperno	-
isi.contributor.surname	Bacco	-
isi.contributor.surname	Dell'Orletta	-
isi.contributor.surname	Merone	-
isi.contributor.surname	Pecchia	-
isi.date.issued	2025	*
isi.description.allpeopleoriginal	Piperno, R; Bacco, L; Dell'Orletta, F; Merone, M; Pecchia, L;	*
isi.document.sourcetype	WOS.SCI	*
isi.document.type	Article	*
isi.document.types	Article	*
isi.identifier.doi	10.1016/j.knosys.2025.113079	*
isi.identifier.eissn	1872-7409	*
isi.identifier.isi	WOS:001424671900001	*
isi.journal.journaltitle	KNOWLEDGE-BASED SYSTEMS	*
isi.journal.journaltitleabbrev	KNOWL-BASED SYST	*
isi.language.original	English	*
isi.publisher.place	RADARWEG 29, 1043 NX AMSTERDAM, NETHERLANDS	*
isi.relation.volume	311	*
isi.title	Cross-lingual distillation for domain knowledge transfer with sentence transformers	*
scopus.authority.ancejournal	KNOWLEDGE-BASED SYSTEMS###0950-7051	*
scopus.category	1404	*
scopus.category	1712	*
scopus.category	1802	*
scopus.category	1702	*
scopus.contributor.affiliation	Università Campus Bio-Medico di Roma	-
scopus.contributor.affiliation	Università Campus Bio-Medico di Roma	-
scopus.contributor.affiliation	National Research Council	-
scopus.contributor.affiliation	Università Campus Bio-Medico di Roma	-
scopus.contributor.affiliation	Fondazione Policlinico Universitario Campus Bio-Medico di Roma	-
scopus.contributor.afid	60005308	-
scopus.contributor.afid	60005308	-
scopus.contributor.afid	60021199	-
scopus.contributor.afid	60005308	-
scopus.contributor.afid	60276021	-
scopus.contributor.auid	59544561300	-
scopus.contributor.auid	57220927387	-
scopus.contributor.auid	57540567000	-
scopus.contributor.auid	56102657200	-
scopus.contributor.auid	35746897300	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.dptid	116307659	-
scopus.contributor.dptid	116307659	-
scopus.contributor.dptid	121833164	-
scopus.contributor.dptid	116307659	-
scopus.contributor.dptid		-
scopus.contributor.name	Ruben	-
scopus.contributor.name	Luca	-
scopus.contributor.name	Felice	-
scopus.contributor.name	Mario	-
scopus.contributor.name	Leandro	-
scopus.contributor.subaffiliation	Research Unit of Intelligent Technology for Health and Wellbeing;Department of Engineering;	-
scopus.contributor.subaffiliation	Research Unit of Computer Systems and Bioinformatics;Department of Engineering;	-
scopus.contributor.subaffiliation	ItaliaNLP Lab;Institute of Computational Linguistics "Antonio Zampolli";	-
scopus.contributor.subaffiliation	Research Unit of Intelligent Technology for Health and Wellbeing;Department of Engineering;	-
scopus.contributor.subaffiliation		-
scopus.contributor.surname	Piperno	-
scopus.contributor.surname	Bacco	-
scopus.contributor.surname	Dell'Orletta	-
scopus.contributor.surname	Merone	-
scopus.contributor.surname	Pecchia	-
scopus.date.issued	2025	*
scopus.description.allpeopleoriginal	Piperno R.; Bacco L.; Dell'Orletta F.; Merone M.; Pecchia L.	*
scopus.differences	scopus.subject.keywords	*
scopus.document.type	ar	*
scopus.document.types	ar	*
scopus.identifier.doi	10.1016/j.knosys.2025.113079	*
scopus.identifier.pui	2037351665	*
scopus.identifier.scopus	2-s2.0-85217037543	*
scopus.journal.sourceid	24772	*
scopus.language.iso	eng	*
scopus.publisher.name	Elsevier B.V.	*
scopus.relation.article	113079	*
scopus.relation.volume	311	*
scopus.subject.keywords	Biomedical domain; Cross-lingual learning; Domain adaptation; Knowledge distillation; Sentence transformers;	*
scopus.title	Cross-lingual distillation for domain knowledge transfer with sentence transformers	*
scopus.titleeng	Cross-lingual distillation for domain knowledge transfer with sentence transformers	*
Appears in types:	01.01 Articolo in rivista

Files in this item:

File	Size	Format
1-s2.0-S0950705125001261-main.pdf (open access; license: Creative Commons)	2.48 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/570481
