Cross-lingual distillation for domain knowledge transfer with sentence transformers

Piperno R.; Bacco L.; Dell'Orletta F.; Merone M.; Pecchia L.
2025

Abstract

Recent advancements in Natural Language Processing (NLP) have substantially enhanced language understanding. However, non-English languages, especially in specialized and low-resource domains like biomedicine, remain largely underrepresented. Bridging this gap is essential for promoting inclusivity and expanding the global applicability of NLP technologies. This study presents a cross-lingual knowledge distillation framework that utilizes sentence transformers to improve domain-specific NLP capabilities in non-English languages. Specifically, the framework focuses on biomedical text classification tasks. By aligning sentence embeddings between a teacher model trained on English biomedical corpora and a multilingual student model, the proposed method effectively transfers both domain-specific and task-specific knowledge. This alignment allows the student model to efficiently process and adapt to biomedical texts in Spanish, French, and German, particularly in low-resource settings with limited tuning data. Extensive experiments with domain-adapted models like BioBERT and multilingual BERT with machine-translated text pairs demonstrate substantial performance improvements in downstream biomedical NLP tasks. The proposed framework proves highly effective in scenarios characterized by limited training data availability. The results highlight the scalability and effectiveness of this approach, facilitating the development of robust multilingual models tailored to the biomedical domain, thus advancing global accessibility and impact in biomedical NLP applications.
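The embedding alignment described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear "student", the hand-made feature vectors, the two-dimensional "teacher" embeddings, and the mean-squared-error objective are all assumptions chosen so the example runs with the Python standard library alone. It shows only the general idea that a student is trained so that both an English sentence and its translation map to the frozen teacher's embedding of the English sentence.

```python
import random

def student_embed(weights, features):
    """Linear stand-in for a student encoder: one weight row per output dimension."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def alignment_loss(weights, pairs):
    """Mean squared error between student outputs and frozen teacher embeddings."""
    total = 0.0
    for features, target in pairs:
        pred = student_embed(weights, features)
        total += sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    return total / len(pairs)

def distill(pairs, dim_in, dim_out, lr=0.1, epochs=200, seed=0):
    """Fit the student by per-example gradient descent on the MSE alignment loss."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim_in)] for _ in range(dim_out)]
    initial = [row[:] for row in weights]  # copy kept to report the starting loss
    for _ in range(epochs):
        for features, target in pairs:
            pred = student_embed(weights, features)
            for i in range(dim_out):
                # Gradient of the per-example MSE with respect to output i.
                scale = 2.0 * (pred[i] - target[i]) / dim_out
                for j in range(dim_in):
                    weights[i][j] -= lr * scale * features[j]
    return initial, weights

# Hypothetical frozen teacher embeddings for two English sentences.
teacher_en = {"s1": [1.0, 0.0], "s2": [0.0, 1.0]}

# Hypothetical student input features: [is_en, is_es, topic_a, topic_b].
student_inputs = {
    ("s1", "en"): [1.0, 0.0, 1.0, 0.0],
    ("s1", "es"): [0.0, 1.0, 1.0, 0.0],  # stands in for a machine translation of s1
    ("s2", "en"): [1.0, 0.0, 0.0, 1.0],
    ("s2", "es"): [0.0, 1.0, 0.0, 1.0],  # stands in for a machine translation of s2
}

# Every student input, whatever its language, is pulled toward the teacher's
# embedding of the English version of the same sentence.
pairs = [(feats, teacher_en[sid]) for (sid, _lang), feats in student_inputs.items()]

initial_weights, trained_weights = distill(pairs, dim_in=4, dim_out=2)
loss_before = alignment_loss(initial_weights, pairs)
loss_after = alignment_loss(trained_weights, pairs)
print(f"alignment loss before: {loss_before:.4f}, after: {loss_after:.2e}")
```

In a realistic setting the lookup table would be replaced by a domain-adapted English encoder and the linear map by a multilingual transformer. The training signal in this sketch keeps the same shape: translated text pairs paired with teacher embeddings of the English side.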
DC Field Value Language
dc.authority.ancejournal KNOWLEDGE-BASED SYSTEMS en
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Piperno R. en
dc.authority.people Bacco L. en
dc.authority.people Dell'Orletta F. en
dc.authority.people Merone M. en
dc.authority.people Pecchia L. en
dc.collection.id.s b3f88f24-048a-4e43-8ab1-6697b90e068e *
dc.collection.name 01.01 Articolo in rivista *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.area Non assegn *
dc.date.accessioned 2026/03/03 15:03:33 -
dc.date.available 2026/03/03 15:03:33 -
dc.date.firstsubmission 2026/03/02 18:41:58 *
dc.date.issued 2025 -
dc.date.submission 2026/03/02 18:41:58 *
dc.description.allpeople Piperno, R.; Bacco, L.; Dell'Orletta, F.; Merone, M.; Pecchia, L. -
dc.description.allpeopleoriginal Piperno R.; Bacco L.; Dell'Orletta F.; Merone M.; Pecchia L. en
dc.description.fulltext open en
dc.description.international no en
dc.description.numberofauthors 5 -
dc.identifier.doi 10.1016/j.knosys.2025.113079 en
dc.identifier.isi WOS:001424671900001 -
dc.identifier.scopus 2-s2.0-85217037543 en
dc.identifier.source scopus *
dc.identifier.uri https://hdl.handle.net/20.500.14243/570481 -
dc.language.iso eng en
dc.relation.volume 311 en
dc.subject.keywordseng Biomedical domain -
dc.subject.keywordseng Cross-lingual learning -
dc.subject.keywordseng Domain adaptation -
dc.subject.keywordseng Knowledge distillation -
dc.subject.keywordseng Sentence transformers -
dc.subject.singlekeyword Biomedical domain *
dc.subject.singlekeyword Cross-lingual learning *
dc.subject.singlekeyword Domain adaptation *
dc.subject.singlekeyword Knowledge distillation *
dc.subject.singlekeyword Sentence transformers *
dc.title Cross-lingual distillation for domain knowledge transfer with sentence transformers en
dc.type.driver info:eu-repo/semantics/article -
dc.type.full 01 Contributo su Rivista::01.01 Articolo in rivista it
dc.type.miur 262 -
iris.isi.extIssued 2025 -
iris.isi.extTitle Cross-lingual distillation for domain knowledge transfer with sentence transformers -
iris.mediafilter.data 2026/03/04 02:52:21 *
iris.orcid.lastModifiedDate 2026/03/04 01:09:50 *
iris.orcid.lastModifiedMillisecond 1772582990656 *
iris.scopus.extIssued 2025 -
iris.scopus.extTitle Cross-lingual distillation for domain knowledge transfer with sentence transformers -
iris.sitodocente.maxattempts 1 -
iris.unpaywall.bestoahost publisher *
iris.unpaywall.bestoaversion publishedVersion *
iris.unpaywall.doi 10.1016/j.knosys.2025.113079 *
iris.unpaywall.hosttype publisher *
iris.unpaywall.isoa true *
iris.unpaywall.journalisindoaj false *
iris.unpaywall.landingpage https://doi.org/10.1016/j.knosys.2025.113079 *
iris.unpaywall.license cc-by *
iris.unpaywall.metadataCallLastModified 04/03/2026 04:34:02 -
iris.unpaywall.metadataCallLastModifiedMillisecond 1772595242121 -
iris.unpaywall.oastatus hybrid *
isi.authority.ancejournal KNOWLEDGE-BASED SYSTEMS###0950-7051 *
isi.category EP *
isi.contributor.affiliation Consiglio Nazionale delle Ricerche (CNR) -
isi.contributor.affiliation Consiglio Nazionale delle Ricerche (CNR) -
isi.contributor.affiliation Consiglio Nazionale delle Ricerche (CNR) -
isi.contributor.affiliation University Campus Bio-Medico - Rome Italy -
isi.contributor.affiliation University Campus Bio-Medico - Rome Italy -
isi.contributor.country Italy -
isi.contributor.country Italy -
isi.contributor.country Italy -
isi.contributor.country Italy -
isi.contributor.country Italy -
isi.contributor.name Ruben -
isi.contributor.name Luca -
isi.contributor.name Felice -
isi.contributor.name Mario -
isi.contributor.name Leandro -
isi.contributor.researcherId MGL-9541-2025 -
isi.contributor.researcherId AHA-7493-2022 -
isi.contributor.researcherId AAX-1864-2020 -
isi.contributor.researcherId AAA-8945-2019 -
isi.contributor.researcherId AAF-7325-2019 -
isi.contributor.subaffiliation Natl Res Council -
isi.contributor.subaffiliation Natl Res Council -
isi.contributor.subaffiliation Natl Res Council -
isi.contributor.subaffiliation Dept Engn -
isi.contributor.subaffiliation Dept Engn -
isi.contributor.surname Piperno -
isi.contributor.surname Bacco -
isi.contributor.surname Dell'Orletta -
isi.contributor.surname Merone -
isi.contributor.surname Pecchia -
isi.date.issued 2025 *
isi.description.allpeopleoriginal Piperno, R; Bacco, L; Dell'Orletta, F; Merone, M; Pecchia, L; *
isi.document.sourcetype WOS.SCI *
isi.document.type Article *
isi.document.types Article *
isi.identifier.doi 10.1016/j.knosys.2025.113079 *
isi.identifier.eissn 1872-7409 *
isi.identifier.isi WOS:001424671900001 *
isi.journal.journaltitle KNOWLEDGE-BASED SYSTEMS *
isi.journal.journaltitleabbrev KNOWL-BASED SYST *
isi.language.original English *
isi.publisher.place RADARWEG 29, 1043 NX AMSTERDAM, NETHERLANDS *
isi.relation.volume 311 *
isi.title Cross-lingual distillation for domain knowledge transfer with sentence transformers *
scopus.authority.ancejournal KNOWLEDGE-BASED SYSTEMS###0950-7051 *
scopus.category 1404 *
scopus.category 1712 *
scopus.category 1802 *
scopus.category 1702 *
scopus.contributor.affiliation Università Campus Bio-Medico di Roma -
scopus.contributor.affiliation Università Campus Bio-Medico di Roma -
scopus.contributor.affiliation National Research Council -
scopus.contributor.affiliation Università Campus Bio-Medico di Roma -
scopus.contributor.affiliation Fondazione Policlinico Universitario Campus Bio-Medico di Roma -
scopus.contributor.afid 60005308 -
scopus.contributor.afid 60005308 -
scopus.contributor.afid 60021199 -
scopus.contributor.afid 60005308 -
scopus.contributor.afid 60276021 -
scopus.contributor.auid 59544561300 -
scopus.contributor.auid 57220927387 -
scopus.contributor.auid 57540567000 -
scopus.contributor.auid 56102657200 -
scopus.contributor.auid 35746897300 -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.dptid 116307659 -
scopus.contributor.dptid 116307659 -
scopus.contributor.dptid 121833164 -
scopus.contributor.dptid 116307659 -
scopus.contributor.dptid -
scopus.contributor.name Ruben -
scopus.contributor.name Luca -
scopus.contributor.name Felice -
scopus.contributor.name Mario -
scopus.contributor.name Leandro -
scopus.contributor.subaffiliation Research Unit of Intelligent Technology for Health and Wellbeing;Department of Engineering; -
scopus.contributor.subaffiliation Research Unit of Computer Systems and Bioinformatics;Department of Engineering; -
scopus.contributor.subaffiliation ItaliaNLP Lab;Institute of Computational Linguistics "Antonio Zampolli"; -
scopus.contributor.subaffiliation Research Unit of Intelligent Technology for Health and Wellbeing;Department of Engineering; -
scopus.contributor.subaffiliation -
scopus.contributor.surname Piperno -
scopus.contributor.surname Bacco -
scopus.contributor.surname Dell'Orletta -
scopus.contributor.surname Merone -
scopus.contributor.surname Pecchia -
scopus.date.issued 2025 *
scopus.description.allpeopleoriginal Piperno R.; Bacco L.; Dell'Orletta F.; Merone M.; Pecchia L. *
scopus.differences scopus.subject.keywords *
scopus.document.type ar *
scopus.document.types ar *
scopus.identifier.doi 10.1016/j.knosys.2025.113079 *
scopus.identifier.pui 2037351665 *
scopus.identifier.scopus 2-s2.0-85217037543 *
scopus.journal.sourceid 24772 *
scopus.language.iso eng *
scopus.publisher.name Elsevier B.V. *
scopus.relation.article 113079 *
scopus.relation.volume 311 *
scopus.subject.keywords Biomedical domain; Cross-lingual learning; Domain adaptation; Knowledge distillation; Sentence transformers; *
scopus.title Cross-lingual distillation for domain knowledge transfer with sentence transformers *
scopus.titleeng Cross-lingual distillation for domain knowledge transfer with sentence transformers *
Appears in types: 01.01 Articolo in rivista (journal article)
Files in this item:
File Size Format
1-s2.0-S0950705125001261-main.pdf

open access

License: Creative Commons
Size: 2.48 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/570481
Citations
  • PubMed Central: N/A
  • Scopus: 8
  • Web of Science: 5