CNR Institutional Research Information System

Italian word embeddings for the medical domain

Cardillo F. A.;Debole F.

2024

Abstract

Neural word embeddings have proven valuable in the development of medical applications. However, for the Italian language, there are no publicly available corpora, embeddings, or evaluation resources tailored to this domain. In this paper, we introduce an Italian corpus for the medical domain that includes texts from Wikipedia, medical journals, drug leaflets, and specialized websites. Using this corpus, we generate neural word embeddings from scratch. These embeddings are then evaluated using standard evaluation resources that we translated into Italian by exploiting the concept graph in the UMLS Metathesaurus. Despite the relatively small size of the corpus, our experimental results indicate that the new embeddings correlate well with human judgments regarding the similarity and the relatedness of medical concepts. Moreover, these medical-specific embeddings outperform a baseline model trained on the full Wikipedia corpus, which includes the medical pages we used. We believe that our embeddings and the newly introduced textual resources will foster further advancements in the field of Italian medical Natural Language Processing.
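
The evaluation described in the abstract (comparing embedding similarities with human judgments) can be illustrated with a minimal sketch. It assumes cosine similarity between word vectors and Spearman rank correlation against human ratings, a common choice for similarity and relatedness benchmarks; this record does not state which coefficient or similarity measure the authors used. The vectors, word pairs, and ratings below are hypothetical toy values, not data from the paper, and the function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical 3-dimensional "embeddings" for four Italian medical terms.
embeddings = {
    "cuore": [0.9, 0.1, 0.0],
    "miocardio": [0.8, 0.2, 0.1],
    "polmone": [0.3, 0.8, 0.1],
    "frattura": [0.0, 0.1, 0.9],
}

# Hypothetical human similarity ratings for term pairs (higher = more similar).
rated_pairs = [
    ("cuore", "miocardio", 3.8),
    ("cuore", "polmone", 2.5),
    ("cuore", "frattura", 1.0),
    ("polmone", "frattura", 1.2),
]

model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in rated_pairs]
human_scores = [rating for _, _, rating in rated_pairs]
rho = spearman(model_scores, human_scores)
print(f"Spearman correlation with human ratings: {rho:.3f}")
```

A rank-based coefficient is used in this sketch because human ratings and cosine similarities live on different scales; only the agreement of the two orderings matters.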

Full record (DC)

DC field	Value	Language
dc.authority.orgunit	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	en
dc.authority.orgunit	Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI	en
dc.authority.people	Cardillo F. A.	en
dc.authority.people	Debole F.	en
dc.authority.project	corda__h2020::a0520286b7e6ad6e9c871d7ab8f2c196	en
dc.authority.project	corda__h2020::b9871e3e08a9db98aaa42bf321ed0f1a	en
dc.authority.project	corda_____he::86c21b1aa82d5bdc53411947d7ebd9f8	en
dc.collection.id.s	71c7200a-7c5f-4e83-8d57-d3d2ba88f40d	*
dc.collection.name	04.01 Contributo in Atti di convegno	*
dc.contributor.appartenenza	Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI	*
dc.contributor.appartenenza	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	*
dc.contributor.appartenenza.mi	918	*
dc.contributor.appartenenza.mi	973	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.date.accessioned	2024/10/08 16:33:55	-
dc.date.available	2024/10/08 16:33:55	-
dc.date.firstsubmission	2024/10/06 21:13:17	*
dc.date.issued	2024	-
dc.date.submission	2025/02/27 10:59:57	*
dc.description.abstracteng	Neural word embeddings have proven valuable in the development of medical applications. However, for the Italian language, there are no publicly available corpora, embeddings, or evaluation resources tailored to this domain. In this paper, we introduce an Italian corpus for the medical domain, that includes texts from Wikipedia, medical journals, drug leaflets, and specialized websites. Using this corpus, we generate neural word embeddings from scratch. These embeddings are then evaluated using standard evaluation resources, that we translated into Italian exploiting the concept graph in the UMLS Metathesaurus. Despite the relatively small size of the corpus, our experimental results indicate that the new embeddings correlate well with human judgments regarding the similarity and the relatedness of medical concepts. Moreover, these medical-specific embeddings outperform a baseline model trained on the full Wikipedia corpus, which includes the medical pages we used. We believe that our embeddings and the newly introduced textual resources will foster further advancements in the field of Italian medical Natural Language Processing.	-
dc.description.allpeople	Cardillo, F. A.; Debole, F.	-
dc.description.allpeopleoriginal	Cardillo F.A.; Debole F.	en
dc.description.fulltext	open	en
dc.description.international	no	en
dc.description.numberofauthors	2	-
dc.identifier.isbn	978-2-493814-10-4	en
dc.identifier.isi	WOS:001612977900047	-
dc.identifier.scopus	2-s2.0-85195995799	en
dc.identifier.source	manual	*
dc.identifier.uri	https://hdl.handle.net/20.500.14243/505144	-
dc.identifier.url	https://aclanthology.org/2024.lrec-main.824	en
dc.language.iso	eng	en
dc.relation.conferencedate	20-25/05/2024	en
dc.relation.conferencename	LREC-COLING 2024 - 24th Joint International Conference on Computational Linguistics, Language Resources and Evaluation	en
dc.relation.conferenceplace	Torino, Italy	en
dc.relation.firstpage	9434	en
dc.relation.ispartofbook	Proceedings of the LREC-COLING 2024	en
dc.relation.lastpage	9440	en
dc.relation.medium	ELETTRONICO	en
dc.relation.numberofpages	7	en
dc.relation.projectAcronym	DeepHealth	en
dc.relation.projectAcronym	TAILOR	en
dc.relation.projectAcronym	STARWARS	en
dc.relation.projectAwardNumber	825111	en
dc.relation.projectAwardNumber	952215	en
dc.relation.projectAwardNumber	101086252	en
dc.relation.projectAwardTitle	Deep-Learning and HPC to Boost Biomedical Applications for Health	en
dc.relation.projectAwardTitle	Foundations of Trustworthy AI - Integrating Reasoning, Learning and Optimization	en
dc.relation.projectAwardTitle	STormwAteR and WastewAteR networkS heterogeneous data AI-driven management	en
dc.relation.projectFunderName	European Commission	en
dc.relation.projectFunderName	European Commission	en
dc.relation.projectFunderName	European Commission	en
dc.relation.projectFundingStream	Horizon 2020 Framework Programme	en
dc.relation.projectFundingStream	Horizon 2020 Framework Programme	en
dc.relation.projectFundingStream	Horizon Europe Framework Programme	en
dc.subject.keywordseng	NLP	-
dc.subject.keywordseng	Distributed Representations	-
dc.subject.singlekeyword	NLP	*
dc.subject.singlekeyword	Distributed Representations	*
dc.title	Italian word embeddings for the medical domain	en
dc.type.circulation	Internazionale	en
dc.type.driver	info:eu-repo/semantics/conferenceObject	-
dc.type.full	04 Contributo in convegno::04.01 Contributo in Atti di convegno	it
dc.type.invited	contributo	en
dc.type.miur	273	-
dc.type.referee	Comitato scientifico	en
iris.isi.ideLinkStatusDate	2026/06/10 17:03:09	*
iris.isi.ideLinkStatusMillisecond	1781103789149	*
iris.isi.metadataErrorDescription	0	-
iris.isi.metadataErrorType	ERROR_NO_MATCH	-
iris.isi.metadataStatus	ERROR	-
iris.mediafilter.data	2025/04/06 02:55:22	*
iris.orcid.lastModifiedDate	2026/06/10 17:03:09	*
iris.orcid.lastModifiedMillisecond	1781103789121	*
iris.scopus.extIssued	2024	-
iris.scopus.extTitle	Italian Word Embeddings for the Medical Domain	-
iris.sitodocente.maxattempts	1	-
scopus.category	2614	*
scopus.category	1706	*
scopus.category	1703	*
scopus.contributor.affiliation	Institute of Information Science and Technologies Consiglio Nazionale delle Ricerche	-
scopus.contributor.affiliation	Institute of Information Science and Technologies Consiglio Nazionale delle Ricerche	-
scopus.contributor.afid	60008218	-
scopus.contributor.afid	60008218	-
scopus.contributor.auid	57191090133	-
scopus.contributor.auid	22333451000	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.dptid	131356648	-
scopus.contributor.dptid	131356648	-
scopus.contributor.name	Franco Alberto	-
scopus.contributor.name	Franca	-
scopus.contributor.subaffiliation	Institute for Computational Linguistics;	-
scopus.contributor.subaffiliation	Institute for Computational Linguistics;	-
scopus.contributor.surname	Cardillo	-
scopus.contributor.surname	Debole	-
scopus.date.issued	2024	*
scopus.description.abstracteng	Neural word embeddings have proven valuable in the development of medical applications. However, for the Italian language, there are no publicly available corpora, embeddings, or evaluation resources tailored to this domain. In this paper, we introduce an Italian corpus for the medical domain, that includes texts from Wikipedia, medical journals, drug leaflets, and specialized websites. Using this corpus, we generate neural word embeddings from scratch. These embeddings are then evaluated using standard evaluation resources, that we translated into Italian exploiting the concept graph in the UMLS Metathesaurus. Despite the relatively small size of the corpus, our experimental results indicate that the new embeddings correlate well with human judgments regarding the similarity and the relatedness of medical concepts. Moreover, these medical-specific embeddings outperform a baseline model trained on the full Wikipedia corpus, which includes the medical pages we used. We believe that our embeddings and the newly introduced textual resources will foster further advancements in the field of Italian medical Natural Language Processing.	*
scopus.description.allpeopleoriginal	Cardillo F.A.; Debole F.	*
scopus.differences	scopus.relation.conferencename	*
scopus.differences	scopus.publisher.name	*
scopus.differences	scopus.relation.conferencedate	*
scopus.differences	scopus.identifier.isbn	*
scopus.differences	scopus.relation.conferenceplace	*
scopus.document.type	cp	*
scopus.document.types	cp	*
scopus.funding.funders	501100004462 - Consiglio Nazionale delle Ricerche; 100010661 - Horizon 2020 Framework Programme;	*
scopus.funding.ids	101086252;	*
scopus.identifier.isbn	9782493814104	*
scopus.identifier.pui	644494424	*
scopus.identifier.scopus	2-s2.0-85195995799	*
scopus.journal.sourceid	21101227955	*
scopus.language.iso	eng	*
scopus.publisher.name	European Language Resources Association (ELRA)	*
scopus.relation.conferencedate	2024	*
scopus.relation.conferencename	Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024	*
scopus.relation.conferenceplace	ita	*
scopus.relation.firstpage	9434	*
scopus.relation.lastpage	9440	*
scopus.title	Italian Word Embeddings for the Medical Domain	*
scopus.titleeng	Italian Word Embeddings for the Medical Domain	*
Appears in types:	04.01 Contributo in Atti di convegno

Files in this item:

File	Size	Format
2024.lrec-main.824-2.pdf (open access; description: Paper in proceedings; type: Publisher's version (PDF); license: Creative Commons)	271.43 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/505144
