Italian word embeddings for the medical domain

Cardillo F. A.; Debole F.
2024

Abstract

Neural word embeddings have proven valuable in the development of medical applications. For Italian, however, there are no publicly available corpora, embeddings, or evaluation resources tailored to this domain. In this paper, we introduce an Italian corpus for the medical domain that includes texts from Wikipedia, medical journals, drug leaflets, and specialized websites. Using this corpus, we train neural word embeddings from scratch. We then evaluate these embeddings on standard evaluation resources, which we translated into Italian by exploiting the concept graph in the UMLS Metathesaurus. Despite the relatively small size of the corpus, our experimental results indicate that the new embeddings correlate well with human judgments of the similarity and relatedness of medical concepts. Moreover, these domain-specific embeddings outperform a baseline model trained on the full Wikipedia corpus, which includes the medical pages we used. We believe that our embeddings and the newly introduced textual resources will foster further advances in Italian medical Natural Language Processing.
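The abstract reports that the embeddings correlate well with human judgments of similarity and relatedness. The sketch below illustrates how such an evaluation is commonly set up. It rests on assumptions not stated in the abstract: cosine similarity as the model score, Spearman rank correlation as the agreement measure, and skipping of pairs with out-of-vocabulary terms. The vectors, term pairs, and ratings are invented toy values, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(embeddings, rated_pairs):
    """Correlate model similarities with human ratings.

    rated_pairs holds (term1, term2, human_score) tuples. Pairs with a
    term missing from the vocabulary are skipped (an assumption here).
    Returns (spearman_rho, number_of_pairs_used).
    """
    model_scores, human_scores = [], []
    for t1, t2, score in rated_pairs:
        if t1 in embeddings and t2 in embeddings:
            model_scores.append(cosine(embeddings[t1], embeddings[t2]))
            human_scores.append(score)
    return spearman(model_scores, human_scores), len(human_scores)


# Invented toy data, for illustration only (not from the paper).
toy_embeddings = {
    "cuore": [0.9, 0.1, 0.0],
    "miocardio": [0.8, 0.2, 0.1],
    "polmone": [0.3, 0.8, 0.1],
    "frattura": [0.0, 0.1, 0.9],
}
toy_pairs = [
    ("cuore", "miocardio", 3.8),
    ("cuore", "polmone", 2.5),
    ("cuore", "frattura", 1.0),
    ("polmone", "aspirina", 1.5),  # skipped: "aspirina" is out of vocabulary
]

rho, n_used = evaluate(toy_embeddings, toy_pairs)
print(f"Spearman rho = {rho:.3f} over {n_used} pairs")
```

A rank-based measure is used in this sketch because human ratings and cosine values live on different scales, so only the ordering of the pairs is compared.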
DC field Value Language
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.orgunit Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI en
dc.authority.people Cardillo F. A. en
dc.authority.people Debole F. en
dc.authority.project corda__h2020::a0520286b7e6ad6e9c871d7ab8f2c196 en
dc.authority.project corda__h2020::b9871e3e08a9db98aaa42bf321ed0f1a en
dc.authority.project corda_____he::86c21b1aa82d5bdc53411947d7ebd9f8 en
dc.collection.id.s 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d *
dc.collection.name 04.01 Contributo in Atti di convegno *
dc.contributor.appartenenza Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.appartenenza.mi 973 *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.date.accessioned 2024/10/08 16:33:55 -
dc.date.available 2024/10/08 16:33:55 -
dc.date.firstsubmission 2024/10/06 21:13:17 *
dc.date.issued 2024 -
dc.date.submission 2025/02/27 10:59:57 *
dc.description.allpeople Cardillo, F. A.; Debole, F. -
dc.description.allpeopleoriginal Cardillo F.A.; Debole F. en
dc.description.fulltext open en
dc.description.international no en
dc.description.numberofauthors 2 -
dc.identifier.isbn 978-2-493814-10-4 en
dc.identifier.scopus 2-s2.0-85195995799 en
dc.identifier.source manual *
dc.identifier.uri https://hdl.handle.net/20.500.14243/505144 -
dc.identifier.url https://aclanthology.org/2024.lrec-main.824 en
dc.language.iso eng en
dc.relation.conferencedate 20-25/05/2024 en
dc.relation.conferencename LREC-COLING 2024 - 24th Joint International Conference on Computational Linguistics, Language Resources and Evaluation en
dc.relation.conferenceplace Torino, Italy en
dc.relation.firstpage 9434 en
dc.relation.ispartofbook Proceedings of the LREC-COLING 2024 en
dc.relation.lastpage 9440 en
dc.relation.medium ELETTRONICO en
dc.relation.numberofpages 7 en
dc.relation.projectAcronym DeepHealth en
dc.relation.projectAcronym TAILOR en
dc.relation.projectAcronym STARWARS en
dc.relation.projectAwardNumber 825111 en
dc.relation.projectAwardNumber 952215 en
dc.relation.projectAwardNumber 101086252 en
dc.relation.projectAwardTitle Deep-Learning and HPC to Boost Biomedical Applications for Health en
dc.relation.projectAwardTitle Foundations of Trustworthy AI - Integrating Reasoning, Learning and Optimization en
dc.relation.projectAwardTitle STormwAteR and WastewAteR networkS heterogeneous data AI-driven management en
dc.relation.projectFunderName European Commission en
dc.relation.projectFunderName European Commission en
dc.relation.projectFunderName European Commission en
dc.relation.projectFundingStream Horizon 2020 Framework Programme en
dc.relation.projectFundingStream Horizon 2020 Framework Programme en
dc.relation.projectFundingStream Horizon Europe Framework Programme en
dc.subject.keywordseng NLP -
dc.subject.keywordseng Distributed Representations -
dc.subject.singlekeyword NLP *
dc.subject.singlekeyword Distributed Representations *
dc.title Italian word embeddings for the medical domain en
dc.type.circulation Internazionale en
dc.type.driver info:eu-repo/semantics/conferenceObject -
dc.type.full 04 Contributo in convegno::04.01 Contributo in Atti di convegno it
dc.type.invited contributo en
dc.type.miur 273 -
dc.type.referee Comitato scientifico en
iris.mediafilter.data 2025/04/06 02:55:22 *
iris.orcid.lastModifiedDate 2025/02/27 17:25:02 *
iris.orcid.lastModifiedMillisecond 1740673502010 *
iris.scopus.extIssued 2024 -
iris.scopus.extTitle Italian Word Embeddings for the Medical Domain -
iris.sitodocente.maxattempts 1 -
scopus.category 2614 *
scopus.category 1706 *
scopus.category 1703 *
scopus.contributor.affiliation Institute of Information Science and Technologies Consiglio Nazionale delle Ricerche -
scopus.contributor.affiliation Institute of Information Science and Technologies Consiglio Nazionale delle Ricerche -
scopus.contributor.afid 60008218 -
scopus.contributor.afid 60008218 -
scopus.contributor.auid 57191090133 -
scopus.contributor.auid 22333451000 -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.dptid 131356648 -
scopus.contributor.dptid 131356648 -
scopus.contributor.name Franco Alberto -
scopus.contributor.name Franca -
scopus.contributor.subaffiliation Institute for Computational Linguistics; -
scopus.contributor.subaffiliation Institute for Computational Linguistics; -
scopus.contributor.surname Cardillo -
scopus.contributor.surname Debole -
scopus.date.issued 2024 *
scopus.description.allpeopleoriginal Cardillo F.A.; Debole F. *
scopus.differences scopus.relation.conferencename *
scopus.differences scopus.publisher.name *
scopus.differences scopus.relation.conferencedate *
scopus.differences scopus.identifier.isbn *
scopus.differences scopus.relation.conferenceplace *
scopus.document.type cp *
scopus.document.types cp *
scopus.funding.funders 501100004462 - Consiglio Nazionale delle Ricerche; 100010661 - Horizon 2020 Framework Programme; *
scopus.funding.ids 101086252; *
scopus.identifier.isbn 9782493814104 *
scopus.identifier.pui 644494424 *
scopus.identifier.scopus 2-s2.0-85195995799 *
scopus.journal.sourceid 21101227955 *
scopus.language.iso eng *
scopus.publisher.name European Language Resources Association (ELRA) *
scopus.relation.conferencedate 2024 *
scopus.relation.conferencename Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024 *
scopus.relation.conferenceplace ita *
scopus.relation.firstpage 9434 *
scopus.relation.lastpage 9440 *
scopus.title Italian Word Embeddings for the Medical Domain *
scopus.titleeng Italian Word Embeddings for the Medical Domain *
Appears in types: 04.01 Contributo in Atti di convegno
Files in this item:
File: 2024.lrec-main.824-2.pdf
Access: open access
Description: Paper in proceedings
Type: Publisher's version (PDF)
License: Creative Commons
Size: 271.43 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/505144