Italian word embeddings for the medical domain
Cardillo F. A.; Debole F.
2024
Abstract
Neural word embeddings have proven valuable in the development of medical applications. However, for the Italian language, there are no publicly available corpora, embeddings, or evaluation resources tailored to this domain. In this paper, we introduce an Italian corpus for the medical domain that includes texts from Wikipedia, medical journals, drug leaflets, and specialized websites. Using this corpus, we generate neural word embeddings from scratch. These embeddings are then evaluated using standard evaluation resources that we translated into Italian by exploiting the concept graph in the UMLS Metathesaurus. Despite the relatively small size of the corpus, our experimental results indicate that the new embeddings correlate well with human judgments regarding the similarity and the relatedness of medical concepts. Moreover, these medical-specific embeddings outperform a baseline model trained on the full Wikipedia corpus, which includes the medical pages we used. We believe that our embeddings and the newly introduced textual resources will foster further advancements in the field of Italian medical Natural Language Processing.

| DC Field | Value | Language |
|---|---|---|
| dc.authority.orgunit | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | en |
| dc.authority.orgunit | Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI | en |
| dc.authority.people | Cardillo F. A. | en |
| dc.authority.people | Debole F. | en |
| dc.authority.project | corda__h2020::a0520286b7e6ad6e9c871d7ab8f2c196 | en |
| dc.authority.project | corda__h2020::b9871e3e08a9db98aaa42bf321ed0f1a | en |
| dc.authority.project | corda_____he::86c21b1aa82d5bdc53411947d7ebd9f8 | en |
| dc.collection.id.s | 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d | * |
| dc.collection.name | 04.01 Contributo in Atti di convegno | * |
| dc.contributor.appartenenza | Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI | * |
| dc.contributor.appartenenza | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | * |
| dc.contributor.appartenenza.mi | 918 | * |
| dc.contributor.appartenenza.mi | 973 | * |
| dc.contributor.area | Non assegn | * |
| dc.contributor.area | Non assegn | * |
| dc.date.accessioned | 2024/10/08 16:33:55 | - |
| dc.date.available | 2024/10/08 16:33:55 | - |
| dc.date.firstsubmission | 2024/10/06 21:13:17 | * |
| dc.date.issued | 2024 | - |
| dc.date.submission | 2025/02/27 10:59:57 | * |
| dc.description.allpeople | Cardillo, F. A.; Debole, F. | - |
| dc.description.allpeopleoriginal | Cardillo F.A.; Debole F. | en |
| dc.description.fulltext | open | en |
| dc.description.international | no | en |
| dc.description.numberofauthors | 2 | - |
| dc.identifier.isbn | 978-2-493814-10-4 | en |
| dc.identifier.scopus | 2-s2.0-85195995799 | en |
| dc.identifier.source | manual | * |
| dc.identifier.uri | https://hdl.handle.net/20.500.14243/505144 | - |
| dc.identifier.url | https://aclanthology.org/2024.lrec-main.824 | en |
| dc.language.iso | eng | en |
| dc.relation.conferencedate | 20-25/05/2024 | en |
| dc.relation.conferencename | LREC-COLING 2024 - 24th Joint International Conference on Computational Linguistics, Language Resources and Evaluation | en |
| dc.relation.conferenceplace | Torino, Italy | en |
| dc.relation.firstpage | 9434 | en |
| dc.relation.ispartofbook | Proceedings of the LREC-COLING 2024 | en |
| dc.relation.lastpage | 9440 | en |
| dc.relation.medium | Electronic | en |
| dc.relation.numberofpages | 7 | en |
| dc.relation.projectAcronym | DeepHealth | en |
| dc.relation.projectAcronym | TAILOR | en |
| dc.relation.projectAcronym | STARWARS | en |
| dc.relation.projectAwardNumber | 825111 | en |
| dc.relation.projectAwardNumber | 952215 | en |
| dc.relation.projectAwardNumber | 101086252 | en |
| dc.relation.projectAwardTitle | Deep-Learning and HPC to Boost Biomedical Applications for Health | en |
| dc.relation.projectAwardTitle | Foundations of Trustworthy AI - Integrating Reasoning, Learning and Optimization | en |
| dc.relation.projectAwardTitle | STormwAteR and WastewAteR networkS heterogeneous data AI-driven management | en |
| dc.relation.projectFunderName | European Commission | en |
| dc.relation.projectFunderName | European Commission | en |
| dc.relation.projectFunderName | European Commission | en |
| dc.relation.projectFundingStream | Horizon 2020 Framework Programme | en |
| dc.relation.projectFundingStream | Horizon 2020 Framework Programme | en |
| dc.relation.projectFundingStream | Horizon Europe Framework Programme | en |
| dc.subject.keywordseng | NLP | - |
| dc.subject.keywordseng | Distributed Representations | - |
| dc.subject.singlekeyword | NLP | * |
| dc.subject.singlekeyword | Distributed Representations | * |
| dc.title | Italian word embeddings for the medical domain | en |
| dc.type.circulation | International | en |
| dc.type.driver | info:eu-repo/semantics/conferenceObject | - |
| dc.type.full | 04 Contributo in convegno::04.01 Contributo in Atti di convegno | it |
| dc.type.invited | contribution | en |
| dc.type.miur | 273 | - |
| dc.type.referee | Scientific committee | en |
| iris.mediafilter.data | 2025/04/06 02:55:22 | * |
| iris.orcid.lastModifiedDate | 2025/02/27 17:25:02 | * |
| iris.orcid.lastModifiedMillisecond | 1740673502010 | * |
| iris.scopus.extIssued | 2024 | - |
| iris.scopus.extTitle | Italian Word Embeddings for the Medical Domain | - |
| iris.sitodocente.maxattempts | 1 | - |
| scopus.category | 2614 | * |
| scopus.category | 1706 | * |
| scopus.category | 1703 | * |
| scopus.contributor.affiliation | Institute of Information Science and Technologies Consiglio Nazionale delle Ricerche | - |
| scopus.contributor.affiliation | Institute of Information Science and Technologies Consiglio Nazionale delle Ricerche | - |
| scopus.contributor.afid | 60008218 | - |
| scopus.contributor.afid | 60008218 | - |
| scopus.contributor.auid | 57191090133 | - |
| scopus.contributor.auid | 22333451000 | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.dptid | 131356648 | - |
| scopus.contributor.dptid | 131356648 | - |
| scopus.contributor.name | Franco Alberto | - |
| scopus.contributor.name | Franca | - |
| scopus.contributor.subaffiliation | Institute for Computational Linguistics; | - |
| scopus.contributor.subaffiliation | Institute for Computational Linguistics; | - |
| scopus.contributor.surname | Cardillo | - |
| scopus.contributor.surname | Debole | - |
| scopus.date.issued | 2024 | * |
| scopus.description.allpeopleoriginal | Cardillo F.A.; Debole F. | * |
| scopus.differences | scopus.relation.conferencename | * |
| scopus.differences | scopus.publisher.name | * |
| scopus.differences | scopus.relation.conferencedate | * |
| scopus.differences | scopus.identifier.isbn | * |
| scopus.differences | scopus.relation.conferenceplace | * |
| scopus.document.type | cp | * |
| scopus.document.types | cp | * |
| scopus.funding.funders | 501100004462 - Consiglio Nazionale delle Ricerche; 100010661 - Horizon 2020 Framework Programme; | * |
| scopus.funding.ids | 101086252; | * |
| scopus.identifier.isbn | 9782493814104 | * |
| scopus.identifier.pui | 644494424 | * |
| scopus.identifier.scopus | 2-s2.0-85195995799 | * |
| scopus.journal.sourceid | 21101227955 | * |
| scopus.language.iso | eng | * |
| scopus.publisher.name | European Language Resources Association (ELRA) | * |
| scopus.relation.conferencedate | 2024 | * |
| scopus.relation.conferencename | Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024 | * |
| scopus.relation.conferenceplace | ita | * |
| scopus.relation.firstpage | 9434 | * |
| scopus.relation.lastpage | 9440 | * |
| scopus.title | Italian Word Embeddings for the Medical Domain | * |
| scopus.titleeng | Italian Word Embeddings for the Medical Domain | * |
| Appears in types: | 04.01 Contributo in Atti di convegno | |

| File | Access | Description | Type | License | Size | Format |
|---|---|---|---|---|---|---|
| 2024.lrec-main.824-2.pdf | Open access | Paper in proceedings | Publisher's version (PDF) | Creative Commons | 271.43 kB | Adobe PDF |


