Doctor, Is That You? Evaluating Large Language Models on Italy’s Medical School Entrance Exams

Piperno, Ruben; Bonfigli, Agnese; Dell'Orletta, Felice; Pecchia, Leandro; Merone, Mario; Bacco, Luca
2025

Abstract

In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of linguistic and cognitive tasks. This study investigates whether such models can succeed in one of Europe’s most selective academic assessments: the Italian medical school entrance exam. We evaluate a wide selection of open-weights LLMs, ranging from natively Italian-pretrained models to multilingual and Italian-specialised variants, on a benchmark dataset comprising over 3,300 real-world exam questions across five knowledge domains. Our experiments systematically explore the impact of language-specific pretraining, model size, prompt formulation and instruction tuning on exam performance. Results show that large multilingual models, particularly the Gemma-2-9B family, consistently outperform all other systems, surpassing the official admission threshold under all prompting settings. In contrast, models trained exclusively on Italian data fail to reach this threshold, even with larger architectures or instruction tuning. Additional analyses reveal that high-performing models display lower positional bias and greater inter-model consistency. These findings suggest that cross-domain reasoning and multilingual pretraining are key to handling multi-disciplinary educational tasks.
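
The evaluation setting described above (prompting open-weights models on multiple-choice exam questions and comparing accuracy against the admission threshold) can be illustrated with a minimal sketch. The model name, the Italian prompt wording, the dataset fields, and the score-by-letter heuristic below are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of a multiple-choice exam evaluation loop (assumed setup, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-2-9b-it"    # assumption: one of the Gemma-2-9B variants named in the abstract
LETTERS = ["A", "B", "C", "D", "E"]  # the Italian admission test uses five answer options

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def predict_letter(question: str, options: list[str]) -> str:
    """Pick the option letter whose token receives the highest next-token logit."""
    listed = "\n".join(f"{letter}. {opt}" for letter, opt in zip(LETTERS, options))
    prompt = f"Domanda: {question}\n{listed}\nRisposta:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # distribution over the next token
    scores = {
        letter: next_token_logits[tokenizer.encode(" " + letter, add_special_tokens=False)[0]].item()
        for letter in LETTERS
    }
    return max(scores, key=scores.get)

def exam_accuracy(items: list[dict]) -> float:
    """items: hypothetical records with 'question', 'options' (list of 5), 'gold' (letter)."""
    hits = sum(predict_letter(it["question"], it["options"]) == it["gold"] for it in items)
    return hits / len(items)
```

The positional bias mentioned in the abstract could then be probed, for example, by re-running the same items with the answer options permuted and checking whether the predicted letter follows the content or the position; this is only one plausible protocol, not necessarily the one used in the paper.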
DC Field | Value | Language
dc.authority.orgunit | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | en
dc.authority.people | Piperno, Ruben | en
dc.authority.people | Bonfigli, Agnese | en
dc.authority.people | Dell'Orletta, Felice | en
dc.authority.people | Pecchia, Leandro | en
dc.authority.people | Merone, Mario | en
dc.authority.people | Bacco Luca | en
dc.collection.id.s | 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d | *
dc.collection.name | 04.01 Contributo in Atti di convegno | *
dc.contributor.appartenenza | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | *
dc.contributor.appartenenza.mi | 918 | *
dc.contributor.area | Non assegn | *
dc.date.accessioned | 2026/03/03 17:32:06 | -
dc.date.available | 2026/03/03 17:32:06 | -
dc.date.firstsubmission | 2026/03/03 17:08:00 | *
dc.date.issued | 2025 | -
dc.date.submission | 2026/03/03 17:08:00 | *
dc.description.allpeople | Piperno, Ruben; Bonfigli, Agnese; Dell'Orletta, Felice; Pecchia, Leandro; Merone, Mario; Bacco, Luca | -
dc.description.allpeopleoriginal | Piperno, Ruben; Bonfigli, Agnese; Dell'Orletta, Felice; Pecchia, Leandro; Merone, Mario; Bacco Luca | en
dc.description.fulltext | open | en
dc.description.numberofauthors | 6 | -
dc.identifier.source | manual | *
dc.identifier.uri | https://hdl.handle.net/20.500.14243/570762 | -
dc.language.iso | eng | en
dc.relation.ispartofbook | Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025) | en
dc.subject.keywords | Large Language Models (LLMs) | -
dc.subject.keywords | Italian Medical Admission Test | -
dc.subject.keywords | NLP in healthcare | -
dc.subject.singlekeyword | Large Language Models (LLMs) | *
dc.subject.singlekeyword | Italian Medical Admission Test | *
dc.subject.singlekeyword | NLP in healthcare | *
dc.title | Doctor, Is That You? Evaluating Large Language Models on Italy’s Medical School Entrance Exams | en
dc.type.driver | info:eu-repo/semantics/conferenceObject | -
dc.type.full | 04 Contributo in convegno::04.01 Contributo in Atti di convegno | it
dc.type.miur | 273 | -
iris.mediafilter.data | 2026/03/04 02:52:10 | *
iris.orcid.lastModifiedDate | 2026/03/03 17:32:06 | *
iris.orcid.lastModifiedMillisecond | 1772555526084 | *
iris.sitodocente.maxattempts | 1 | -
Appears in types: 04.01 Contributo in Atti di convegno
Files in this item:
File | Access | License | Size | Format
84_main_long.pdf | open access | Creative Commons | 439.23 kB | Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/570762