Doctor, Is That You? Evaluating Large Language Models on Italy’s Medical School Entrance Exams
Piperno, Ruben; Bonfigli, Agnese; Dell'Orletta, Felice; Pecchia, Leandro; Merone, Mario; Bacco, Luca
2025
Abstract
In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of linguistic and cognitive tasks. This study investigates whether such models can succeed in one of Europe’s most selective academic assessments: the Italian medical school entrance exam. We evaluate a wide selection of open-weight LLMs, ranging from natively Italian-pretrained models to multilingual and Italian-specialised variants, on a benchmark dataset comprising over 3,300 real-world exam questions across five knowledge domains. Our experiments systematically explore the impact of language-specific pretraining, model size, prompt formulation, and instruction tuning on exam performance. Results show that large multilingual models, particularly the Gemma-2-9B family, consistently outperform all other systems, surpassing the official admission threshold under all prompting settings. In contrast, models trained exclusively on Italian data fail to reach this threshold, even with larger architectures or instruction tuning. Additional analyses reveal that high-performing models display lower positional bias and greater inter-model consistency. These findings suggest that cross-domain reasoning and multilingual pretraining are key to handling multi-disciplinary educational tasks.

| DC Field | Value | Language |
|---|---|---|
| dc.authority.orgunit | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | en |
| dc.authority.people | Piperno, Ruben | en |
| dc.authority.people | Bonfigli, Agnese | en |
| dc.authority.people | Dell'Orletta, Felice | en |
| dc.authority.people | Pecchia, Leandro | en |
| dc.authority.people | Merone, Mario | en |
| dc.authority.people | Bacco, Luca | en |
| dc.collection.id.s | 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d | * |
| dc.collection.name | 04.01 Contribution in Conference Proceedings | * |
| dc.contributor.appartenenza | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | * |
| dc.contributor.appartenenza.mi | 918 | * |
| dc.contributor.area | Not assigned | * |
| dc.date.accessioned | 2026/03/03 17:32:06 | - |
| dc.date.available | 2026/03/03 17:32:06 | - |
| dc.date.firstsubmission | 2026/03/03 17:08:00 | * |
| dc.date.issued | 2025 | - |
| dc.date.submission | 2026/03/03 17:08:00 | * |
| dc.description.allpeople | Piperno, Ruben; Bonfigli, Agnese; Dell'Orletta, Felice; Pecchia, Leandro; Merone, Mario; Bacco, Luca | - |
| dc.description.allpeopleoriginal | Piperno, Ruben; Bonfigli, Agnese; Dell'Orletta, Felice; Pecchia, Leandro; Merone, Mario; Bacco Luca | en |
| dc.description.fulltext | open | en |
| dc.description.numberofauthors | 6 | - |
| dc.identifier.source | manual | * |
| dc.identifier.uri | https://hdl.handle.net/20.500.14243/570762 | - |
| dc.language.iso | eng | en |
| dc.relation.ispartofbook | Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025) | en |
| dc.subject.keywords | Large Language Models (LLMs) | - |
| dc.subject.keywords | Italian Medical Admission Test | - |
| dc.subject.keywords | NLP in healthcare | - |
| dc.subject.singlekeyword | Large Language Models (LLMs) | * |
| dc.subject.singlekeyword | Italian Medical Admission Test | * |
| dc.subject.singlekeyword | NLP in healthcare | * |
| dc.title | Doctor, Is That You? Evaluating Large Language Models on Italy’s Medical School Entrance Exams | en |
| dc.type.driver | info:eu-repo/semantics/conferenceObject | - |
| dc.type.full | 04 Conference contribution::04.01 Contribution in Conference Proceedings | it |
| dc.type.miur | 273 | - |
| iris.mediafilter.data | 2026/03/04 02:52:10 | * |
| iris.orcid.lastModifiedDate | 2026/03/03 17:32:06 | * |
| iris.orcid.lastModifiedMillisecond | 1772555526084 | * |
| iris.sitodocente.maxattempts | 1 | - |
Appears in collections: 04.01 Contribution in Conference Proceedings
| File | Size | Format |
|---|---|---|
| 84_main_long.pdf (open access, Creative Commons licence) | 439.23 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


