Doctor, Is That You? Evaluating Large Language Models on Italy’s Medical School Entrance Exams

Piperno, Ruben; Bonfigli, Agnese; Dell'Orletta, Felice; Pecchia, Leandro; Merone, Mario; Bacco, Luca
2025

Abstract

In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of linguistic and cognitive tasks. This study investigates whether such models can succeed in one of Europe’s most selective academic assessments: the Italian medical school entrance exam. We evaluate a wide selection of open-weights LLMs, ranging from natively Italian-pretrained models to multilingual and Italian-specialised variants, on a benchmark dataset comprising over 3,300 real-world exam questions across five knowledge domains. Our experiments systematically explore the impact of language-specific pretraining, model size, prompt formulation and instruction tuning on exam performance. Results show that large multilingual models, particularly the Gemma-2-9B family, consistently outperform all other systems, surpassing the official admission threshold under all prompting settings. In contrast, models trained exclusively on Italian data fail to reach this threshold, even with larger architectures or instruction tuning. Additional analyses reveal that high-performing models display lower positional bias and greater inter-model consistency. These findings suggest that cross-domain reasoning and multilingual pretraining are key to handling multi-disciplinary educational tasks.
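
The evaluation setting described above (prompting open-weights models on multiple-choice exam questions and comparing accuracy against the admission threshold) can be illustrated with a minimal sketch. The model name, the Italian prompt wording, the dataset fields, and the score-by-letter heuristic below are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of a multiple-choice exam evaluation loop (assumed setup, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-2-9b-it"    # assumption: one of the Gemma-2-9B variants named in the abstract
LETTERS = ["A", "B", "C", "D", "E"]  # the Italian admission test uses five answer options

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def predict_letter(question: str, options: list[str]) -> str:
    """Pick the option letter whose token receives the highest next-token logit."""
    listed = "\n".join(f"{letter}. {opt}" for letter, opt in zip(LETTERS, options))
    prompt = f"Domanda: {question}\n{listed}\nRisposta:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # distribution over the next token
    scores = {
        letter: next_token_logits[tokenizer.encode(" " + letter, add_special_tokens=False)[0]].item()
        for letter in LETTERS
    }
    return max(scores, key=scores.get)

def exam_accuracy(items: list[dict]) -> float:
    """items: hypothetical records with 'question', 'options' (list of 5), 'gold' (letter)."""
    hits = sum(predict_letter(it["question"], it["options"]) == it["gold"] for it in items)
    return hits / len(items)
```

The positional bias mentioned in the abstract could then be probed, for example, by re-running the same items with the answer options permuted and checking whether the predicted letter follows the content or the position; this is only one plausible protocol, not necessarily the one used in the paper.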
DC Field | Value | Language
dc.authority.orgunit | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | en
dc.authority.people | Piperno, Ruben | en
dc.authority.people | Bonfigli, Agnese | en
dc.authority.people | Dell'Orletta, Felice | en
dc.authority.people | Pecchia, Leandro | en
dc.authority.people | Merone, Mario | en
dc.authority.people | Bacco Luca | en
dc.collection.id.s | 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d | *
dc.collection.name | 04.01 Contributo in Atti di convegno | *
dc.contributor.appartenenza | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | *
dc.contributor.appartenenza.mi | 918 | *
dc.contributor.area | Non assegn | *
dc.date.accessioned | 2026/03/03 17:32:06 | -
dc.date.available | 2026/03/03 17:32:06 | -
dc.date.firstsubmission | 2026/03/03 17:08:00 | *
dc.date.issued | 2025 | -
dc.date.submission | 2026/03/03 17:08:00 | *
dc.description.allpeople | Piperno, Ruben; Bonfigli, Agnese; Dell'Orletta, Felice; Pecchia, Leandro; Merone, Mario; Bacco, Luca | -
dc.description.allpeopleoriginal | Piperno, Ruben; Bonfigli, Agnese; Dell'Orletta, Felice; Pecchia, Leandro; Merone, Mario; Bacco Luca | en
dc.description.fulltext | open | en
dc.description.numberofauthors | 6 | -
dc.identifier.source | manual | *
dc.identifier.uri | https://hdl.handle.net/20.500.14243/570762 | -
dc.language.iso | eng | en
dc.relation.ispartofbook | Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025) | en
dc.subject.keywords | Large Language Models (LLMs) | -
dc.subject.keywords | Italian Medical Admission Test | -
dc.subject.keywords | NLP in healthcare | -
dc.subject.singlekeyword | Large Language Models (LLMs) | *
dc.subject.singlekeyword | Italian Medical Admission Test | *
dc.subject.singlekeyword | NLP in healthcare | *
dc.title | Doctor, Is That You? Evaluating Large Language Models on Italy’s Medical School Entrance Exams | en
dc.type.driver | info:eu-repo/semantics/conferenceObject | -
dc.type.full | 04 Contributo in convegno::04.01 Contributo in Atti di convegno | it
dc.type.miur | 273 | -
iris.mediafilter.data | 2026/03/04 02:52:10 | *
iris.orcid.lastModifiedDate | 2026/03/03 17:32:06 | *
iris.orcid.lastModifiedMillisecond | 1772555526084 | *
iris.sitodocente.maxattempts | 1 | -
Appears in types: 04.01 Contributo in Atti di convegno
Files in this item:
File | Access | License | Size | Format
84_main_long.pdf | open access | Creative Commons | 439.23 kB | Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/570762