CNR Institutional Research Information System

All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark

Testa D.; Bonetta G.; Bernardi R.; Bondielli A.; Lenci A.; Miaschi A.; Passaro L.; Magnini B.

2025

Abstract

We introduce MAIA (Multimodal AI Assessment), a native-Italian benchmark designed for fine-grained investigation of the reasoning abilities of visual language models on videos. MAIA differs from other available video benchmarks in its design, its reasoning categories, the metric it uses, and the language and culture of the videos. MAIA evaluates Vision Language Models (VLMs) on two aligned tasks: a visual statement verification task and an open-ended visual question-answering task, both on the same set of video-related questions. It considers twelve reasoning categories that aim to disentangle language and vision relations by highlighting the role of the visual input. Thanks to its carefully thought-out design, it simultaneously evaluates VLMs’ consistency and their visually grounded natural language comprehension and generation through an aggregated metric, which reveals low results that highlight the models’ fragility. Last but not least, the video collection has been carefully selected to reflect Italian culture, and the language data are produced by native speakers.
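The abstract does not define the aggregated metric, so the following is only an illustrative sketch of one way two aligned tasks could be scored jointly: an item is credited only when the model succeeds on both the statement verification task and the open-ended task for the same question. The `ItemResult` fields, the `aggregated_accuracy` function, and the category labels are hypothetical placeholders, not names taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemResult:
    """Outcome for one video-related question (all field names are hypothetical)."""
    category: str          # placeholder reasoning-category label
    verification_ok: bool  # statement verification judged correct
    generation_ok: bool    # open-ended answer judged correct


def aggregated_accuracy(results):
    """Return, per category, the share of items solved on BOTH tasks.

    Assumption: joint success is a logical AND of the two task outcomes.
    The paper's actual aggregation may differ.
    """
    solved = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r.category] += 1
        if r.verification_ok and r.generation_ok:
            solved[r.category] += 1
    return {cat: solved[cat] / total[cat] for cat in total}


# Toy data, invented purely for illustration.
results = [
    ItemResult("category_a", True, True),
    ItemResult("category_a", True, False),
    ItemResult("category_b", False, True),
    ItemResult("category_b", True, True),
]
scores = aggregated_accuracy(results)
print(scores)
```

In this toy data each task taken alone is correct on three of four items, while the joint score is one half in each category, which illustrates why a consistency-sensitive aggregate can come out lower than either single-task accuracy.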

Full record (DC)

DC field	Value	Language
dc.authority.orgunit	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	en
dc.authority.people	Testa D.	en
dc.authority.people	Bonetta G.	en
dc.authority.people	Bernardi R.	en
dc.authority.people	Bondielli A.	en
dc.authority.people	Lenci A.	en
dc.authority.people	Miaschi A.	en
dc.authority.people	Passaro L.	en
dc.authority.people	Magnini B.	en
dc.collection.id.s	71c7200a-7c5f-4e83-8d57-d3d2ba88f40d	*
dc.collection.name	04.01 Contributo in Atti di convegno	*
dc.contributor.appartenenza	ASR - Unità Contratti di lavoro	*
dc.contributor.appartenenza	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	*
dc.contributor.appartenenza.mi	918	*
dc.contributor.appartenenza.mi	1181	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.date.accessioned	2026/03/03 16:49:55	-
dc.date.available	2026/03/03 16:49:55	-
dc.date.firstsubmission	2026/03/03 15:41:02	*
dc.date.issued	2025	-
dc.date.submission	2026/03/03 15:41:02	*
dc.description.abstracteng	We introduce MAIA (Multimodal AI Assessment), a native-Italian benchmark designed for fine-grained investigation of the reasoning abilities of visual language models on videos. MAIA differs from other available video benchmarks in its design, its reasoning categories, the metric it uses, and the language and culture of the videos. MAIA evaluates Vision Language Models (VLMs) on two aligned tasks: a visual statement verification task and an open-ended visual question-answering task, both on the same set of video-related questions. It considers twelve reasoning categories that aim to disentangle language and vision relations by highlighting the role of the visual input. Thanks to its carefully thought-out design, it simultaneously evaluates VLMs’ consistency and their visually grounded natural language comprehension and generation through an aggregated metric, which reveals low results that highlight the models’ fragility. Last but not least, the video collection has been carefully selected to reflect Italian culture, and the language data are produced by native speakers.	-
dc.description.allpeople	Testa, D.; Bonetta, G.; Bernardi, R.; Bondielli, A.; Lenci, A.; Miaschi, A.; Passaro, L.; Magnini, B.	-
dc.description.allpeopleoriginal	Testa D.; Bonetta G.; Bernardi R.; Bondielli A.; Lenci A.; Miaschi A.; Passaro L.; Magnini B.	en
dc.description.fulltext	open	en
dc.description.numberofauthors	8	-
dc.identifier.doi	10.18653/v1/2025.findings-emnlp.1091	en
dc.identifier.scopus	2-s2.0-105028947199	en
dc.identifier.source	scopus	*
dc.identifier.uri	https://hdl.handle.net/20.500.14243/570743	-
dc.language.iso	eng	en
dc.relation.firstpage	20030	en
dc.relation.ispartofbook	Findings of the Association for Computational Linguistics: EMNLP 2025	en
dc.relation.lastpage	20050	en
dc.relation.numberofpages	21	en
dc.subject.keywordseng	multimodal, vllm, multimodal reasoning	-
dc.subject.singlekeyword	multimodal	*
dc.subject.singlekeyword	vllm	*
dc.subject.singlekeyword	multimodal reasoning	*
dc.title	All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark	en
dc.type.driver	info:eu-repo/semantics/conferenceObject	-
dc.type.full	04 Contributo in convegno::04.01 Contributo in Atti di convegno	it
dc.type.miur	273	-
iris.mediafilter.data	2026/03/04 02:52:09	*
iris.orcid.lastModifiedDate	2026/03/03 16:49:55	*
iris.orcid.lastModifiedMillisecond	1772552995601	*
iris.scopus.extIssued	2025	-
iris.scopus.extTitle	All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark	-
iris.sitodocente.maxattempts	1	-
iris.unpaywall.bestoaversion	publishedVersion	*
iris.unpaywall.doi	10.18653/v1/2025.findings-emnlp.1091	*
iris.unpaywall.isoa	true	*
iris.unpaywall.journalisindoaj	false	*
iris.unpaywall.landingpage	https://doi.org/10.18653/v1/2025.findings-emnlp.1091	*
iris.unpaywall.license	cc-by	*
iris.unpaywall.metadataCallLastModified	04/03/2026 04:34:39	-
iris.unpaywall.metadataCallLastModifiedMillisecond	1772595279066	-
iris.unpaywall.oastatus	gold	*
iris.unpaywall.pdfurl	https://aclanthology.org/2025.findings-emnlp.1091.pdf	*
scopus.category	1710	*
scopus.category	3310	*
scopus.category	1706	*
scopus.category	1703	*
scopus.contributor.affiliation	Fondazione Bruno Kessler (FBK)	-
scopus.contributor.affiliation	Fondazione Bruno Kessler (FBK)	-
scopus.contributor.affiliation	Free University of Bozen-Bolzano	-
scopus.contributor.affiliation	University of Pisa	-
scopus.contributor.affiliation	University of Pisa	-
scopus.contributor.affiliation	ItaliaNLP Lab	-
scopus.contributor.affiliation	University of Pisa	-
scopus.contributor.affiliation	Fondazione Bruno Kessler (FBK)	-
scopus.contributor.afid	60083112	-
scopus.contributor.afid	60083112	-
scopus.contributor.afid	60009914	-
scopus.contributor.afid	60028868	-
scopus.contributor.afid	60028868	-
scopus.contributor.afid	60008941	-
scopus.contributor.afid	60028868	-
scopus.contributor.afid	60083112	-
scopus.contributor.auid	58711995500	-
scopus.contributor.auid	57216831819	-
scopus.contributor.auid	57189506691	-
scopus.contributor.auid	57192938063	-
scopus.contributor.auid	8286541500	-
scopus.contributor.auid	57211678681	-
scopus.contributor.auid	57192941166	-
scopus.contributor.auid	22433254600	-
scopus.contributor.country		-
scopus.contributor.country		-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country		-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.dptid	114510308	-
scopus.contributor.dptid	114510308	-
scopus.contributor.dptid	114087935	-
scopus.contributor.dptid	109696702	-
scopus.contributor.dptid		-
scopus.contributor.name	Davide	-
scopus.contributor.name	Giovanni	-
scopus.contributor.name	Raffaella	-
scopus.contributor.name	Alessandro	-
scopus.contributor.name	Alessandro	-
scopus.contributor.name	Alessio	-
scopus.contributor.name	Lucia	-
scopus.contributor.name	Bernardo	-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation	CoLing Lab;Dept. of Philology;Literature and Linguistics;	-
scopus.contributor.subaffiliation	CoLing Lab;Dept. of Philology;Literature and Linguistics;	-
scopus.contributor.subaffiliation	Istituto di Linguistica Computazionale "A. Zampolli" (CNR-ILC);	-
scopus.contributor.subaffiliation	Dept. of Computer Science;	-
scopus.contributor.subaffiliation		-
scopus.contributor.surname	Testa	-
scopus.contributor.surname	Bonetta	-
scopus.contributor.surname	Bernardi	-
scopus.contributor.surname	Bondielli	-
scopus.contributor.surname	Lenci	-
scopus.contributor.surname	Miaschi	-
scopus.contributor.surname	Passaro	-
scopus.contributor.surname	Magnini	-
scopus.date.issued	2025	*
scopus.description.abstracteng	We introduce MAIA (Multimodal AI Assessment), a native-Italian benchmark designed for fine-grained investigation of the reasoning abilities of visual language models on videos. MAIA differs from other available video benchmarks in its design, its reasoning categories, the metric it uses, and the language and culture of the videos. MAIA evaluates Vision Language Models (VLMs) on two aligned tasks: a visual statement verification task and an open-ended visual question-answering task, both on the same set of video-related questions. It considers twelve reasoning categories that aim to disentangle language and vision relations by highlighting the role of the visual input. Thanks to its carefully thought-out design, it simultaneously evaluates VLMs’ consistency and their visually grounded natural language comprehension and generation through an aggregated metric, which reveals low results that highlight the models’ fragility. Last but not least, the video collection has been carefully selected to reflect Italian culture, and the language data are produced by native speakers.	*
scopus.description.allpeopleoriginal	Testa D.; Bonetta G.; Bernardi R.; Bondielli A.; Lenci A.; Miaschi A.; Passaro L.; Magnini B.	*
scopus.differences	scopus.publisher.name	*
scopus.differences	scopus.relation.conferencedate	*
scopus.differences	scopus.description.abstracteng	*
scopus.differences	scopus.relation.conferencename	*
scopus.differences	scopus.identifier.isbn	*
scopus.differences	scopus.relation.conferenceplace	*
scopus.document.type	cp	*
scopus.document.types	cp	*
scopus.funding.funders	501100004271 - Sapienza Università di Roma; 501100000780 - European Commission; 100018703 - HORIZON EUROPE European Innovation Council; 100018703 - HORIZON EUROPE European Innovation Council; 501100021856 - Ministero dell'Università e della Ricerca; 501100021856 - Ministero dell'Università e della Ricerca;	*
scopus.funding.ids	101070918; DM MUR 1062/2021;	*
scopus.identifier.doi	10.18653/v1/2025.findings-emnlp.1091	*
scopus.identifier.isbn	9798891763357	*
scopus.identifier.pui	650082362	*
scopus.identifier.scopus	2-s2.0-105028947199	*
scopus.journal.sourceid	21101390612	*
scopus.language.iso	eng	*
scopus.publisher.name	Association for Computational Linguistics (ACL)	*
scopus.relation.conferencedate	2025	*
scopus.relation.conferencename	30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025	*
scopus.relation.conferenceplace	chn	*
scopus.relation.firstpage	20030	*
scopus.relation.lastpage	20050	*
scopus.title	All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark	*
scopus.titleeng	All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark	*
Appears in types:	04.01 Contributo in Atti di convegno

Files in this item:

File	Size	Format
2025.findings-emnlp.1091.pdf (open access; license: Creative Commons)	2.12 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/570743
