All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark

Lenci A.; Miaschi A.
2025

Abstract

We introduce MAIA (Multimodal AI Assessment), a native-Italian benchmark designed for fine-grained investigation of the reasoning abilities of visual language models on videos. MAIA differs from other available video benchmarks in its design, its reasoning categories, the metric it uses, and the language and culture of the videos. MAIA evaluates Vision Language Models (VLMs) on two aligned tasks, a visual statement verification task and an open-ended visual question-answering task, both built on the same set of video-related questions. It considers twelve reasoning categories that aim to disentangle language and vision relations by highlighting the role of the visual input. Thanks to its carefully thought-out design, it evaluates VLMs’ consistency and their visually grounded natural language comprehension and generation simultaneously through an aggregated metric; the low results it reveals highlight the models’ fragility. Finally, the video collection has been carefully selected to reflect Italian culture, and the language data are produced by native speakers.
DC Field Value Language
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Testa D. en
dc.authority.people Bonetta G. en
dc.authority.people Bernardi R. en
dc.authority.people Bondielli A. en
dc.authority.people Lenci A. en
dc.authority.people Miaschi A. en
dc.authority.people Passaro L. en
dc.authority.people Magnini B. en
dc.collection.id.s 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d *
dc.collection.name 04.01 Contributo in Atti di convegno *
dc.contributor.appartenenza ASR - Unità Contratti di lavoro *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.appartenenza.mi 1181 *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.date.accessioned 2026/03/03 16:49:55 -
dc.date.available 2026/03/03 16:49:55 -
dc.date.firstsubmission 2026/03/03 15:41:02 *
dc.date.issued 2025 -
dc.date.submission 2026/03/03 15:41:02 *
dc.description.allpeople Testa, D.; Bonetta, G.; Bernardi, R.; Bondielli, A.; Lenci, A.; Miaschi, A.; Passaro, L.; Magnini, B. -
dc.description.allpeopleoriginal Testa D.; Bonetta G.; Bernardi R.; Bondielli A.; Lenci A.; Miaschi A.; Passaro L.; Magnini B. en
dc.description.fulltext open en
dc.description.numberofauthors 8 -
dc.identifier.doi 10.18653/v1/2025.findings-emnlp.1091 en
dc.identifier.scopus 2-s2.0-105028947199 en
dc.identifier.source scopus *
dc.identifier.uri https://hdl.handle.net/20.500.14243/570743 -
dc.language.iso eng en
dc.relation.firstpage 20030 en
dc.relation.ispartofbook Findings of the Association for Computational Linguistics: EMNLP 2025 en
dc.relation.lastpage 20050 en
dc.relation.numberofpages 21 en
dc.subject.keywordseng multimodal, vllm, multimodal reasoning -
dc.subject.singlekeyword multimodal *
dc.subject.singlekeyword vllm *
dc.subject.singlekeyword multimodal reasoning *
dc.title All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark en
dc.type.driver info:eu-repo/semantics/conferenceObject -
dc.type.full 04 Contributo in convegno::04.01 Contributo in Atti di convegno it
dc.type.miur 273 -
iris.mediafilter.data 2026/03/04 02:52:09 *
iris.orcid.lastModifiedDate 2026/03/03 16:49:55 *
iris.orcid.lastModifiedMillisecond 1772552995601 *
iris.scopus.extIssued 2025 -
iris.scopus.extTitle All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark -
iris.sitodocente.maxattempts 1 -
iris.unpaywall.bestoaversion publishedVersion *
iris.unpaywall.doi 10.18653/v1/2025.findings-emnlp.1091 *
iris.unpaywall.isoa true *
iris.unpaywall.journalisindoaj false *
iris.unpaywall.landingpage https://doi.org/10.18653/v1/2025.findings-emnlp.1091 *
iris.unpaywall.license cc-by *
iris.unpaywall.metadataCallLastModified 04/03/2026 04:34:39 -
iris.unpaywall.metadataCallLastModifiedMillisecond 1772595279066 -
iris.unpaywall.oastatus gold *
iris.unpaywall.pdfurl https://aclanthology.org/2025.findings-emnlp.1091.pdf *
scopus.category 1710 *
scopus.category 3310 *
scopus.category 1706 *
scopus.category 1703 *
scopus.contributor.affiliation Fondazione Bruno Kessler (FBK) -
scopus.contributor.affiliation Fondazione Bruno Kessler (FBK) -
scopus.contributor.affiliation Free University of Bozen-Bolzano -
scopus.contributor.affiliation University of Pisa -
scopus.contributor.affiliation University of Pisa -
scopus.contributor.affiliation ItaliaNLP Lab -
scopus.contributor.affiliation University of Pisa -
scopus.contributor.affiliation Fondazione Bruno Kessler (FBK) -
scopus.contributor.afid 60083112 -
scopus.contributor.afid 60083112 -
scopus.contributor.afid 60009914 -
scopus.contributor.afid 60028868 -
scopus.contributor.afid 60028868 -
scopus.contributor.afid 60008941 -
scopus.contributor.afid 60028868 -
scopus.contributor.afid 60083112 -
scopus.contributor.auid 58711995500 -
scopus.contributor.auid 57216831819 -
scopus.contributor.auid 57189506691 -
scopus.contributor.auid 57192938063 -
scopus.contributor.auid 8286541500 -
scopus.contributor.auid 57211678681 -
scopus.contributor.auid 57192941166 -
scopus.contributor.auid 22433254600 -
scopus.contributor.country -
scopus.contributor.country -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid 114510308 -
scopus.contributor.dptid 114510308 -
scopus.contributor.dptid 114087935 -
scopus.contributor.dptid 109696702 -
scopus.contributor.dptid -
scopus.contributor.name Davide -
scopus.contributor.name Giovanni -
scopus.contributor.name Raffaella -
scopus.contributor.name Alessandro -
scopus.contributor.name Alessandro -
scopus.contributor.name Alessio -
scopus.contributor.name Lucia -
scopus.contributor.name Bernardo -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation CoLing Lab;Dept. of Philology;Literature and Linguistics; -
scopus.contributor.subaffiliation CoLing Lab;Dept. of Philology;Literature and Linguistics; -
scopus.contributor.subaffiliation Istituto di Linguistica Computazionale "A. Zampolli" (CNR-ILC); -
scopus.contributor.subaffiliation Dept. of Computer Science; -
scopus.contributor.subaffiliation -
scopus.contributor.surname Testa -
scopus.contributor.surname Bonetta -
scopus.contributor.surname Bernardi -
scopus.contributor.surname Bondielli -
scopus.contributor.surname Lenci -
scopus.contributor.surname Miaschi -
scopus.contributor.surname Passaro -
scopus.contributor.surname Magnini -
scopus.date.issued 2025 *
scopus.description.allpeopleoriginal Testa D.; Bonetta G.; Bernardi R.; Bondielli A.; Lenci A.; Miaschi A.; Passaro L.; Magnini B. *
scopus.differences scopus.publisher.name *
scopus.differences scopus.relation.conferencedate *
scopus.differences scopus.description.abstracteng *
scopus.differences scopus.relation.conferencename *
scopus.differences scopus.identifier.isbn *
scopus.differences scopus.relation.conferenceplace *
scopus.document.type cp *
scopus.document.types cp *
scopus.funding.funders 501100004271 - Sapienza Università di Roma; 501100000780 - European Commission; 100018703 - HORIZON EUROPE European Innovation Council; 100018703 - HORIZON EUROPE European Innovation Council; 501100021856 - Ministero dell'Università e della Ricerca; 501100021856 - Ministero dell'Università e della Ricerca; *
scopus.funding.ids 101070918; DM MUR 1062/2021; *
scopus.identifier.doi 10.18653/v1/2025.findings-emnlp.1091 *
scopus.identifier.isbn 9798891763357 *
scopus.identifier.pui 650082362 *
scopus.identifier.scopus 2-s2.0-105028947199 *
scopus.journal.sourceid 21101390612 *
scopus.language.iso eng *
scopus.publisher.name Association for Computational Linguistics (ACL) *
scopus.relation.conferencedate 2025 *
scopus.relation.conferencename 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025 *
scopus.relation.conferenceplace chn *
scopus.relation.firstpage 20030 *
scopus.relation.lastpage 20050 *
scopus.title All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark *
scopus.titleeng All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark *
Appears in types: 04.01 Contributo in Atti di convegno (contribution in conference proceedings)
Files in this item:
File Size Format
2025.findings-emnlp.1091.pdf (open access; license: Creative Commons) 2.12 MB Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/570743