All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark
Lenci A.; Miaschi A.
2025
Abstract
We introduce MAIA (Multimodal AI Assessment), a native-Italian benchmark designed for fine-grained investigation of the reasoning abilities of visual language models on videos. MAIA differs from other available video benchmarks in its design, its reasoning categories, the metric it uses, and the language and culture of the videos. MAIA evaluates Vision Language Models (VLMs) on two aligned tasks: a visual statement verification task and an open-ended visual question-answering task, both on the same set of video-related questions. It considers twelve reasoning categories that aim to disentangle language and vision relations by highlighting the role of the visual input. Thanks to its carefully thought-out design, it simultaneously evaluates VLMs’ consistency and their visually grounded natural language comprehension and generation through an aggregated metric, revealing low results that highlight the models’ fragility. Last but not least, the video collection has been carefully selected to reflect Italian culture, and the language data are produced by native speakers.

| DC Field | Value | Language |
|---|---|---|
| dc.authority.orgunit | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | en |
| dc.authority.people | Testa D. | en |
| dc.authority.people | Bonetta G. | en |
| dc.authority.people | Bernardi R. | en |
| dc.authority.people | Bondielli A. | en |
| dc.authority.people | Lenci A. | en |
| dc.authority.people | Miaschi A. | en |
| dc.authority.people | Passaro L. | en |
| dc.authority.people | Magnini B. | en |
| dc.collection.id.s | 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d | * |
| dc.collection.name | 04.01 Contributo in Atti di convegno | * |
| dc.contributor.appartenenza | ASR - Unità Contratti di lavoro | * |
| dc.contributor.appartenenza | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | * |
| dc.contributor.appartenenza.mi | 918 | * |
| dc.contributor.appartenenza.mi | 1181 | * |
| dc.contributor.area | Non assegn | * |
| dc.contributor.area | Non assegn | * |
| dc.date.accessioned | 2026/03/03 16:49:55 | - |
| dc.date.available | 2026/03/03 16:49:55 | - |
| dc.date.firstsubmission | 2026/03/03 15:41:02 | * |
| dc.date.issued | 2025 | - |
| dc.date.submission | 2026/03/03 15:41:02 | * |
| dc.description.abstracteng | We introduce MAIA (Multimodal AI Assessment), a native-Italian benchmark designed for fine-grained investigation of the reasoning abilities of visual language models on videos. MAIA differs from other available video benchmarks for its design, its reasoning categories, the metric it uses, and the language and culture of the videos. MAIA evaluates Vision Language Models (VLMs) on two aligned tasks: a visual statement verification task, and an openended visual question-answering task, both on the same set of video-related questions. It considers twelve reasoning categories that aim to disentangle language and vision relations by highlighting the role of the visual input. Thanks to its carefully taught design, it evaluates VLMs’ consistency and visually grounded natural language comprehension and generation simultaneously through an aggregated metric revealing low results that highlight models’ fragility. Last but not least, the video collection has been carefully selected to reflect the Italian culture, and the language data are produced by native-speakers. 1 | - |
| dc.description.allpeople | Testa, D.; Bonetta, G.; Bernardi, R.; Bondielli, A.; Lenci, A.; Miaschi, A.; Passaro, L.; Magnini, B. | - |
| dc.description.allpeopleoriginal | Testa D.; Bonetta G.; Bernardi R.; Bondielli A.; Lenci A.; Miaschi A.; Passaro L.; Magnini B. | en |
| dc.description.fulltext | open | en |
| dc.description.numberofauthors | 8 | - |
| dc.identifier.doi | 10.18653/v1/2025.findings-emnlp.1091 | en |
| dc.identifier.scopus | 2-s2.0-105028947199 | en |
| dc.identifier.source | scopus | * |
| dc.identifier.uri | https://hdl.handle.net/20.500.14243/570743 | - |
| dc.language.iso | eng | en |
| dc.relation.firstpage | 20030 | en |
| dc.relation.ispartofbook | Findings of the Association for Computational Linguistics: EMNLP 2025 | en |
| dc.relation.lastpage | 20050 | en |
| dc.relation.numberofpages | 21 | en |
| dc.subject.keywordseng | multimodal, vllm, multimodal reasoning | - |
| dc.subject.singlekeyword | multimodal | * |
| dc.subject.singlekeyword | vllm | * |
| dc.subject.singlekeyword | multimodal reasoning | * |
| dc.title | All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark | en |
| dc.type.driver | info:eu-repo/semantics/conferenceObject | - |
| dc.type.full | 04 Contributo in convegno::04.01 Contributo in Atti di convegno | it |
| dc.type.miur | 273 | - |
| iris.mediafilter.data | 2026/03/04 02:52:09 | * |
| iris.orcid.lastModifiedDate | 2026/03/03 16:49:55 | * |
| iris.orcid.lastModifiedMillisecond | 1772552995601 | * |
| iris.scopus.extIssued | 2025 | - |
| iris.scopus.extTitle | All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark | - |
| iris.sitodocente.maxattempts | 1 | - |
| iris.unpaywall.bestoaversion | publishedVersion | * |
| iris.unpaywall.doi | 10.18653/v1/2025.findings-emnlp.1091 | * |
| iris.unpaywall.isoa | true | * |
| iris.unpaywall.journalisindoaj | false | * |
| iris.unpaywall.landingpage | https://doi.org/10.18653/v1/2025.findings-emnlp.1091 | * |
| iris.unpaywall.license | cc-by | * |
| iris.unpaywall.metadataCallLastModified | 04/03/2026 04:34:39 | - |
| iris.unpaywall.metadataCallLastModifiedMillisecond | 1772595279066 | - |
| iris.unpaywall.oastatus | gold | * |
| iris.unpaywall.pdfurl | https://aclanthology.org/2025.findings-emnlp.1091.pdf | * |
| scopus.category | 1710 | * |
| scopus.category | 3310 | * |
| scopus.category | 1706 | * |
| scopus.category | 1703 | * |
| scopus.contributor.affiliation | Fondazione Bruno Kessler (FBK) | - |
| scopus.contributor.affiliation | Fondazione Bruno Kessler (FBK) | - |
| scopus.contributor.affiliation | Free University of Bozen-Bolzano | - |
| scopus.contributor.affiliation | University of Pisa | - |
| scopus.contributor.affiliation | University of Pisa | - |
| scopus.contributor.affiliation | ItaliaNLP Lab | - |
| scopus.contributor.affiliation | University of Pisa | - |
| scopus.contributor.affiliation | Fondazione Bruno Kessler (FBK) | - |
| scopus.contributor.afid | 60083112 | - |
| scopus.contributor.afid | 60083112 | - |
| scopus.contributor.afid | 60009914 | - |
| scopus.contributor.afid | 60028868 | - |
| scopus.contributor.afid | 60028868 | - |
| scopus.contributor.afid | 60008941 | - |
| scopus.contributor.afid | 60028868 | - |
| scopus.contributor.afid | 60083112 | - |
| scopus.contributor.auid | 58711995500 | - |
| scopus.contributor.auid | 57216831819 | - |
| scopus.contributor.auid | 57189506691 | - |
| scopus.contributor.auid | 57192938063 | - |
| scopus.contributor.auid | 8286541500 | - |
| scopus.contributor.auid | 57211678681 | - |
| scopus.contributor.auid | 57192941166 | - |
| scopus.contributor.auid | 22433254600 | - |
| scopus.contributor.country | - | |
| scopus.contributor.country | - | |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | 114510308 | - |
| scopus.contributor.dptid | 114510308 | - |
| scopus.contributor.dptid | 114087935 | - |
| scopus.contributor.dptid | 109696702 | - |
| scopus.contributor.dptid | - | |
| scopus.contributor.name | Davide | - |
| scopus.contributor.name | Giovanni | - |
| scopus.contributor.name | Raffaella | - |
| scopus.contributor.name | Alessandro | - |
| scopus.contributor.name | Alessandro | - |
| scopus.contributor.name | Alessio | - |
| scopus.contributor.name | Lucia | - |
| scopus.contributor.name | Bernardo | - |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | CoLing Lab;Dept. of Philology;Literature and Linguistics; | - |
| scopus.contributor.subaffiliation | CoLing Lab;Dept. of Philology;Literature and Linguistics; | - |
| scopus.contributor.subaffiliation | Istituto di Linguistica Computazionale "A. Zampolli" (CNR-ILC); | - |
| scopus.contributor.subaffiliation | Dept. of Computer Science; | - |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.surname | Testa | - |
| scopus.contributor.surname | Bonetta | - |
| scopus.contributor.surname | Bernardi | - |
| scopus.contributor.surname | Bondielli | - |
| scopus.contributor.surname | Lenci | - |
| scopus.contributor.surname | Miaschi | - |
| scopus.contributor.surname | Passaro | - |
| scopus.contributor.surname | Magnini | - |
| scopus.date.issued | 2025 | * |
| scopus.description.abstracteng | We introduce MAIA (Multimodal AI Assessment), a native-Italian benchmark designed for fine-grained investigation of the reasoning abilities of visual language models on videos. MAIA differs from other available video benchmarks for its design, its reasoning categories, the metric it uses, and the language and culture of the videos. MAIA evaluates Vision Language Models (VLMs) on two aligned tasks: a visual statement verification task, and an open-ended visual question-answering task, both on the same set of video-related questions. It considers twelve reasoning categories that aim to disentangle language and vision relations by highlighting the role of the visual input. Thanks to its carefully taught design, it evaluates VLMs’ consistency and visually grounded natural language comprehension and generation simultaneously through an aggregated metric revealing low results that highlight models’ fragility. Last but not least, the video collection has been carefully selected to reflect the Italian culture, and the language data are produced by native-speakers. | * |
| scopus.description.allpeopleoriginal | Testa D.; Bonetta G.; Bernardi R.; Bondielli A.; Lenci A.; Miaschi A.; Passaro L.; Magnini B. | * |
| scopus.differences | scopus.publisher.name | * |
| scopus.differences | scopus.relation.conferencedate | * |
| scopus.differences | scopus.description.abstracteng | * |
| scopus.differences | scopus.relation.conferencename | * |
| scopus.differences | scopus.identifier.isbn | * |
| scopus.differences | scopus.relation.conferenceplace | * |
| scopus.document.type | cp | * |
| scopus.document.types | cp | * |
| scopus.funding.funders | 501100004271 - Sapienza Università di Roma; 501100000780 - European Commission; 100018703 - HORIZON EUROPE European Innovation Council; 100018703 - HORIZON EUROPE European Innovation Council; 501100021856 - Ministero dell'Università e della Ricerca; 501100021856 - Ministero dell'Università e della Ricerca; | * |
| scopus.funding.ids | 101070918; DM MUR 1062/2021; | * |
| scopus.identifier.doi | 10.18653/v1/2025.findings-emnlp.1091 | * |
| scopus.identifier.isbn | 9798891763357 | * |
| scopus.identifier.pui | 650082362 | * |
| scopus.identifier.scopus | 2-s2.0-105028947199 | * |
| scopus.journal.sourceid | 21101390612 | * |
| scopus.language.iso | eng | * |
| scopus.publisher.name | Association for Computational Linguistics (ACL) | * |
| scopus.relation.conferencedate | 2025 | * |
| scopus.relation.conferencename | 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025 | * |
| scopus.relation.conferenceplace | chn | * |
| scopus.relation.firstpage | 20030 | * |
| scopus.relation.lastpage | 20050 | * |
| scopus.title | All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark | * |
| scopus.titleeng | All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark | * |
| Appears in types: | 04.01 Contributo in Atti di convegno | |

| File | Size | Format |
|---|---|---|
| 2025.findings-emnlp.1091.pdf (open access; license: Creative Commons) | 2.12 MB | Adobe PDF |


