
Cascaded transformer-based networks for Wikipedia large-scale image-caption matching

Messina N; Coccomini DA; Esuli A; Falchi F
2024

Abstract

With the increasing importance of multimedia and multilingual data in online encyclopedias, novel methods are needed to fill domain gaps and automatically connect different modalities for increased accessibility. For example, Wikipedia is composed of millions of pages written in multiple languages. Images, when present, often lack textual context, thus remaining conceptually floating and harder to find and manage. In this work, we tackle the novel task of associating images from Wikipedia pages with the correct caption among a large pool of available ones written in multiple languages, as required by the image-caption matching Kaggle challenge organized by the Wikimedia Foundation. A system able to perform this task would improve the accessibility and completeness of the underlying multi-modal knowledge graph in online encyclopedias. We propose a cascade of two models powered by recent Transformer networks that efficiently and effectively infer a relevance score between the query image data and the captions. We verify through extensive experiments that the proposed cascaded approach effectively handles a large pool of images and captions while keeping the overall computational complexity at inference time bounded. With respect to other approaches on the challenge leaderboard, we achieve remarkable improvements over previous proposals (+8% in nDCG@5 with respect to the sixth position) with constrained resources. The code is publicly available at https://tinyurl.com/wiki-imcap.
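The abstract names the general pattern only at a high level: a cheap first-stage model filters the full caption pool, and an expensive second-stage model re-ranks the few survivors, so the costly model runs a bounded number of times per query. Purely as an illustration of that pattern, here is a minimal, self-contained sketch; the function names (encode_fast, score_joint, cascade_match), dimensions, and placeholder scoring logic are hypothetical and are not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    # Stage 1 stand-in: a fast dual encoder that embeds images and captions
    # independently, so caption embeddings can be precomputed once.
    # (Placeholder: the paper uses Transformer encoders here.)
    def encode_fast(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    # Stage 2 stand-in: a slower joint scorer that sees the image and one
    # caption together. (Placeholder: a real cross-encoder would attend
    # over both inputs jointly and be far more expensive per pair.)
    def score_joint(image_feat, caption_feat):
        return float(image_feat @ caption_feat)

    def cascade_match(image_feat, caption_feats, k=100, top=5):
        """Two-stage cascade: cheap filtering, then expensive re-ranking."""
        img = encode_fast(image_feat[None])[0]
        caps = encode_fast(caption_feats)
        # Stage 1: one matrix product over the whole caption pool (cheap).
        coarse = caps @ img
        shortlist = np.argsort(-coarse)[:k]
        # Stage 2: joint scoring only for the k shortlisted captions, so
        # the expensive model runs k times instead of len(caption_feats).
        fine = [(i, score_joint(img, caps[i])) for i in shortlist]
        fine.sort(key=lambda t: -t[1])
        return [i for i, _ in fine[:top]]

    # Toy pool: 10,000 caption vectors, one query image vector.
    captions = rng.normal(size=(10_000, 256))
    image = rng.normal(size=256)
    print(cascade_match(image, captions))

The design point is the cost split: stage 1 is a single matrix product over the whole pool, while the per-pair stage-2 cost is paid only for the top-k candidates, which is what keeps inference complexity bounded as the pool grows.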
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Multi-modal matching
Information retrieval
Deep learning
Transformer networks
Files in this record:

prod_491916-doc_205202.pdf

Open access

Description: Cascaded transformer-based networks for Wikipedia large-scale image-caption matching
Type: Published version (PDF)
License: Creative Commons
Size: 1.29 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/453529
Citations
  • PubMed Central: ND
  • Scopus: 0
  • Web of Science: ND