
Cascaded transformer-based networks for Wikipedia large-scale image-caption matching

Messina N; Coccomini DA; Esuli A; Falchi F
2024

Abstract

With the increasing importance of multimedia and multilingual data in online encyclopedias, novel methods are needed to fill domain gaps and automatically connect different modalities for increased accessibility. For example, Wikipedia is composed of millions of pages written in multiple languages. Images, when present, often lack textual context, thus remaining conceptually floating and harder to find and manage. In this work, we tackle the novel task of associating images from Wikipedia pages with the correct caption among a large pool of available ones written in multiple languages, as required by the image-caption matching Kaggle challenge organized by the Wikimedia Foundation. A system able to perform this task would improve the accessibility and completeness of the underlying multi-modal knowledge graph in online encyclopedias. We propose a cascade of two models powered by recent Transformer networks that efficiently and effectively infer a relevance score between the query image data and the captions. We verify through extensive experiments that the proposed cascaded approach effectively handles a large pool of images and captions while keeping the overall computational complexity at inference time bounded. With respect to other approaches on the challenge leaderboard, we achieve remarkable improvements over previous proposals (+8% in nDCG@5 with respect to the sixth position) with constrained resources. The code is publicly available at https://tinyurl.com/wiki-imcap.
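The abstract names the general pattern only at a high level: a cheap first-stage model filters the full caption pool, and an expensive second-stage model re-ranks the few survivors, so the costly model runs a bounded number of times per query. Purely as an illustration of that pattern, here is a minimal, self-contained sketch; the function names (encode_fast, score_joint, cascade_match), dimensions, and placeholder scoring logic are hypothetical and are not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    # Stage 1 stand-in: a fast dual encoder that embeds images and captions
    # independently, so caption embeddings can be precomputed once.
    # (Placeholder: the paper uses Transformer encoders here.)
    def encode_fast(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    # Stage 2 stand-in: a slower joint scorer that sees the image and one
    # caption together. (Placeholder: a real cross-encoder would attend
    # over both inputs jointly and be far more expensive per pair.)
    def score_joint(image_feat, caption_feat):
        return float(image_feat @ caption_feat)

    def cascade_match(image_feat, caption_feats, k=100, top=5):
        """Two-stage cascade: cheap filtering, then expensive re-ranking."""
        img = encode_fast(image_feat[None])[0]
        caps = encode_fast(caption_feats)
        # Stage 1: one matrix product over the whole caption pool (cheap).
        coarse = caps @ img
        shortlist = np.argsort(-coarse)[:k]
        # Stage 2: joint scoring only for the k shortlisted captions, so
        # the expensive model runs k times instead of len(caption_feats).
        fine = [(i, score_joint(img, caps[i])) for i in shortlist]
        fine.sort(key=lambda t: -t[1])
        return [i for i, _ in fine[:top]]

    # Toy pool: 10,000 caption vectors, one query image vector.
    captions = rng.normal(size=(10_000, 256))
    image = rng.normal(size=256)
    print(cascade_match(image, captions))

The design point is the cost split: stage 1 is a single matrix product over the whole pool, while the per-pair stage-2 cost is paid only for the top-k candidates, which is what keeps inference complexity bounded as the pool grows.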
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Multi-modal matching
Information retrieval
Deep learning
Transformer networks
Files in this record:

prod_491916-doc_205202.pdf

Open access

Description: Cascaded transformer-based networks for Wikipedia large-scale image-caption matching
Type: Published version (PDF)
License: Creative Commons
Size: 1.29 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/453529
Citations
  • PubMed Central: ND
  • Scopus: 0
  • Web of Science: ND