Cascaded transformer-based networks for Wikipedia large-scale image-caption matching
Messina N.; Coccomini D. A.; Esuli A.; Falchi F.
2024
Abstract
With the increasing importance of multimedia and multilingual data in online encyclopedias, novel methods are needed to fill domain gaps and automatically connect different modalities for increased accessibility. For example, Wikipedia is composed of millions of pages written in multiple languages. Images, when present, often lack textual context, thus remaining conceptually floating and harder to find and manage. In this work, we tackle the novel task of associating images from Wikipedia pages with the correct caption among a large pool of available ones written in multiple languages, as required by the image-caption matching Kaggle challenge organized by the Wikimedia Foundation. A system able to perform this task would improve the accessibility and completeness of the underlying multi-modal knowledge graph in online encyclopedias. We propose a cascade of two models powered by recent Transformer networks, able to efficiently and effectively infer a relevance score between the query image data and the captions. We verify through extensive experiments that the proposed cascaded approach effectively handles a large pool of images and captions while keeping the overall computational complexity bounded at inference time. With respect to other approaches in the challenge leaderboard, we achieve remarkable improvements over the previous proposals (+8% in nDCG@5 with respect to the sixth position) with constrained resources. The code is publicly available at https://tinyurl.com/wiki-imcap.
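As an illustration of the cascaded idea described above, the sketch below shows a generic two-stage matching pipeline: a cheap similarity pass prunes the caption pool, and only the surviving candidates are re-scored by a more expensive relevance model, which keeps inference cost bounded. The function names, dimensions, and the dummy re-ranker are hypothetical placeholders for this example and are not taken from the paper's actual models.

```python
import torch

def cascade_rank(image_emb, caption_embs, rerank_fn, k=100, top=5):
    """Hypothetical two-stage cascade for image-caption matching.

    Stage 1: cheap dot-product scores prune the caption pool to k candidates.
    Stage 2: a more expensive scorer (in the paper, a Transformer-based
    relevance model; here, whatever `rerank_fn` is) re-ranks only those k.
    """
    # Stage 1: fast similarity between one image and every caption.
    coarse = caption_embs @ image_emb                # shape (N,)
    cand = torch.topk(coarse, k).indices             # indices of the k best captions

    # Stage 2: precise (but slow) relevance scores computed on the shortlist only.
    fine = torch.stack([rerank_fn(image_emb, caption_embs[i]) for i in cand])
    order = torch.argsort(fine, descending=True)[:top]
    return cand[order]                               # final top-`top` caption indices

if __name__ == "__main__":
    torch.manual_seed(0)
    img = torch.randn(256)                           # assumed embedding size
    caps = torch.randn(10_000, 256)
    # Placeholder re-ranker; stands in for the paper's second-stage model.
    dummy_rerank = lambda a, b: torch.cosine_similarity(a, b, dim=0)
    print(cascade_rank(img, caps, dummy_rerank))
```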
File | Description | Type | License | Size | Format
---|---|---|---|---|---
prod_491916-doc_205202.pdf (open access) | Cascaded transformer-based networks for Wikipedia large-scale image-caption matching | Publisher's version (PDF) | Creative Commons | 1.29 MB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.