Picture it in your mind: generating high level visual representations from textual descriptions

Carrara F.; Esuli A.; Fagni T.; Falchi F.; Moreo Fernandez A.
2018

Abstract

In this paper we tackle the problem of image search when the query is a short textual description of the image the user is looking for. We implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation. Searching in the visual feature space has the advantage that any update to the translation model does not require reprocessing the (typically huge) image collection on which the search is performed. We propose various neural network models of increasing complexity that learn to generate, from a short descriptive text, a high-level visual representation in a visual feature space such as the pool5 layer of ResNet-152 or the fc6-fc7 layers of an AlexNet trained on the ILSVRC12 and Places databases. The Text2Vis models we explore include (1) a relatively simple regressor network relying on a bag-of-words representation of the textual descriptions, (2) a deep recurrent network that is sensitive to word order, and (3) a wide-and-deep model that combines a stacked LSTM deep network with a wide regressor network. We compare the proposed models with other search strategies, including textual search methods that exploit state-of-the-art caption generation models to index the image collection.
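The simplest of the three Text2Vis variants lends itself to a compact illustration. Below is a minimal sketch (not the authors' code) of a bag-of-words regressor that maps a textual query into a 2048-dimensional visual space, the dimensionality of ResNet-152 pool5, and ranks a precomputed image index by cosine similarity. The vocabulary size, hidden width, and all names are illustrative assumptions.

```python
# Minimal sketch of the bag-of-words Text2Vis variant described in the
# abstract: regress a BoW text vector into the visual feature space, then
# run similarity search over precomputed image features. Hyperparameters
# and names are assumptions, not the authors' settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 10000   # assumed vocabulary size for the BoW query encoding
VISUAL_DIM = 2048    # ResNet-152 pool5 feature dimensionality

class BowText2Vis(nn.Module):
    """Regress a bag-of-words text vector into the visual feature space."""
    def __init__(self, vocab_size=VOCAB_SIZE, hidden=1024, visual_dim=VISUAL_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, visual_dim),
        )

    def forward(self, bow):
        return self.net(bow)

def search(model, bow_query, image_features, k=5):
    """Rank images by cosine similarity between the predicted visual
    vector and the precomputed image features (one row per image)."""
    with torch.no_grad():
        pred = model(bow_query)                           # (1, VISUAL_DIM)
        sims = F.cosine_similarity(pred, image_features)  # (num_images,)
        return sims.topk(k).indices

# Usage: the image collection is indexed once with visual features;
# retraining the text-to-visual regressor never touches this index.
model = BowText2Vis()
index = torch.randn(1000, VISUAL_DIM)   # stand-in for real image features
query = torch.zeros(1, VOCAB_SIZE)
query[0, [12, 407, 933]] = 1.0          # toy BoW query
print(search(model, query, index))
```

Note that the image index is computed once and reused; swapping or retraining the text-to-visual regressor leaves it untouched, which is precisely the advantage the abstract highlights.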
Istituto di informatica e telematica - IIT
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Image Retrieval
Cross-media Retrieval
Text Representation
Files in this product:

File: prod_384745-doc_185904.pdf
Description: Postprint - Picture it in your mind: generating high level visual representations from textual descriptions
Type: Publisher's version (PDF)
Access: open access
Size: 7.1 MB
Format: Adobe PDF

File: prod_384745-doc_185905.pdf
Description: Picture it in your mind: generating high level visual representations from textual descriptions
Type: Publisher's version (PDF)
Access: authorized users only
Size: 1.96 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/346792
Citations
  • PMC: not available
  • Scopus: 18
  • Web of Science: not available