Similarity caching in large-scale image retrieval

Falchi F;Lucchese C;Perego R;Rabitti F
2012

Abstract

Feature-rich data, such as audio-video recordings, digital images, and results of scientific experiments, now constitute the largest fraction of the massive data sets produced daily in the e-society. Content-based similarity search systems working on such data collections are rapidly growing in importance. Unfortunately, similarity search is in general very expensive and scales poorly. In this paper we study the case of content-based image retrieval (CBIR) systems, and focus on the problem of increasing the throughput of a large-scale CBIR system that indexes a very large collection of digital images. By analyzing the query log of a real CBIR system available on the Web, we characterize the behavior of users who experience a novel search paradigm, where content-based similarity queries and text-based ones can easily be interleaved. We show that locality and self-similarity are present even in the stream of queries submitted to such a CBIR system. Based on these results, we propose an effective way to exploit this locality by means of a similarity caching system, which stores recently or frequently submitted queries together with their results. Unlike traditional caching, the proposed cache can manage not only exact hits, but also approximate ones, which are answered by similarity with respect to the result sets of past queries present in the cache. We extensively evaluate the proposed solution using the real query stream recorded in the log and a collection of 100 million digital photographs. The high hit ratios and small average approximation errors obtained demonstrate the effectiveness of the approach.
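
The idea of a cache that answers both exact and approximate hits can be illustrated with a minimal sketch. It is not the authors' algorithm: the class name `SimilarityCache`, the fixed `radius` threshold, the LRU eviction policy, the Euclidean distance, and the representation of images as plain feature tuples are all assumptions made here for illustration only.

```python
import math
from collections import OrderedDict


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class SimilarityCache:
    """Hypothetical LRU cache that also serves approximate hits.

    Keys are query feature tuples; values are lists of result feature tuples.
    """

    def __init__(self, capacity, radius, distance=euclidean):
        self.capacity = capacity      # maximum number of cached queries
        self.radius = radius          # assumed threshold for an approximate hit
        self.distance = distance
        self.entries = OrderedDict()  # least recently used entry comes first

    def lookup(self, query, k):
        """Return (kind, results) with kind in {'exact', 'approximate', 'miss'}."""
        query = tuple(query)
        if query in self.entries:
            self.entries.move_to_end(query)  # refresh recency
            return "exact", self.entries[query][:k]

        # Linear scan for the closest cached query (a real system would index this).
        best_key, best_dist = None, float("inf")
        for key in self.entries:
            d = self.distance(query, key)
            if d < best_dist:
                best_key, best_dist = key, d

        if best_key is not None and best_dist <= self.radius:
            self.entries.move_to_end(best_key)
            # Re-rank the cached results by their distance to the new query.
            ranked = sorted(self.entries[best_key],
                            key=lambda obj: self.distance(query, obj))
            return "approximate", ranked[:k]
        return "miss", None

    def insert(self, query, results):
        """Store the results computed by the back-end index for a missed query."""
        query = tuple(query)
        self.entries[query] = list(results)
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
```

On a miss, the caller would forward the query to the underlying CBIR index and pass the returned results to `insert`. The approximation error of an approximate hit depends on how `radius` is chosen, which this sketch leaves as a free parameter.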
Year: 2012
Affiliation: Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Language: English
Volume: 48
Issue: 5
Pages: 803-818 (16 pages)
http://www.sciencedirect.com/science/article/pii/S030645731000107X
Yes, but type not specified
Caching
Multimedia search
Content-based search
Large scale
H.5.1 Multimedia Information Systems
Project type: EU_FP7; Project: Software Services and Systems Network (S-Cube); Acronym: S-CUBE; Grant agreement: 215483
5
info:eu-repo/semantics/article
262
Falchi, F; Lucchese, C; Orlando, S; Perego, R; Rabitti, F
01 Journal contribution::01.01 Journal article
restricted
Files in this record:

prod_199497-doc_43714.pdf (authorized users only)
Description: Similarity caching in large-scale image retrieval
Type: Publisher's version (PDF)
Size: 4.83 MB
Format: Adobe PDF

prod_199497-doc_90292.pdf (authorized users only)
Description: Similarity caching in large-scale image retrieval
Type: Publisher's version (PDF)
Size: 837.29 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/21682
Citations
  • PMC: not available
  • Scopus: 24
  • ISI: 19