Granting efficient accesses to the index is a key issue for the performances of Web Search Engines (WSE). In order to enhance memory utilization and favor fast query resolution, WSEs use Inverted File (IF) indexes where the posting lists are stored as sequences of d_gaps (i.e. differences among successive document identifiers) compressed using variable length encoding methods. This paper describes the use of a lightweight clustering algorithm aimed at assigning the identifiers to documents in a way that minimizes the average values of d_gaps. The simulations performed on a real dataset, i.e. the Google contest collection, show that our approach allows to obtain an IF index which is, depending on the d_gap encoding chosen, up to 23% smaller than the one built over randomly assigned document identifiers. Moreover, we will show, both analytically and empirically, that the complexity of our algorithm is linear in space and time.

Assigning document identifiers to enhance compressibility of web search engines indexes

Silvestri F;Perego R;Orlando S
2004

Abstract

Granting efficient accesses to the index is a key issue for the performances of Web Search Engines (WSE). In order to enhance memory utilization and favor fast query resolution, WSEs use Inverted File (IF) indexes where the posting lists are stored as sequences of d_gaps (i.e. differences among successive document identifiers) compressed using variable length encoding methods. This paper describes the use of a lightweight clustering algorithm aimed at assigning the identifiers to documents in a way that minimizes the average values of d_gaps. The simulations performed on a real dataset, i.e. the Google contest collection, show that our approach allows to obtain an IF index which is, depending on the d_gap encoding chosen, up to 23% smaller than the one built over randomly assigned document identifiers. Moreover, we will show, both analytically and empirically, that the complexity of our algorithm is linear in space and time.
2004
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
1-58113-812-1
Compression
Information retrieval
Clustering
File in questo prodotto:
File Dimensione Formato  
prod_91052-doc_125281.pdf

solo utenti autorizzati

Descrizione: Assigning document identifiers to enhance compressibility of Web Search Engines indexes
Tipologia: Versione Editoriale (PDF)
Dimensione 300.1 kB
Formato Adobe PDF
300.1 kB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/57512
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 23
  • ???jsp.display-item.citation.isi??? ND
social impact