Sorting out the document identifier assignment problem

Silvestri F
2007

Abstract

The compression of inverted file indexes in Web search engines has received considerable attention in recent years. Compressing the index not only reduces space occupancy but also improves overall retrieval performance, since it allows better exploitation of the memory hierarchy. In this paper we show empirically that, for collections of Web documents, the performance of compression algorithms can be enhanced simply by assigning identifiers to documents according to the lexicographical ordering of their URLs. We validate this hypothesis by comparing several assignment techniques and several compression algorithms on a fairly large collection of about six million documents. The results are very encouraging: the compression ratio improves by up to 40% using an algorithm that takes about ninety seconds to finish and uses only 100 MB of main memory.
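As a rough illustration of the idea in the abstract, the sketch below assigns document identifiers by sorting URLs lexicographically and compares the size of one posting list under that assignment with its size under an arbitrary crawl order. Everything here is an assumption made for illustration: the function names, the `a.example` / `b.example` URLs, and the choice of Elias gamma coding of d-gaps as the size measure. The abstract does not say which compression algorithms the paper evaluates.

```python
from typing import Dict, Iterable, List


def assign_ids_by_url(urls: Iterable[str]) -> Dict[str, int]:
    """Map each distinct URL to a docID following lexicographic URL order."""
    return {url: doc_id for doc_id, url in enumerate(sorted(set(urls)))}


def d_gaps(postings: List[int]) -> List[int]:
    """Gaps between consecutive docIDs of a sorted posting list.

    The first gap is measured from -1, so every gap is at least 1.
    """
    return [cur - prev for prev, cur in zip([-1] + postings[:-1], postings)]


def gamma_bits(n: int) -> int:
    """Length in bits of the Elias gamma code of a positive integer."""
    return 2 * (n.bit_length() - 1) + 1


def posting_list_bits(doc_ids: Iterable[int]) -> int:
    """Total gamma-coded size of a posting list, in bits."""
    return sum(gamma_bits(g) for g in d_gaps(sorted(doc_ids)))


# Hypothetical crawl order that interleaves pages from two hosts.
crawl_order = [
    "http://b.example/1", "http://a.example/1",
    "http://b.example/2", "http://a.example/2",
    "http://b.example/3", "http://a.example/3",
]

# Assumed posting list: a term that occurs on every page of a.example.
term_pages = [u for u in crawl_order if u.startswith("http://a.example/")]

crawl_ids = {url: i for i, url in enumerate(crawl_order)}
url_ids = assign_ids_by_url(crawl_order)

crawl_bits = posting_list_bits(crawl_ids[u] for u in term_pages)
url_bits = posting_list_bits(url_ids[u] for u in term_pages)
print(crawl_bits, url_bits)
```

In this toy case, pages of the same host receive consecutive identifiers under URL ordering, so the gaps shrink to 1 and the gamma-coded list is shorter. Whether real collections behave this way is exactly what the paper measures.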
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
H.3 Information Storage and Retrieval
H.3.1 Content Analysis and Indexing. Indexing Methods
Identifier assignment
Indexing technique
Information retrieval index compression
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/43612