In this paper we study the problem of estimating the number of occurrences of substrings in textual data: A text on some alphabet of length is preprocessed and an index is built. The index is used in lieu of the text to answer queries of the form , returning an approximated number of the occurrences of an arbitrary pattern as a substring of . The problem has its main application in selectivity estimation related to the LIKE predicate in textual databases. Our focus is on obtaining an algorithmic solution with guaranteed error rates and small footprint. To achieve that, we first enrich previous work in the area of compressed text-indexing providing an optimal data structure that, for a given additive error , requires bits. We also approach the issue of guaranteeing exact answers for sufficiently frequent patterns, providing a data structure whose size scales with the amount of such patterns. Our theoretical findings are supported by experiments showing the practical impact of our data structures.

Space-Efficient Substring Occurrence Estimation

Venturini R
2016

Abstract

In this paper we study the problem of estimating the number of occurrences of substrings in textual data: A text on some alphabet of length is preprocessed and an index is built. The index is used in lieu of the text to answer queries of the form , returning an approximated number of the occurrences of an arbitrary pattern as a substring of . The problem has its main application in selectivity estimation related to the LIKE predicate in textual databases. Our focus is on obtaining an algorithmic solution with guaranteed error rates and small footprint. To achieve that, we first enrich previous work in the area of compressed text-indexing providing an optimal data structure that, for a given additive error , requires bits. We also approach the issue of guaranteeing exact answers for sufficiently frequent patterns, providing a data structure whose size scales with the amount of such patterns. Our theoretical findings are supported by experiments showing the practical impact of our data structures.
2016
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Compressed full-text indexing
Pattern matching
Full-text indexing
Data structures
File in questo prodotto:
File Dimensione Formato  
prod_366887-doc_121223.pdf

solo utenti autorizzati

Descrizione: Space-Efficient Substring Occurrence Estimation
Tipologia: Versione Editoriale (PDF)
Dimensione 993.82 kB
Formato Adobe PDF
993.82 kB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/329621
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 5
  • ???jsp.display-item.citation.isi??? ND
social impact