We study the problem of estimating the number of occurrences of substrings in textual data: A text T on some alphabet Sigma of size sigma is preprocessed and an index I is built. The index is used in lieu of the text to answer queries of the form CountH(P), returning an approximated number of the occurrences of an arbitrary pattern P as a substring of T. The problem has its main application in selectivity estimation related to the LIKE predicate in textual databases. Our focus is on obtaining an algorithmic solution with guaranteed error rates and small footprint. To achieve that, we first enrich previous work in the area of compressed text-indexing providing an optimal data structure that requires (|T|log sigma/l) bits where l e 1 is the additive error on any answer. We also approach the issue of guaranteeing exact answers for sufficiently frequent patterns, providing a data structure whose size scales with the amount of such patterns. Our theoretical findings are sustained by experiments showing the practical impact of our data structures.

Space-efficient substring occurrence estimation

Venturini Rossano
2011

Abstract

We study the problem of estimating the number of occurrences of substrings in textual data: A text T on some alphabet Sigma of size sigma is preprocessed and an index I is built. The index is used in lieu of the text to answer queries of the form CountH(P), returning an approximated number of the occurrences of an arbitrary pattern P as a substring of T. The problem has its main application in selectivity estimation related to the LIKE predicate in textual databases. Our focus is on obtaining an algorithmic solution with guaranteed error rates and small footprint. To achieve that, we first enrich previous work in the area of compressed text-indexing providing an optimal data structure that requires (|T|log sigma/l) bits where l e 1 is the additive error on any answer. We also approach the issue of guaranteeing exact answers for sufficiently frequent patterns, providing a data structure whose size scales with the amount of such patterns. Our theoretical findings are sustained by experiments showing the practical impact of our data structures.
2011
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Inglese
30th symposium on Principles of Database Systems of Data, PODS 2011
95
106
978-1-4503-0660-7
http://portal.acm.org/citation.cfm?id=1989300&CFID=29642294&CFTOKEN=82582261
ACM Press
New York
STATI UNITI D'AMERICA
Sì, ma tipo non specificato
12-16 June 2011
Athens, Greece
Full text Index
Text indexing
ID_PUMA: cnr.isti/cnr.isti/2011-A2-018. - ID Modulo Commessa 2103 - ICT.P09.006.001 - 074 - Tecnologie avanzate, Sistemi e Servizi per Grid. - Area di valutazione 01 - Scienze matematiche e informatiche
1
restricted
Orlandi, Alessio ; Venturini, Rossano
273
info:eu-repo/semantics/conferenceObject
04 Contributo in convegno::04.01 Contributo in Atti di convegno
File in questo prodotto:
File Dimensione Formato  
prod_199911-doc_43901.pdf

solo utenti autorizzati

Descrizione: Space-efficient substring occurrence estimation
Tipologia: Versione Editoriale (PDF)
Dimensione 253.51 kB
Formato Adobe PDF
253.51 kB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/18975
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 6
  • ???jsp.display-item.citation.isi??? ND
social impact