CNR Institutional Research Information System

We study the problem of estimating the number of occurrences of substrings in textual data: A text T on some alphabet Sigma of size sigma is preprocessed and an index I is built. The index is used in lieu of the text to answer queries of the form CountH(P), returning an approximated number of the occurrences of an arbitrary pattern P as a substring of T. The problem has its main application in selectivity estimation related to the LIKE predicate in textual databases. Our focus is on obtaining an algorithmic solution with guaranteed error rates and small footprint. To achieve that, we first enrich previous work in the area of compressed text-indexing providing an optimal data structure that requires (|T|log sigma/l) bits where l e 1 is the additive error on any answer. We also approach the issue of guaranteeing exact answers for sufficiently frequent patterns, providing a data structure whose size scales with the amount of such patterns. Our theoretical findings are sustained by experiments showing the practical impact of our data structures.

Space-efficient substring occurrence estimation

Orlandi Alessio;Venturini Rossano

2011

Abstract

We study the problem of estimating the number of occurrences of substrings in textual data: A text T on some alphabet Sigma of size sigma is preprocessed and an index I is built. The index is used in lieu of the text to answer queries of the form CountH(P), returning an approximated number of the occurrences of an arbitrary pattern P as a substring of T. The problem has its main application in selectivity estimation related to the LIKE predicate in textual databases. Our focus is on obtaining an algorithmic solution with guaranteed error rates and small footprint. To achieve that, we first enrich previous work in the area of compressed text-indexing providing an optimal data structure that requires (|T|log sigma/l) bits where l e 1 is the additive error on any answer. We also approach the issue of guaranteeing exact answers for sufficiently frequent patterns, providing a data structure whose size scales with the amount of such patterns. Our theoretical findings are sustained by experiments showing the practical impact of our data structures.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2011
			
	Strutture organizzative
	
				Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
			
	Lingua/e
	
				Inglese
			
	Titolo del convegno
	
				30th symposium on Principles of Database Systems of Data, PODS 2011
			
	Da pagina
	
				95
			
	A pagina
	
				106
			
	Codice ISBN
	
				978-1-4503-0660-7
			
	Codice DOI
	
				https://dx.doi.org/10.1145/1989284.1989300
			
	URL
	
				http://portal.acm.org/citation.cfm?id=1989300&CFID=29642294&CFTOKEN=82582261
			
	Nome Editore
	
				ACM Press
			
	Città Editore
	
				New York
			
	Nazione Editore
	
				STATI UNITI D'AMERICA
			
	Referee
	
				Sì, ma tipo non specificato
			
	Periodo del Convegno
	
				12-16 June 2011
			
	Luogo del Convegno
	
				Athens, Greece
			
	Parole chiave
	
				Full text Index
Text indexing
			
	Altre informazioni
	
				ID_PUMA: cnr.isti/cnr.isti/2011-A2-018. - ID Modulo Commessa 2103 - ICT.P09.006.001 - 074 - Tecnologie avanzate, Sistemi e Servizi per Grid. - Area di valutazione 01 - Scienze matematiche e informatiche
			
	Codice Scopus
	
				2-s2.0-79960186706
			
	Numero autori
	
				1
			
	Fulltext
	
				restricted
			
	Tutti gli autori
	
						Orlandi, Alessio ; Venturini, Rossano
					
	Tipologia Login Miur
	
				273
			
	Tipologia
	
				info:eu-repo/semantics/conferenceObject
			
	Tipologia
	
				04 Contributo in convegno::04.01 Contributo in Atti di convegno
			
	Appare nelle tipologie:
	
				04.01 Contributo in Atti di convegno

File in questo prodotto:

File	Dimensione	Formato
prod_199911-doc_43901.pdf solo utenti autorizzati Descrizione: Space-efficient substring occurrence estimation Tipologia: Versione Editoriale (PDF) Dimensione 253.51 kB Formato Adobe PDF Visualizza/Apri Richiedi una copia	253.51 kB	Adobe PDF	Visualizza/Apri Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/18975

Citazioni

ND

6

ND

social impact