Given a set of records, a threshold value t and a similarity function, we investigate the problem of finding all pairs of records such that similarity between each pair is above t. We propose several optimizations on the existing approaches to solve the problem. Our algorithm outperforms the state-of-the-art algorithms in the case with large and high-dimensional datasets. The speedup we achieved varied from 30% to 4-x depending on the similarity threshold and the dataset properties.

An incremental prefix filtering approach for the all pairs similarity search problem

Perego R;Silvestri F
2010

Abstract

Given a set of records, a threshold value t and a similarity function, we investigate the problem of finding all pairs of records such that similarity between each pair is above t. We propose several optimizations on the existing approaches to solve the problem. Our algorithm outperforms the state-of-the-art algorithms in the case with large and high-dimensional datasets. The speedup we achieved varied from 30% to 4-x depending on the similarity threshold and the dataset properties.
2010
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
978-0-7695-4012-2
Database Application. Data Mining
All pair similarity search
Optimization
Prefix filtering
File in questo prodotto:
File Dimensione Formato  
prod_92122-doc_131201.pdf

solo utenti autorizzati

Descrizione: An incremental prefix filtering approach for the all pairs similarity search problem
Tipologia: Versione Editoriale (PDF)
Dimensione 588.06 kB
Formato Adobe PDF
588.06 kB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/63124
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact