Scalability issues for self similarity join in distributed systems

Gennaro C; Rabitti F
2010

Abstract

Efficient processing of similarity joins is important for a large class of data analysis and data-mining applications. This primitive finds all pairs of records that lie within a predefined distance threshold of each other. However, most existing approaches are based on spatial join techniques designed primarily for data in a vector space. Treating data collections as metric objects brings a great advantage in generality, because a single metric technique can be applied to many specific search problems that are quite different in nature. In this paper, we concentrate on a special form of join, the Self Similarity Join, which retrieves pairs from the same dataset. In particular, we consider the case in which the dataset is split into subsets, each of which is searched for self similarity joins independently (e.g., in a distributed computing environment). To this end, we formalize the abstract concept of epsilon-Cover, prove its correctness, and demonstrate its effectiveness by applying it to two real implementations on a large real-life dataset.
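
As a rough illustration of the primitive described in the abstract, the following Python sketch runs a naive self similarity join over a small dataset and then repeats it on two disjoint subsets searched independently. It is not the paper's epsilon-Cover construction, which the abstract does not detail. The function names, the toy data, the absolute-difference metric, and the choice of an inclusive threshold (distance <= eps) are all assumptions made for this example.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def self_similarity_join(
    data: Sequence[T], dist: Callable[[T, T], float], eps: float
) -> List[Tuple[T, T]]:
    """Naive O(n^2) self join: every unordered pair with dist <= eps.

    Only a distance function is required, so this works for any metric
    space, not just vector data. The inclusive threshold is an assumption.
    """
    return [(a, b) for a, b in combinations(data, 2) if dist(a, b) <= eps]


def abs_diff(a: int, b: int) -> int:
    """Toy metric on integers (hypothetical, for illustration only)."""
    return abs(a - b)


data = [1, 2, 5, 6, 10]
eps = 1

# Join over the whole dataset.
full_result = self_similarity_join(data, abs_diff, eps)

# Same join, but on two disjoint subsets searched independently.
subsets = [[1, 2, 5], [6, 10]]
partitioned_result = [
    pair for s in subsets for pair in self_similarity_join(s, abs_diff, eps)
]

# Pairs that straddle the split are lost by the independent searches.
missed = [p for p in full_result if p not in partitioned_result]

print(full_result)
print(partitioned_result)
print(missed)
```

In this toy case the pair (5, 6) straddles the two subsets, so searching each subset on its own does not find it. The sketch only shows why a plain disjoint split is not enough when subsets are joined independently, which is the setting the abstract addresses.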
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
English
The 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)
pp. 309-316
ISBN 978-0-7695-3939-3
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=5452451&contentType=Conference+Publications&searchField%3DSearch_All%26queryText%3DScalability+issues+for+self+similarity+join+in+distributed+systems
Yes, but type not specified
17-19 February 2010
Pisa
Information Search and Retrieval
Metric Space
Similarity self join
Scalability
reserved
info:eu-repo/semantics/conferenceObject
04 Conference contribution::04.01 Contribution in conference proceedings
Files in this record:
prod_92131-doc_18820.pdf (not available)
Description: Euromicro 2010 Rabitti
Type: Publisher's version (PDF)
Size: 388.25 kB
Format: Adobe PDF
prod_92131-doc_36541.pdf (not available)
Description: published article
Type: Publisher's version (PDF)
Size: 467.73 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/63133