Approximate k-Closest-Pairs  in Large High-Dimensional Data Sets

Angiulli, Fabrizio; Pizzuti, Clara

doi:10.1007/s10852-004-4080-3

An approximate algorithm to efficiently solve the {\it k-Closest-Pairs} problem on large high-dimensional data sets is presented. The algorithm runs, for a suitable choice of the input parameters, in ${\cal O}(d^2 n k)$ time, where $d$ is the dimensionality and $n$ is the number of points of the input data set, and requires linear space in the input size. It performs at most $d+1$ iterations. At each iteration a shifted version of the data set is sequentially scanned according to the order induced on it by the Hilbert space filling curve and points whose contribution to the solution has already been analyzed are detected and eliminated. The pruning is lossless, in fact the remaining points along with the approximate solution found can be used for the computation of the exact solution. If the data set is entirely pruned, then the algorithm returns the exact solution. We prove that the pruning ability of the algorithm is related to the nearest neighbor distance distribution of the data set and show that there exists a class of data sets for which the method, augmented with a final step that applies an exact method to the reduced data set, calculates the exact solution with the same time requirements. Although we are able to guarantee a ${\cal O}(d^{1+ \frac{1}{t}})$ approximation to the solution, where $t \in \{ 1,2,\ldots,\infty \}$ identifies the Minkowski ($L_t$) metric of interest, experimental results give the exact $k$ closest pairs for all the large high-dimensional synthetic and real data sets considered and show that the pruning of the search space is effective. % We present a thorough scaling analysis of the algorithm for in-memory and disk-resident data sets showing that the algorithm scales well in both cases.

Approximate k-Closest-Pairs in Large High-Dimensional Data Sets

Fabrizio Angiulli;Clara Pizzuti

2005

Abstract

An approximate algorithm to efficiently solve the {\it k-Closest-Pairs} problem on large high-dimensional data sets is presented. The algorithm runs, for a suitable choice of the input parameters, in ${\cal O}(d^2 n k)$ time, where $d$ is the dimensionality and $n$ is the number of points of the input data set, and requires linear space in the input size. It performs at most $d+1$ iterations. At each iteration a shifted version of the data set is sequentially scanned according to the order induced on it by the Hilbert space filling curve and points whose contribution to the solution has already been analyzed are detected and eliminated. The pruning is lossless, in fact the remaining points along with the approximate solution found can be used for the computation of the exact solution. If the data set is entirely pruned, then the algorithm returns the exact solution. We prove that the pruning ability of the algorithm is related to the nearest neighbor distance distribution of the data set and show that there exists a class of data sets for which the method, augmented with a final step that applies an exact method to the reduced data set, calculates the exact solution with the same time requirements. Although we are able to guarantee a ${\cal O}(d^{1+ \frac{1}{t}})$ approximation to the solution, where $t \in \{ 1,2,\ldots,\infty \}$ identifies the Minkowski ($L_t$) metric of interest, experimental results give the exact $k$ closest pairs for all the large high-dimensional synthetic and real data sets considered and show that the pruning of the search space is effective. % We present a thorough scaling analysis of the algorithm for in-memory and disk-resident data sets showing that the algorithm scales well in both cases.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2005
			
	Strutture organizzative
	
				Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
			
	Parole chiave
	
				$k$-Closest-Pairs problem
Space Filling Curves
approximate algorithms
			
	Appare nelle tipologie:
	
				01.01 Articolo in rivista

File in questo prodotto:

Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/13390

Citazioni

ND

ND

ND

CNR Institutional Research Information System

Approximate k-Closest-Pairs in Large High-Dimensional Data Sets

Fabrizio Angiulli;Clara Pizzuti

2005

Abstract

Scheda breve

Scheda completa

Scheda completa (DC)

Citazioni

social impact

CNR Institutional Research Information System

Approximate k-Closest-Pairs in Large High-Dimensional Data Sets

Fabrizio Angiulli;Clara Pizzuti

2005

Abstract

Scheda breve Scheda completa Scheda completa (DC)

Informazioni

Citazioni

social impact

Conferma cancellazione

Scheda breve

Scheda completa

Scheda completa (DC)