Outlier Mining in Large High-Dimensional Data Sets
Clara Pizzuti
2005
Abstract
In this paper, a new definition of distance-based outlier and an algorithm, called {\em HilOut}, designed to efficiently detect the top $n$ outliers of a large, high-dimensional data set are proposed. Given an integer $k$, the weight of a point is defined as the sum of the distances separating it from its $k$ nearest neighbors. Outliers are those points scoring the largest values of weight. The algorithm {\em HilOut} makes use of the notion of a space-filling curve to linearize the data set, and it consists of two phases. The first phase provides an approximate solution, within a rough factor, after the execution of at most $d + 1$ sorts and scans of the data set, with temporal cost quadratic in $d$ and linear in $N$ and in $k$, where $d$ is the number of dimensions of the data set and $N$ is the number of points. During this phase, the algorithm isolates candidate outliers and shrinks this candidate set at each iteration. If the size of this set reaches $n$, then the algorithm stops, reporting the exact solution. The second phase calculates the exact solution with a final scan that further examines the candidate outliers remaining after the first phase. Experimental results show that the algorithm always stops, reporting the exact solution, during the first phase after far fewer than $d + 1$ steps. We present both an in-memory and a disk-based implementation of the {\em HilOut} algorithm, together with a thorough scaling analysis on real and synthetic data sets showing that the algorithm scales well in both cases.
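The weight-based definition above implies a simple brute-force baseline: for every point, sum the distances to its $k$ nearest neighbors, then return the $n$ points with the largest sums. The sketch below is a minimal Python version of that baseline, not the HilOut algorithm itself; the function names `weights` and `top_n_outliers` are illustrative, and Euclidean distance is assumed.

```python
import numpy as np

def weights(X, k):
    """Weight of each point: sum of Euclidean distances to its k nearest
    neighbors (excluding the point itself), per the paper's definition.
    Naive O(N^2 d) baseline, not HilOut."""
    diff = X[:, None, :] - X[None, :, :]        # all pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))    # N x N distance matrix
    np.fill_diagonal(dist, np.inf)              # a point is not its own neighbor
    knn = np.sort(dist, axis=1)[:, :k]          # the k smallest distances per row
    return knn.sum(axis=1)

def top_n_outliers(X, n, k):
    """Indices of the n points with the largest weight."""
    w = weights(X, k)
    return np.argsort(w)[::-1][:n]
```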

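HilOut linearizes the data set with the Hilbert space-filling curve, so that points close along the curve tend to be close in $d$-dimensional space and each point's curve neighbors serve as cheap candidate nearest neighbors for bounding its weight. As a rough illustration of that idea only, the sketch below substitutes the simpler Z-order (Morton) curve for the Hilbert curve; the quantization resolution `bits` and the function names are assumptions for this sketch, not details from the paper.

```python
import numpy as np

def quantize(X, bits=10):
    """Map each coordinate onto the integer grid [0, 2^bits - 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)      # avoid division by zero
    return ((X - lo) / span * (2**bits - 1)).astype(np.int64)

def morton_key(q, bits):
    """Interleave the bits of one quantized point's coordinates,
    most significant bit first, into a single integer key."""
    key = 0
    for b in range(bits - 1, -1, -1):
        for c in q:
            key = (key << 1) | ((int(c) >> b) & 1)
    return key

def linearize(X, bits=10):
    """Order points along a Z-order curve; neighbors in this order are
    inexpensive candidate nearest neighbors in the original space."""
    Q = quantize(X, bits)
    keys = [morton_key(q, bits) for q in Q]
    return np.argsort(keys)
```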

