CNR Institutional Research Information System

A distance-based outlier detection method that finds the top outliers in an unlabeled data set and provides a subset of it, called outlier detection solving set, that can be used to predict the outlierness of new unseen objects, is proposed. The solving set includes a sufficient number of points that permits the detection of the top outliers by considering only a subset of all the pairwise distances from the data set. The properties of the solving set are investigated, and algorithms for computing it, with subquadratic time requirements, are proposed. Experiments on synthetic and real data sets to evaluate the effectiveness of the approach are presented. A scaling analysis of the solving set size is performed, and the false positive rate, that is, the fraction of new objects misclassified as outliers using the solving set instead of the overall data set, is shown to be negligible. Finally, to investigate the accuracy in separating outliers from inliers, ROC analysis of the method is accomplished. Results obtained show that using the solving set instead of the data set guarantees a comparable quality of the prediction, but at a lower computational cost.

Distance-Based Detection and Prediction of Outliers

Angiulli Fabrizio;Basta Stefano;Pizzuti Clara

2006

Abstract

A distance-based outlier detection method that finds the top outliers in an unlabeled data set and provides a subset of it, called outlier detection solving set, that can be used to predict the outlierness of new unseen objects, is proposed. The solving set includes a sufficient number of points that permits the detection of the top outliers by considering only a subset of all the pairwise distances from the data set. The properties of the solving set are investigated, and algorithms for computing it, with subquadratic time requirements, are proposed. Experiments on synthetic and real data sets to evaluate the effectiveness of the approach are presented. A scaling analysis of the solving set size is performed, and the false positive rate, that is, the fraction of new objects misclassified as outliers using the solving set instead of the overall data set, is shown to be negligible. Finally, to investigate the accuracy in separating outliers from inliers, ROC analysis of the method is accomplished. Results obtained show that using the solving set instead of the data set guarantees a comparable quality of the prediction, but at a lower computational cost.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2006
			
	Strutture organizzative
	
				Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
			
	Lingua/e
	
				Inglese
			
	Rivista
	
				IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (PRINT)
			
	Codice Web of Science
	
				WOS:000234653100001
			
	Volume
	
				18
			
	Fascicolo
	
				2
			
	Da pagina
	
				145
			
	A pagina
	
				160
			
	Codice DOI
	
				https://dx.doi.org/10.1109/TKDE.2006.29
			
	Codice Scopus
	
				2-s2.0-33244494218
			
	Referee
	
				Sì, ma tipo non specificato
			
	Parole chiave
	
				Distance-based outliers
outlier detection
outlier prediction
data mining
			
	Numero autori
	
				2
			
	Tipologia
	
				info:eu-repo/semantics/article
			
	Tipologia Login Miur
	
				262
			
	Tutti gli autori
	
						Angiulli Fabrizio; Basta Stefano; Pizzuti Clara
					
	Tipologia
	
				01 Contributo su Rivista::01.01 Articolo in rivista
			
	Fulltext
	
				none
			
	Appare nelle tipologie:
	
				01.01 Articolo in rivista

File in questo prodotto:

Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/126614

Citazioni

ND

250

170

social impact