CNR Institutional Research Information System

Effective Incremental Clustering for Duplicate Detection in Large Databases

Francesco Folino; Giuseppe Manco; Luigi Pontieri

2006

Abstract

We propose an incremental algorithm for discovering clusters of duplicate tuples in large databases. The core of the approach is an indexing technique which, for any newly arrived tuple mu, efficiently retrieves the set of tuples in the database that are most similar to mu and are therefore likely to refer to the same real-world entity as mu. The proposed index is based on a hashing approach which tends to assign similar objects to the same buckets. Empirical and analytical evaluation shows that the proposed approach achieves satisfactory efficiency at the cost of a small loss in accuracy.
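
The abstract does not say which hashing scheme the index uses, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes MinHash with banding as the similarity-preserving hash, character 3-grams as the tuple representation, and Jaccard similarity with a threshold of 0.5 as the duplicate test. All names (`qgrams`, `minhash`, `IncrementalDeduplicator`) and all parameter values are hypothetical.

```python
import hashlib
from collections import defaultdict


def qgrams(text, q=3):
    """Normalize case and whitespace, then return the set of character q-grams."""
    s = " ".join(text.lower().split())
    return {s[i:i + q] for i in range(max(1, len(s) - q + 1))}


def jaccard(a, b):
    """Exact Jaccard similarity between two non-empty token sets."""
    return len(a & b) / len(a | b)


def minhash(token_set, num_hashes):
    """MinHash signature: the minimum of each seeded hash over the tokens."""
    def h(seed, tok):
        digest = hashlib.blake2b(f"{seed}:{tok}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")
    return [min(h(seed, t) for t in token_set) for seed in range(num_hashes)]


class IncrementalDeduplicator:
    """Hypothetical helper: assigns each arriving tuple to a duplicate cluster."""

    def __init__(self, num_hashes=24, bands=12, threshold=0.5):
        self.num_hashes = num_hashes
        self.bands = bands
        self.rows = num_hashes // bands    # signature rows per band
        self.threshold = threshold         # assumed duplicate threshold
        self.buckets = defaultdict(list)   # (band, band key) -> tuple ids
        self.token_sets = []               # token set of each stored tuple
        self.cluster_of = []               # cluster id of each stored tuple
        self.next_cluster = 0

    def add(self, text):
        """Index a newly arrived tuple and return its cluster id."""
        toks = qgrams(text)
        sig = minhash(toks, self.num_hashes)
        keys = [(b, tuple(sig[b * self.rows:(b + 1) * self.rows]))
                for b in range(self.bands)]

        # Only tuples sharing at least one bucket are compared exactly.
        candidates = {i for k in keys for i in self.buckets.get(k, [])}
        best, best_sim = None, self.threshold
        for i in sorted(candidates):
            sim = jaccard(toks, self.token_sets[i])
            if sim >= best_sim:
                best, best_sim = i, sim

        if best is not None:
            cluster = self.cluster_of[best]   # join the closest neighbour
        else:
            cluster = self.next_cluster       # start a new cluster
            self.next_cluster += 1

        new_id = len(self.token_sets)
        self.token_sets.append(toks)
        self.cluster_of.append(cluster)
        for k in keys:
            self.buckets[k].append(new_id)
        return cluster
```

Calling `add` once per arriving tuple returns the cluster the tuple was assigned to. Because candidates come only from shared buckets, a true duplicate that collides in no band is missed, which is one plausible reading of the efficiency versus accuracy trade-off the abstract mentions.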

Year: 2006
Organizational units: Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
ISBN: 0-7695-2577-6
Appears in types: 04.01 Contribution in conference proceedings

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/70001
