A centroid-based clustering algorithm is proposed that works in a totally unsupervised fashion and is significantly faster and more accurate than existing algorithms. The algorithm, named CLUBS++ (for CLustering Using Binary Splitting), achieves these results by combining features of hierarchical and partition-based algorithms. Thus, CLUBS++ consists of two major phases, i.e., a divisive phase and an agglomerative phase, each followed by a refinement phase. Each major phase consists of successive steps in which the samples are repartitioned using a criterion based on least quadratic distance. This criterion possesses unique analytical properties that are elucidated in the paper and exploited by the algorithm to achieve a very fast computation. The paper presents the results of the extensive experiments performed: these confirm that the new algorithm is fast, impervious to noise, and produces results of better quality than other algorithms, such as BOOL, BIRCH, and k-means++, even when the analyst can determine the correct number of clusters--a very difficult task from which users are spared by CLUBS++.

A fast and accurate algorithm for unsupervised clustering around centroids

Elio Masciari;
2017

Abstract

A centroid-based clustering algorithm is proposed that works in a totally unsupervised fashion and is significantly faster and more accurate than existing algorithms. The algorithm, named CLUBS++ (for CLustering Using Binary Splitting), achieves these results by combining features of hierarchical and partition-based algorithms. Thus, CLUBS++ consists of two major phases, i.e., a divisive phase and an agglomerative phase, each followed by a refinement phase. Each major phase consists of successive steps in which the samples are repartitioned using a criterion based on least quadratic distance. This criterion possesses unique analytical properties that are elucidated in the paper and exploited by the algorithm to achieve a very fast computation. The paper presents the results of the extensive experiments performed: these confirm that the new algorithm is fast, impervious to noise, and produces results of better quality than other algorithms, such as BOOL, BIRCH, and k-means++, even when the analyst can determine the correct number of clusters--a very difficult task from which users are spared by CLUBS++.
2017
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
Clu
Experimental assessment
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/327607
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact