K-means is a widespread clustering algorithm characterized by its simplicity and efficiency. K-means behavior, though, strongly depends on the initialization of the cluster centers (centroids) and tends to be stuck in a local suboptimal solution. Many techniques have been devised to overcome these problems, e.g., by a global strategy to reduce locality of centroid adjustments or by using density peaks for centroids initialization. This paper proposes an improved version of the K-means—DPKM-based on the concepts of density peaks. Density peaks have been proved to be a key for solving clustering problems where not-spherical regions with complex point distributions are involved. Centroids are actually selected from density peaks by using a technique borrowed from the DK-means++ initialization method, which ensures centroids are not only points with higher density, but also far-away from each other. DPKM is implemented in Java using parallel streams and lambda expressions which are capable of delivering good execution times on large datasets on multicore machines with shared memory. The efficiency and reliability of DPKM are demonstrated by applying it to challenging synthetic datasets often used as benchmarks for clustering methods.

Fast and Accurate K-means Clustering Based on Density Peaks

Cicirelli F.
2023

Abstract

K-means is a widespread clustering algorithm characterized by its simplicity and efficiency. K-means behavior, though, strongly depends on the initialization of the cluster centers (centroids) and tends to be stuck in a local suboptimal solution. Many techniques have been devised to overcome these problems, e.g., by a global strategy to reduce locality of centroid adjustments or by using density peaks for centroids initialization. This paper proposes an improved version of the K-means—DPKM-based on the concepts of density peaks. Density peaks have been proved to be a key for solving clustering problems where not-spherical regions with complex point distributions are involved. Centroids are actually selected from density peaks by using a technique borrowed from the DK-means++ initialization method, which ensures centroids are not only points with higher density, but also far-away from each other. DPKM is implemented in Java using parallel streams and lambda expressions which are capable of delivering good execution times on large datasets on multicore machines with shared memory. The efficiency and reliability of DPKM are demonstrated by applying it to challenging synthetic datasets often used as benchmarks for clustering methods.
2023
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
9789819932498
9789819932504
Centroids initialization
Clustering
Density peaks
DK-means++
Java
K-means
Lambda expressions
Multicore machines
Parallel streams
File in questo prodotto:
File Dimensione Formato  
978-981-99-3250-4_59.pdf

solo utenti autorizzati

Tipologia: Versione Editoriale (PDF)
Licenza: NON PUBBLICO - Accesso privato/ristretto
Dimensione 474.55 kB
Formato Adobe PDF
474.55 kB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/559747
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? ND
social impact