This paper proposes an algorithm, named HWK-Sets, based on K-Means, suited for clustering data which are variable-sized sets of elementary items. Clustering sets is difficult because data objects do not have numerical attributes and it is not possible to use the classical Euclidean distance upon which K-Means is normally based. An adaptation of the Jaccard distance between sets is used, which exploits application-sensitive information. More in particular, the Hartigan and Wong variation of K-Means is adopted which uses medoids as cluster representatives, can work with several seeding methods and can favor the fast attainment of a careful solution. The paper introduces HWK-Sets which is implemented in Java by parallel streams. Then, the efficiency and accuracy of HWK-Sets are demonstrated by simulation experiments.

An Efficient Algorithm for Clustering Sets

Franco Cicirelli
2023

Abstract

This paper proposes an algorithm, named HWK-Sets, based on K-Means, suited for clustering data which are variable-sized sets of elementary items. Clustering sets is difficult because data objects do not have numerical attributes and it is not possible to use the classical Euclidean distance upon which K-Means is normally based. An adaptation of the Jaccard distance between sets is used, which exploits application-sensitive information. More in particular, the Hartigan and Wong variation of K-Means is adopted which uses medoids as cluster representatives, can work with several seeding methods and can favor the fast attainment of a careful solution. The paper introduces HWK-Sets which is implemented in Java by parallel streams. Then, the efficiency and accuracy of HWK-Sets are demonstrated by simulation experiments.
2023
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
Clustering sets
Hartigan & Wong K-Means
Jaccard distance
Medoids
Seeding methods
benchmark datasets.
File in questo prodotto:
File Dimensione Formato  
prod_486109-doc_201615.pdf

solo utenti autorizzati

Descrizione: An Efficient Algorithm for Clustering Sets
Tipologia: Documento in Pre-print
Licenza: Nessuna licenza dichiarata (non attribuibile a prodotti successivi al 2023)
Dimensione 499.18 kB
Formato Adobe PDF
499.18 kB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/463518
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? ND
social impact