Clustering items using textual features is an important problem with many applications, such as root-cause analysis of spam campaigns, as well as identifying common topics in social media. Due to the sheer size of such data, algorithmic scalability becomes a major concern. In this work, we present our approach for text clustering that builds an approximate kNN graph, which is then used to compute connected components representing clusters. Our focus is to understand the scalability / accuracy tradeoff that underlies our method: we do so through an extensive experimental campaign, where we use real-life datasets, and show that even rough approximations of k-NN graphs are sufficient to identify valid clusters. Our method is scalable and can be easily tuned to meet requirements stemming from different application domains.

Scalable k-NN based text clustering

2015

Abstract

Clustering items using textual features is an important problem with many applications, such as root-cause analysis of spam campaigns, as well as identifying common topics in social media. Due to the sheer size of such data, algorithmic scalability becomes a major concern. In this work, we present our approach for text clustering that builds an approximate kNN graph, which is then used to compute connected components representing clusters. Our focus is to understand the scalability / accuracy tradeoff that underlies our method: we do so through an extensive experimental campaign, where we use real-life datasets, and show that even rough approximations of k-NN graphs are sufficient to identify valid clusters. Our method is scalable and can be easily tuned to meet requirements stemming from different application domains.
2015
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
978-1-4799-9925-5
Clustering Algorithm
Distributed architectures
File in questo prodotto:
File Dimensione Formato  
prod_347340-doc_109258.pdf

solo utenti autorizzati

Descrizione: Scalable k-NN based text clustering
Tipologia: Versione Editoriale (PDF)
Dimensione 209.67 kB
Formato Adobe PDF
209.67 kB Adobe PDF   Visualizza/Apri   Richiedi una copia
prod_347340-doc_156947.pdf

accesso aperto

Descrizione: Scalable k-NN based text clustering
Tipologia: Versione Editoriale (PDF)
Dimensione 483.68 kB
Formato Adobe PDF
483.68 kB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/315394
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 14
  • ???jsp.display-item.citation.isi??? 10
social impact