
Unifying clustering and representation learning for unsupervised text analysis: A Bayesian knowledge-enhanced approach

Costa G. (co-first author); Ortale R. (co-first author)
2025

Abstract

Clustering and representation learning are foundational tasks in natural language processing and text mining, aiming to structure and semantically encode text documents, respectively. While clustering organizes documents into cohesive groups based on similarity, representation learning generates low-dimensional embeddings that capture the nuances of document semantics. This paper presents a novel knowledge-enhanced approach that synergistically integrates these two tasks, improving performance in both areas. Our method employs a latent-factor Bayesian generative model, named MINING (docuMent clusterINg and embeddING), along with a specialized collapsed Gibbs sampling algorithm. We enrich the learned representations by incorporating external knowledge from word and entity embeddings, enhancing their semantic and syntactic richness. Our approach treats clustering and representation learning as interdependent tasks, allowing them to inform and refine one another. Extensive experiments on benchmark datasets demonstrate that our integrated approach outperforms traditional methods that carry out clustering and representation learning as separate tasks.
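The abstract names collapsed Gibbs sampling as the inference algorithm for a latent-factor Bayesian generative model. As a minimal sketch of that general technique — not the MINING model itself, whose knowledge-enhanced factors are not specified here — the following illustrates collapsed Gibbs sampling for a simple Dirichlet-multinomial mixture over documents, where cluster proportions and cluster-word distributions are integrated out and only the per-document cluster assignments are sampled. The function name, hyperparameters, and toy corpus are all illustrative assumptions.

```python
import numpy as np

def collapsed_gibbs_dmm(docs, vocab_size, K=2, alpha=0.1, beta=0.1,
                        iters=50, seed=0):
    """Collapsed Gibbs sampler for a Dirichlet-multinomial mixture.

    docs: list of documents, each a list of integer word ids.
    Returns the sampled cluster assignment for each document.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    z = rng.integers(K, size=D)        # initial cluster assignments
    m = np.zeros(K)                    # documents per cluster
    n_kw = np.zeros((K, vocab_size))   # word counts per cluster
    n_k = np.zeros(K)                  # total words per cluster
    for d, doc in enumerate(docs):
        m[z[d]] += 1
        for w in doc:
            n_kw[z[d], w] += 1
            n_k[z[d]] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            # Remove document d's counts from its current cluster.
            k = z[d]
            m[k] -= 1
            for w in doc:
                n_kw[k, w] -= 1
                n_k[k] -= 1
            # Conditional posterior over clusters, with the cluster
            # proportions and word distributions integrated out
            # (computed in the log domain for numerical stability).
            log_p = np.log(m + alpha)
            for k2 in range(K):
                seen = {}
                for i, w in enumerate(doc):
                    log_p[k2] += np.log(n_kw[k2, w] + beta + seen.get(w, 0))
                    log_p[k2] -= np.log(n_k[k2] + vocab_size * beta + i)
                    seen[w] = seen.get(w, 0) + 1
            p = np.exp(log_p - log_p.max())
            # Resample document d's cluster and restore its counts.
            k = rng.choice(K, p=p / p.sum())
            z[d] = k
            m[k] += 1
            for w in doc:
                n_kw[k, w] += 1
                n_k[k] += 1
    return z
```

On a toy corpus with two disjoint vocabularies, e.g. `collapsed_gibbs_dmm([[0, 1, 0, 1], [1, 0, 1], [2, 3, 2, 3], [3, 2, 3]], vocab_size=4, K=2)`, the sampler typically separates the two themes into distinct clusters within a few sweeps.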
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
Text analysis
Text clustering
Text representation learning
Knowledge graph (embeddings)
Word embeddings
Files in this product:
File: 1-s2.0-S156625352400664X-main.pdf (authorized users only)
Description: Published article in PDF version
Type: Publisher's version (PDF)
License: Other license type
Size: 4.16 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/559847
Citations
  • PMC: not available
  • Scopus: 2
  • Web of Science: 3