Unifying clustering and representation learning for unsupervised text analysis: A Bayesian knowledge-enhanced approach
Costa G.; Ortale R. (joint first authors)
2025
Abstract
Clustering and representation learning are foundational tasks in natural language processing and text mining, aiming to structure and semantically encode text documents, respectively. While clustering organizes documents into cohesive groups based on similarity, representation learning generates low-dimensional embeddings that capture the nuances of document semantics. This paper presents a novel knowledge-enhanced approach that synergistically integrates these two tasks, improving performance in both areas. Our method employs a latent-factor Bayesian generative model, named MINING (docuMent clusterINg and embeddING), along with a specialized collapsed Gibbs sampling algorithm. We enrich the learned representations by incorporating external knowledge from word and entity embeddings, enhancing their semantic and syntactic richness. Our approach treats clustering and representation learning as interdependent tasks, allowing them to inform and refine one another. Extensive experiments on benchmark datasets demonstrate that our integrated approach outperforms traditional methods that carry out clustering and representation learning as separate tasks.
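The MINING model and its specialized collapsed Gibbs sampler are not specified on this page. Purely as an illustration of the general technique the abstract names, the sketch below implements collapsed Gibbs sampling for a much simpler stand-in: a Dirichlet-multinomial mixture over documents, in which cluster word distributions and mixing weights are integrated out and only the cluster assignments are resampled. The function name, hyperparameters, and toy corpus are all assumptions for illustration, not the paper's actual model.

```python
import numpy as np

def collapsed_gibbs_dmm(docs, V, K, alpha=1.0, beta=0.1, iters=100, seed=0):
    """Collapsed Gibbs sampler for a Dirichlet-multinomial mixture.

    Each document (a list of word ids in [0, V)) gets one of K cluster
    labels; cluster word distributions and mixing weights are integrated
    out, so only the assignment vector z is sampled (the 'collapsed' part).
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    counts = [np.bincount(d, minlength=V) for d in docs]  # per-doc word counts
    z = rng.integers(0, K, size=D)       # random initial assignments
    n_k = np.zeros(K)                    # documents currently in cluster k
    n_kw = np.zeros((K, V))              # word counts accumulated per cluster
    n_k_tot = np.zeros(K)                # total tokens per cluster
    for d in range(D):
        n_k[z[d]] += 1
        n_kw[z[d]] += counts[d]
        n_k_tot[z[d]] += len(docs[d])
    for _ in range(iters):
        for d in range(D):
            c, L = counts[d], len(docs[d])
            k = z[d]
            # remove document d from the sufficient statistics
            n_k[k] -= 1; n_kw[k] -= c; n_k_tot[k] -= L
            # collapsed conditional p(z_d = k | z_-d, docs) in log space;
            # the ratio-of-Gammas terms are expanded as sums of logs,
            # which is exact for integer counts
            logp = np.log(n_k + alpha)
            for j in range(L):
                logp -= np.log(n_k_tot + V * beta + j)
            for w in np.nonzero(c)[0]:
                for j in range(c[w]):
                    logp += np.log(n_kw[:, w] + beta + j)
            p = np.exp(logp - logp.max())
            k = rng.choice(K, p=p / p.sum())
            # add document d back under the sampled cluster
            z[d] = k
            n_k[k] += 1; n_kw[k] += c; n_k_tot[k] += L
    return z

# toy corpus: two groups with disjoint vocabularies over V = 4 word ids
docs = [[0, 0, 0, 0, 1, 1, 1, 1]] * 3 + [[2, 2, 2, 2, 3, 3, 3, 3]] * 3
labels = collapsed_gibbs_dmm(docs, V=4, K=2)
```

On this toy corpus the sampler separates the two vocabulary groups into distinct clusters. The actual MINING model additionally couples such assignments with document embeddings and external word/entity knowledge, which this sketch omits.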
| File | Description | Type | License | Size | Format |
|---|---|---|---|---|---|
| 1-s2.0-S156625352400664X-main.pdf (authorized users only) | Published article (PDF) | Editorial version (PDF) | Other license type | 4.16 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


