Unifying clustering and representation learning for unsupervised text analysis: A Bayesian knowledge-enhanced approach
Costa G.; Ortale R. (joint first authors)
2025
Abstract
Clustering and representation learning are foundational tasks in natural language processing and text mining, aiming to structure and semantically encode text documents, respectively. While clustering organizes documents into cohesive groups based on similarity, representation learning generates low-dimensional embeddings that capture the nuances of document semantics. This paper presents a novel knowledge-enhanced approach that synergistically integrates these two tasks, improving performance in both areas. Our method employs a latent-factor Bayesian generative model, named MINING (docuMent clusterINg and embeddING), along with a specialized collapsed Gibbs sampling algorithm. We enrich the learned representations by incorporating external knowledge from word and entity embeddings, enhancing their semantic and syntactic richness. Our approach treats clustering and representation learning as interdependent tasks, allowing them to inform and refine one another. Extensive experiments on benchmark datasets demonstrate that our integrated approach outperforms traditional methods that carry out clustering and representation learning as separate tasks.
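The MINING model and its specialized collapsed Gibbs sampler are not specified on this page. Purely as an illustration of the general technique the abstract names, the sketch below implements collapsed Gibbs sampling for a much simpler stand-in: a Dirichlet-multinomial mixture over documents, in which cluster word distributions and mixing weights are integrated out and only the cluster assignments are resampled. The function name, hyperparameters, and toy corpus are all assumptions for illustration, not the paper's actual model.

```python
import numpy as np

def collapsed_gibbs_dmm(docs, V, K, alpha=1.0, beta=0.1, iters=100, seed=0):
    """Collapsed Gibbs sampler for a Dirichlet-multinomial mixture.

    Each document (a list of word ids in [0, V)) gets one of K cluster
    labels; cluster word distributions and mixing weights are integrated
    out, so only the assignment vector z is sampled (the 'collapsed' part).
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    counts = [np.bincount(d, minlength=V) for d in docs]  # per-doc word counts
    z = rng.integers(0, K, size=D)       # random initial assignments
    n_k = np.zeros(K)                    # documents currently in cluster k
    n_kw = np.zeros((K, V))              # word counts accumulated per cluster
    n_k_tot = np.zeros(K)                # total tokens per cluster
    for d in range(D):
        n_k[z[d]] += 1
        n_kw[z[d]] += counts[d]
        n_k_tot[z[d]] += len(docs[d])
    for _ in range(iters):
        for d in range(D):
            c, L = counts[d], len(docs[d])
            k = z[d]
            # remove document d from the sufficient statistics
            n_k[k] -= 1; n_kw[k] -= c; n_k_tot[k] -= L
            # collapsed conditional p(z_d = k | z_-d, docs) in log space;
            # the ratio-of-Gammas terms are expanded as sums of logs,
            # which is exact for integer counts
            logp = np.log(n_k + alpha)
            for j in range(L):
                logp -= np.log(n_k_tot + V * beta + j)
            for w in np.nonzero(c)[0]:
                for j in range(c[w]):
                    logp += np.log(n_kw[:, w] + beta + j)
            p = np.exp(logp - logp.max())
            k = rng.choice(K, p=p / p.sum())
            # add document d back under the sampled cluster
            z[d] = k
            n_k[k] += 1; n_kw[k] += c; n_k_tot[k] += L
    return z

# toy corpus: two groups with disjoint vocabularies over V = 4 word ids
docs = [[0, 0, 0, 0, 1, 1, 1, 1]] * 3 + [[2, 2, 2, 2, 3, 3, 3, 3]] * 3
labels = collapsed_gibbs_dmm(docs, V=4, K=2)
```

On this toy corpus the sampler separates the two vocabulary groups into distinct clusters. The actual MINING model additionally couples such assignments with document embeddings and external word/entity knowledge, which this sketch omits.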
| File | Description | Type | License | Size | Format |
|---|---|---|---|---|---|
| 1-s2.0-S156625352400664X-main.pdf (authorized users only) | Published article (PDF) | Editorial version (PDF) | Other license type | 4.16 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


