CNR Institutional Research Information System

Document clustering and topic modeling are fundamental tasks in text mining, that can be unified to reciprocally enhance each other. In this paper, we present a machine learning approach to the joint modeling and interdependent fulfilment of both tasks. In particular, document clustering and topic modeling are seamlessly interrelated under an innovative Bayesian generative model of clusters, topics and contents in text corpora. Such a model assumes that text corpora result from a generative process, in which clusters and topics act as connected latent factors. Essentially, clusters are initially associated with descriptive and actionable topic distributions, that enforce cluster coherence. The individual documents are then assigned to one respective cluster and worded accordingly. Under the devised model, document clustering and topic modeling can be simultaneously performed in an interdependent manner simply by Bayesian reasoning. For this purpose, the mathematical details regarding collapsed Gibbs sampling as well as parameter estimation are derived and implemented into an approximate inference algorithm. Comparative experiments on standard benchmark text corpora reveal the effectiveness of our approach at jointly clustering text documents and unveiling their semantics in terms of coherent topics.

Document Clustering and Topic Modeling: A Unified Bayesian Probabilistic Perspective

Gianni Costa;Riccardo Ortale

2019

Abstract

Document clustering and topic modeling are fundamental tasks in text mining, that can be unified to reciprocally enhance each other. In this paper, we present a machine learning approach to the joint modeling and interdependent fulfilment of both tasks. In particular, document clustering and topic modeling are seamlessly interrelated under an innovative Bayesian generative model of clusters, topics and contents in text corpora. Such a model assumes that text corpora result from a generative process, in which clusters and topics act as connected latent factors. Essentially, clusters are initially associated with descriptive and actionable topic distributions, that enforce cluster coherence. The individual documents are then assigned to one respective cluster and worded accordingly. Under the devised model, document clustering and topic modeling can be simultaneously performed in an interdependent manner simply by Bayesian reasoning. For this purpose, the mathematical details regarding collapsed Gibbs sampling as well as parameter estimation are derived and implemented into an approximate inference algorithm. Comparative experiments on standard benchmark text corpora reveal the effectiveness of our approach at jointly clustering text documents and unveiling their semantics in terms of coherent topics.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2019
			
	Strutture organizzative
	
				Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
			
	Parole chiave
	
				Document Clustering
Topic Modeling
Bayesian Text Analysis
			
	Appare nelle tipologie:
	
				04.01 Contributo in Atti di convegno

File in questo prodotto:

File	Dimensione	Formato
ICTAI_2019.pdf solo utenti autorizzati Tipologia: Versione Editoriale (PDF) Licenza: NON PUBBLICO - Accesso privato/ristretto Dimensione 195.19 kB Formato Adobe PDF Visualizza/Apri Richiedi una copia	195.19 kB	Adobe PDF	Visualizza/Apri Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/371444

Citazioni

ND

5

5

social impact