Document Clustering and Topic Modeling: A Unified Bayesian Probabilistic Perspective
Gianni Costa; Riccardo Ortale
2019
Abstract
Document clustering and topic modeling are fundamental tasks in text mining that can be unified so that each enhances the other. In this paper, we present a machine learning approach to the joint modeling and interdependent fulfilment of both tasks. In particular, document clustering and topic modeling are interrelated under a Bayesian generative model of clusters, topics and contents in text corpora. The model assumes that text corpora result from a generative process in which clusters and topics act as connected latent factors. Essentially, each cluster is first associated with a descriptive and actionable topic distribution, which enforces cluster coherence. Each document is then assigned to exactly one cluster and worded accordingly. Under the devised model, document clustering and topic modeling can be performed simultaneously and interdependently by Bayesian reasoning. For this purpose, the mathematical details of collapsed Gibbs sampling and parameter estimation are derived and implemented in an approximate inference algorithm. Comparative experiments on standard benchmark text corpora reveal the effectiveness of our approach at jointly clustering text documents and unveiling their semantics in terms of coherent topics.
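The generative process outlined in the abstract (clusters carry topic distributions, each document is assigned to one cluster and then worded) can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the authors' model: the abstract does not specify priors or hyperparameters, so the symmetric Dirichlet priors, the fixed document length, and every name below (`generate_corpus`, `alpha`, `beta`, `cluster_topics`, `topic_words`) are hypothetical choices made for illustration.

```python
import numpy as np


def generate_corpus(n_docs=6, n_clusters=2, n_topics=3, vocab_size=8,
                    doc_len=10, alpha=0.5, beta=0.5, seed=0):
    """Sample a toy corpus from a cluster -> topic -> word process.

    Assumption: all priors are symmetric Dirichlet distributions and
    every document has the same length; the paper may differ.
    """
    rng = np.random.default_rng(seed)

    # Assumed prior over cluster proportions.
    cluster_weights = rng.dirichlet(np.ones(n_clusters))
    # One topic distribution per cluster (shape: n_clusters x n_topics).
    cluster_topics = rng.dirichlet(np.full(n_topics, alpha), size=n_clusters)
    # One word distribution per topic (shape: n_topics x vocab_size).
    topic_words = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)

    assignments, docs = [], []
    for _ in range(n_docs):
        # Each document belongs to exactly one cluster.
        c = int(rng.choice(n_clusters, p=cluster_weights))
        # Each word position draws a topic from that cluster's distribution...
        topics = rng.choice(n_topics, size=doc_len, p=cluster_topics[c])
        # ...and then a word id from the chosen topic's word distribution.
        words = [int(rng.choice(vocab_size, p=topic_words[t])) for t in topics]
        assignments.append(c)
        docs.append(words)
    return assignments, docs, cluster_topics, topic_words


if __name__ == "__main__":
    clusters, corpus, _, _ = generate_corpus()
    for c, doc in zip(clusters, corpus):
        print(f"cluster {c}: word ids {doc}")
```

Inference runs in the opposite direction: given only the word ids, the cluster assignments and topic distributions are treated as latent and estimated, which the abstract states is done with collapsed Gibbs sampling.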
File: ICTAI_2019.pdf (publisher's version, Adobe PDF, 195.19 kB; not public, access restricted to authorized users).
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


