Topic modeling can be synergically interrelated with document clustering. We present an innovative unsupervised approach to the interrelationship of topic modeling with document clustering. The devised approach exploits Bayesian generative modeling and posterior inference, to seamlessly unify and jointly carry out the two tasks, respectively. Specifically, a Bayesian nonparametric model of text collections, formulates an unprecedented interrelationship of word-embedding topics with a Dirichlet process mixture of cluster components. The latter enables countably infinite clusters and permits the automatic inference of their actual number in a statistically principled manner. All latent clusters and topics under the foresaid model are inferred through collapsed Gibbs sampling and parameter estimation. An extensive empirical study of the presented approach is effected on benchmark real-world corpora of text documents. The experimental results demonstrate its higher effectiveness in partitioning text collections and coherently discovering their semantics, compared to state-of-the-art competitors and tailored baselines. Computational efficiency is also looked into under different conditions, in order to provide an insightful analysis of scalability.

Effective interrelation of Bayesian nonparametric document clustering and embedded-topic modeling

Gianni Costa;Riccardo Ortale
2021

Abstract

Topic modeling can be synergically interrelated with document clustering. We present an innovative unsupervised approach to the interrelationship of topic modeling with document clustering. The devised approach exploits Bayesian generative modeling and posterior inference, to seamlessly unify and jointly carry out the two tasks, respectively. Specifically, a Bayesian nonparametric model of text collections, formulates an unprecedented interrelationship of word-embedding topics with a Dirichlet process mixture of cluster components. The latter enables countably infinite clusters and permits the automatic inference of their actual number in a statistically principled manner. All latent clusters and topics under the foresaid model are inferred through collapsed Gibbs sampling and parameter estimation. An extensive empirical study of the presented approach is effected on benchmark real-world corpora of text documents. The experimental results demonstrate its higher effectiveness in partitioning text collections and coherently discovering their semantics, compared to state-of-the-art competitors and tailored baselines. Computational efficiency is also looked into under different conditions, in order to provide an insightful analysis of scalability.
2021
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
Text analysis
Word embeddi
Topic modeling
Document clustering
Bayesian nonparametrics
Dirichlet process clustering
File in questo prodotto:
File Dimensione Formato  
KBS_2021.pdf

solo utenti autorizzati

Descrizione: Editoriale
Tipologia: Versione Editoriale (PDF)
Licenza: NON PUBBLICO - Accesso privato/ristretto
Dimensione 900.7 kB
Formato Adobe PDF
900.7 kB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/429181
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 8
  • ???jsp.display-item.citation.isi??? 8
social impact