Document clustering meets topic modeling with word embeddings

Costa Gianni;Ortale Riccardo
2020

Abstract

We propose a new statistical-learning approach to marrying topic modeling and document clustering. In particular, a Bayesian generative model of text collections is developed, in which the two aforementioned tasks are incorporated as coupled latent factors that govern document wording. The latter consists of word embeddings, so as to capture the semantic and syntactic regularities among words. Collapsed Gibbs sampling is derived mathematically and implemented algorithmically, along with parameter estimation, with the aim of jointly performing topic modeling and document clustering through Bayesian reasoning. Comparative tests on real-world benchmark corpora reveal the effectiveness of the devised approach in clustering collections of text documents and coherently recovering their semantics.
2020
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
English
Proceedings of the 2020 SIAM International Conference on Data Mining
SIAM International Conference on Data Mining (SDM)
244
252
9
9781611976236
http://www.scopus.com/record/display.url?eid=2-s2.0-85085732531&origin=inward
Yes, but type not specified
07-09/05/2020
Bayesian Text Analysis
Document Clustering
Topic Modeling
Word Embeddings
2
open
Costa, Giovanni; Ortale, Riccardo
273
info:eu-repo/semantics/conferenceObject
04 Conference contribution::04.01 Contribution in conference proceedings
Files in this item:
File: 1.9781611976236.28.pdf
Access: open access
License: Other type of license
Size: 508.33 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/381065
Citations
  • PubMed Central: N/A
  • Scopus: 14
  • Web of Science: 10