CNR Institutional Research Information System

Topic modeling can be unified synergically with document clustering. In this manuscript, we propose two innovative unsupervised approaches for the combined modeling and interrelated accomplishment of the two tasks. Both approaches rely on respective Bayesian generative models of topics, contents and clusters in textual corpora. Such models treat topics and clusters as linked latent factors in document wording. In particular, under the generative model of the second approach, textual documents are characterized by topic distributions, that are allowed to vary around the topic distributions of their membership clusters. Within the devised models, algorithms are designed to implement Rao-Blackwellized Gibbs sampling together with parameter estimation. These are derived mathematically for carrying out topic modeling with document clustering in a simultaneous and interrelated manner. A comparative empirical evaluation demonstrates the effectiveness of the presented approaches, over different families of state-of-the-art competitors, in clustering real-world benchmark text collections and, also, uncovering their underlying semantics. Besides, a case study is developed as an insightful qualitative analysis of results on real-world text corpora.

Hierarchical Bayesian text modeling for the unsupervised joint analysis of latent topics and semantic clusters

Gianni Costa;Riccardo Ortale

2022

Abstract

Topic modeling can be unified synergically with document clustering. In this manuscript, we propose two innovative unsupervised approaches for the combined modeling and interrelated accomplishment of the two tasks. Both approaches rely on respective Bayesian generative models of topics, contents and clusters in textual corpora. Such models treat topics and clusters as linked latent factors in document wording. In particular, under the generative model of the second approach, textual documents are characterized by topic distributions, that are allowed to vary around the topic distributions of their membership clusters. Within the devised models, algorithms are designed to implement Rao-Blackwellized Gibbs sampling together with parameter estimation. These are derived mathematically for carrying out topic modeling with document clustering in a simultaneous and interrelated manner. A comparative empirical evaluation demonstrates the effectiveness of the presented approaches, over different families of state-of-the-art competitors, in clustering real-world benchmark text collections and, also, uncovering their underlying semantics. Besides, a case study is developed as an insightful qualitative analysis of results on real-world text corpora.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2022
			
	Strutture organizzative
	
				Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
			
	Parole chiave
	
				Bayesian text analysis
Topic modeling
Document clustering
Hierarchical priors
			
	Appare nelle tipologie:
	
				01.01 Articolo in rivista

File in questo prodotto:

File	Dimensione	Formato
IJAR_2022.pdf solo utenti autorizzati Descrizione: Editoriale Tipologia: Versione Editoriale (PDF) Licenza: NON PUBBLICO - Accesso privato/ristretto Dimensione 660.74 kB Formato Adobe PDF Visualizza/Apri Richiedi una copia	660.74 kB	Adobe PDF	Visualizza/Apri Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/417075

Citazioni

ND

5

2

social impact