CNR Institutional Research Information System

Building thematic lexical resources by term categorization

Lavelli A; Magnini B; Sebastiani F

2002

Abstract

We discuss work in progress in the semi-automatic generation of \emph{thematic lexicons} by means of \emph{term categorization}, a novel task employing techniques from information retrieval (IR) and machine learning (ML). Specifically, we view the generation of such lexicons as an iterative process of learning previously unknown associations between terms and \emph{themes} (i.e. disciplines, or fields of activity). The process is iterative, in that it generates, for each $c_{i}$ in a set $C=\{c_{1},\ldots,c_{m}\}$ of themes, a sequence $L^{i}_{0}\subseteq L^{i}_{1}\subseteq \ldots \subseteq L^{i}_{n}$ of lexicons, bootstrapping from an initial lexicon $L^{i}_{0}$ and a set of text corpora $\Theta=\{\theta_{0},\ldots,\theta_{n-1}\}$ given as input. The method is inspired by \emph{text categorization}, the discipline concerned with labelling natural language texts with labels from a predefined set of themes, or categories. However, while text categorization deals with documents represented as vectors in a space of terms, we formulate the task of term categorization as one in which terms are (dually) represented as vectors in a space of documents, and in which terms (instead of documents) are labelled with themes. As a learning device, we adopt \emph{boosting}, since (a) it has demonstrated state-of-the-art effectiveness in a variety of text categorization applications, and (b) it naturally allows for a form of ``data cleaning'', thereby making the process of generating a thematic lexicon an iteration of generate-and-test steps.
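
As a rough illustration of the dual representation and of the iterative expansion $L^{i}_{0}\subseteq L^{i}_{1}\subseteq \ldots \subseteq L^{i}_{n}$ described in the abstract, the following Python sketch may help. It is not the authors' implementation: the learner is a simple centroid-plus-cosine-similarity rule standing in for boosting, the generate-and-test ("data cleaning") step is not modelled, and the function names, the threshold of 0.8, and the toy corpora are all hypothetical choices made for this example.

```python
from collections import Counter
from math import sqrt


def term_vectors(corpus):
    """Dual representation: each term becomes a sparse vector indexed by document."""
    vectors = {}
    for doc_id, doc in enumerate(corpus):
        for term, count in Counter(doc.lower().split()).items():
            vectors.setdefault(term, {})[doc_id] = count
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(d, 0) for d, w in u.items())
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise sum of sparse vectors (scale does not affect cosine)."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return dict(total)


def expand_lexicon(lexicon, corpus, threshold=0.8):
    """One step L_k -> L_{k+1}: add terms that co-occur like the known lexicon terms.

    Assumption: a centroid/cosine rule replaces the boosting learner of the paper.
    """
    vecs = term_vectors(corpus)
    seeds = [vecs[t] for t in lexicon if t in vecs]
    if not seeds:
        return set(lexicon)  # nothing to learn from in this corpus
    c = centroid(seeds)
    accepted = {t for t, v in vecs.items() if cosine(v, c) >= threshold}
    return set(lexicon) | accepted  # lexicons only grow, so L_k is a subset of L_{k+1}


def build_lexicons(seed, corpora, threshold=0.8):
    """Produce L_0, L_1, ..., L_n from a seed lexicon and corpora theta_0..theta_{n-1}."""
    lexicons = [set(seed)]
    for corpus in corpora:
        lexicons.append(expand_lexicon(lexicons[-1], corpus, threshold))
    return lexicons


# Toy data for a single hypothetical theme ("medicine"); each corpus is a list of documents.
theta_0 = ["doctor nurse hospital", "doctor nurse patient", "goal striker match"]
theta_1 = ["nurse surgery clinic", "nurse surgery ward", "striker penalty referee"]

lexicons = build_lexicons({"doctor"}, [theta_0, theta_1])
for k, lexicon in enumerate(lexicons):
    print(f"L_{k}:", sorted(lexicon))
```

In this toy run a term is added only if it occurs in the same documents as the terms already in the lexicon, so each corpus contributes one new term. A real system would train one classifier per theme $c_{i}$ and would validate the candidate terms before accepting them.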

Year: 2002
Organizational units: Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Keywords: Machine learning; Text categorization; Term categorization; Boosting; Lexicon generation
Appears in types: 08.04 Technical report

Files in this item:

File	Description	Access	Size	Format
prod_160561-doc_128154.pdf	Building Thematic Lexical Resources by Term Categorization	Open access	168.18 kB	Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/149580
