CNR Institutional Research Information System

Automatic Expansion of Domain-Specific Lexicons by Term Categorization

Avancini H; Lavelli A; Sebastiani F; Zanoli R

2004

Abstract

We discuss an approach to the automatic expansion of domain-specific lexicons by means of term categorization, a novel task employing techniques from information retrieval and machine learning. Specifically, we view the expansion of such lexicons as a process of learning previously unknown associations between terms and domains (i.e., disciplines, or fields of activity). The process generates, for each c_i in a set C = {c_1, ..., c_m} of domains, a lexicon L^i_1, bootstrapping from an initial lexicon L^i_0 and a set of documents T given as input. The method is inspired by text categorization, the discipline concerned with labeling natural language texts with labels from a predefined set of domains, or categories. However, while text categorization deals with documents represented as vectors in a space of terms, we formulate the task of term categorization as one in which terms are (dually) represented as vectors in a space of documents, and in which terms (instead of documents) are labeled with domains. As a learning device we adopt a boosting-based method, since boosting (a) has demonstrated state-of-the-art effectiveness in a variety of text categorization applications, and (b) naturally allows for a form of 'data cleaning', thereby making the process of generating a lexicon an iteration of generate-and-test steps. We present the results of a number of experiments using a set of domain-specific lexicons called WordNetDomains (which is in fact an extension of WordNet), and performed using the documents in the Reuters Corpus Volume 1 as 'implicit' representations for our terms.
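
The dual representation described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract only says that a boosting-based method is used, so the learner below is plain discrete AdaBoost over one-feature decision stumps, chosen as an assumption for illustration. The toy documents, the seed lexicon for a hypothetical "finance" domain, the binary occurrence weights, and all function names (`term_vectors`, `train_adaboost`, `score`) are likewise illustrative assumptions and stand in for the Reuters Corpus Volume 1 and WordNetDomains data.

```python
import math

def term_vectors(docs):
    """Represent each term as a binary vector over documents
    (the dual of the usual document-as-vector-of-terms view)."""
    tokenized = [set(d.split()) for d in docs]
    vocab = sorted(set().union(*tokenized))
    return {t: [1 if t in toks else 0 for toks in tokenized] for t in vocab}

def train_adaboost(X, y, rounds=5):
    """Discrete AdaBoost with one-feature stumps; labels in y are +1 / -1.
    Assumption: a generic stand-in for the paper's boosting-based learner."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None
        for j in range(len(X[0])):
            for sign in (1, -1):
                # Stump: predict `sign` if the term occurs in document j.
                preds = [sign if x[j] else -sign for x in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, sign, preds)
        err, j, sign, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, sign))
        # Reweight: misclassified training terms gain weight.
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def score(model, x):
    """Weighted vote of the stumps; positive means 'belongs to the domain'."""
    return sum(a * (s if x[j] else -s) for a, j, s in model)

# Toy corpus T (hypothetical data).
docs = [
    "bank loan interest rate",
    "bank interest",
    "loan rate",
    "goal match striker team",
    "match striker",
    "goal team",
]
seed_pos = ["bank", "loan"]   # initial lexicon L_0 for the toy domain
seed_neg = ["goal", "match"]  # terms known not to belong to it

vecs = term_vectors(docs)
X = [vecs[t] for t in seed_pos + seed_neg]
y = [1] * len(seed_pos) + [-1] * len(seed_neg)
model = train_adaboost(X, y)

# Label the remaining terms; positives extend the lexicon (L_1).
candidates = sorted(t for t in vecs if t not in seed_pos + seed_neg)
new_terms = [t for t in candidates if score(model, vecs[t]) > 0]
expanded_lexicon = seed_pos + new_terms
```

Here the training examples are terms rather than documents, and each feature answers the question "does this term occur in document j?". Terms that the classifier scores positively are added to the seed lexicon, which corresponds to one generate step of the bootstrapping process; a human or automatic check of the new terms would be the matching test step.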

Year: 2004
Organizational units: Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Keywords: Lexicons; Text Classification; Machine Learning
Appears in types: 08.04 Technical report

Files in this item:

File	Access	Size	Format
prod_160659-doc_125271.pdf	open access	334.81 kB	Adobe PDF

Description: Automatic Expansion of Domain-Specific Lexicons by Term Categorization

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/152168