Automatic expansion of domain-specific lexicons by term categorization

Avancini H.; Sebastiani F.
2006

Abstract

We discuss an approach to the automatic expansion of domain-specific lexicons, i.e., to the problem of extending, for each $c_i$ in a predefined set $C = \{c_1, \ldots, c_m\}$ of semantic domains, an initial lexicon $L_i^0$ into a larger lexicon $L_i^1$. Our approach relies on term categorization, defined as the task of labeling previously unlabeled terms according to a predefined set of domains. We approach this as a supervised learning problem, in which term classifiers are built using the initial lexicons as training data. Dually to classic text categorization tasks, in which documents are represented as vectors in a space of terms, we represent terms as vectors in a space of documents. We present the results of a number of experiments in which we use a boosting-based learning device for training our term classifiers. We test the effectiveness of our method by using WordNetDomains, a well-known large set of domain-specific lexicons, as a benchmark. Our experiments are performed using the documents in the Reuters Corpus Volume 1 as 'implicit' representations for our terms.
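The dual representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the seed lexicon, the binary occurrence weights, and the helper names (`term_vectors`, `train_adaboost`, `score`) are all assumptions made for illustration, and a plain discrete AdaBoost over one-feature stumps stands in for the boosting-based learner mentioned above. Each term becomes a vector with one dimension per document, the terms of an initial lexicon serve as training examples, and unlabeled terms that the classifier scores positively are proposed as additions to the lexicon.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus; the paper uses Reuters Corpus Volume 1 documents.
docs = [
    "bank loan interest rate",
    "loan credit bank",
    "match goal striker",
    "striker goal league match",
    "interest rate credit",
    "league goal",
]

def term_vectors(docs):
    """Represent each term as a binary vector over documents (1 = term occurs)."""
    vecs = defaultdict(lambda: [0] * len(docs))
    for j, text in enumerate(docs):
        for term in set(text.split()):
            vecs[term][j] = 1
    return dict(vecs)

def train_adaboost(X, y, rounds=5):
    """Discrete AdaBoost over one-feature stumps; labels y are in {-1, +1}."""
    n, d = len(X), len(X[0])
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None
        for j in range(d):
            for sign in (1, -1):
                # Stump: predict `sign` if the term occurs in document j.
                preds = [sign if x[j] else -sign for x in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, sign, preds)
        err, j, sign, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, sign))
        # Re-weight training terms: misclassified ones gain weight.
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def score(model, x):
    """Weighted vote of the stumps; a positive score means 'in the domain'."""
    return sum(alpha * (sign if x[j] else -sign) for alpha, j, sign in model)

vecs = term_vectors(docs)
# Hypothetical initial lexicon for one domain: +1 = finance, -1 = not finance.
seed = {"bank": 1, "loan": 1, "goal": -1, "match": -1}
X = [vecs[t] for t in seed]
y = [seed[t] for t in seed]
model = train_adaboost(X, y)
# Expanded lexicon candidates: unlabeled terms scored as positive.
expanded = sorted(t for t in vecs if t not in seed and score(model, vecs[t]) > 0)
print(expanded)
```

On this toy data the classifier picks up "interest" and "rate" because they co-occur with seed terms in the first document, while "credit" is missed. With so few training terms a single stump dominates, which is one reason a large document collection is useful as the feature space.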
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
I.5.2 Classifier design and evaluation
Lexicons
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/62927