We discuss an approach to the automatic expansion of domain-specific lexicons, i.e., to the problem of extending, for each $c_i$ in a predefined set $C = \{c_1, \ldots, c_m\}$ of semantic domains, an initial lexicon $L^i_0$ into a larger lexicon $L^i_1$. Our approach relies on term categorization, defined as the task of labeling previously unlabeled terms according to a predefined set of domains. We approach this as a supervised learning problem, in which term classifiers are built using the initial lexicons as training data. Dually to classic text categorization tasks, in which documents are represented as vectors in a space of terms, we represent terms as vectors in a space of documents. We present the results of a number of experiments in which we use a boosting-based learning device for training our term classifiers. We test the effectiveness of our method by using WordNetDomains, a well-known large set of domain-specific lexicons, as a benchmark. Our experiments are performed using the documents in the Reuters Corpus Volume 1 as 'implicit' representations for our terms.
Automatic expansion of domain-specific lexicons by term categorization
Avancini H;Sebastiani F;
2006
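The dual representation mentioned in the abstract, where each term is a vector over documents rather than each document a vector over terms, can be illustrated with a minimal sketch. The toy corpus, the function name `term_vectors`, and the use of raw occurrence counts as weights are all assumptions made for illustration; the abstract does not say which weighting the authors use.

```python
from collections import Counter


def term_vectors(documents):
    """Map each term to its vector of occurrence counts over the documents.

    Hypothetical helper: raw counts are an illustrative weighting only.
    """
    # One bag of words per document (whitespace tokenization, lowercased).
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # Position j of a term's vector is its count in document j.
    return {term: [c[term] for c in counts] for term in vocabulary}


# Hypothetical toy corpus standing in for a real document collection.
documents = [
    "bank raises interest rate",
    "striker scores late goal",
    "bank cuts rate again",
]
vectors = term_vectors(documents)
```

In a setup like the one the abstract describes, the vectors of terms already present in an initial lexicon would serve as training examples for that domain's classifier, and the vectors of unlabeled terms would be the items to classify.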