Supervised term weighting for automated text categorization
Debole F; Sebastiani F
2003
Abstract
The construction of a text classifier usually involves (i) a phase of term selection, in which the most relevant terms for the classification task are identified, (ii) a phase of term weighting, in which document weights for the selected terms are computed, and (iii) a phase of classifier learning, in which a classifier is generated from the weighted representations of the training documents. This process involves an activity of supervised learning, in which information on the membership of training documents in categories is used. Traditionally, supervised learning enters only phases (i) and (iii). In this paper we propose instead that learning from training data should also affect phase (ii), i.e. that information on the membership of training documents to categories be used to determine term weights. We call this idea supervised term weighting (STW). As an example, we propose a number of "supervised variants" of tfidf weighting, obtained by replacing the idf function with the function that has been used in phase (i) for term selection. We present experimental results obtained on the standard Reuters-21578 benchmark with one classifier learning method (support vector machines), three term selection functions (information gain, chi-square, and gain ratio), and both local and global term selection and weighting.
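To make the idea of a "supervised variant" of tfidf concrete, the following is a minimal sketch in which the idf factor is replaced by the chi-square score of each term with respect to one category. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact formulas, so the logarithmic tf form, the cosine normalization, the toy corpus, and the function names `chi_square` and `supervised_weights` are all hypothetical choices made here.

```python
import math
from collections import Counter


def chi_square(docs, labels, term):
    """Chi-square statistic between presence of `term` and category membership.

    Uses the standard 2x2 contingency form; `labels[i]` is True when
    document i belongs to the category of interest.
    """
    n = len(docs)
    a = sum(1 for d, y in zip(docs, labels) if term in d and y)          # in category, term present
    b = sum(1 for d, y in zip(docs, labels) if term in d and not y)      # not in category, term present
    c = sum(1 for d, y in zip(docs, labels) if term not in d and y)      # in category, term absent
    d_ = n - a - b - c                                                   # not in category, term absent
    denom = (a + b) * (c + d_) * (a + c) * (b + d_)
    return n * (a * d_ - b * c) ** 2 / denom if denom else 0.0


def supervised_weights(docs, labels):
    """Weight each term as tf(t, d) * chi_square(t, c) instead of tf * idf.

    Assumptions (not taken from the abstract): tf = 1 + log(count),
    and each document vector is cosine-normalized.
    """
    vocab = sorted({t for d in docs for t in d})
    score = {t: chi_square(docs, labels, t) for t in vocab}
    weighted = []
    for d in docs:
        counts = Counter(d)
        vec = {t: (1 + math.log(counts[t])) * score[t] for t in counts}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        weighted.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return weighted, score


# Hypothetical toy corpus: the first two documents belong to the category.
docs = [
    ["wheat", "price", "rise"],
    ["wheat", "export", "price"],
    ["oil", "price", "fall"],
    ["oil", "export", "rise"],
]
labels = [True, True, False, False]

weighted, score = supervised_weights(docs, labels)
```

In this toy corpus "wheat" appears only in category documents and therefore receives a larger weight than "price", while "export" and "rise" are spread evenly across both groups and receive weight zero. Chi-square is symmetric in the two classes, so a term such as "oil", which signals non-membership, scores as highly as "wheat". Information gain or gain ratio could be substituted for `chi_square` in the same place.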