CNR Institutional Research Information System

Supervised term weighting for automated text categorization

Debole F.; Sebastiani F.

2002

Abstract

The construction of a text classifier usually involves (i) a phase of term selection, in which the most relevant terms for the classification task are identified, (ii) a phase of term weighting, in which document weights for the selected terms are computed, and (iii) a phase of classifier learning, in which a classifier is generated from the weighted representations of the training documents. This process involves an activity of supervised learning, in which information on the membership of training documents in categories is used. Traditionally, supervised learning enters only phases (i) and (iii). In this paper we propose instead that learning from the training data should also affect phase (ii), i.e. that information on the membership of training documents in categories be used to determine term weights. We call this idea supervised term weighting (STW). As an example of STW, we propose a number of "supervised variants" of tfidf weighting, obtained by replacing the idf function with the function that has been used in phase (i) for term selection. The use of STW allows the terms that are distributed most differently in the positive and negative examples of the categories of interest to be weighted highest. We present experimental results obtained on the standard Reuters-21578 benchmark with three classifier learning methods (Rocchio, k-NN, and support vector machines), three term selection functions (information gain, chi-square, and gain ratio), and both local and global term selection and weighting.
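The replacement of idf by a term selection function can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the tf formula, the normalization, or the exact form of the selection functions, so the sketch assumes raw term counts for tf, the textbook 2x2 chi-square statistic, and a single binary category. The toy corpus and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: tokenized documents and their membership in one category.
DOCS = [
    ["wheat", "price", "rise"],
    ["wheat", "export", "price"],
    ["oil", "price", "fall"],
    ["oil", "export", "rise"],
]
LABELS = [True, True, False, False]

def idf(docs, term):
    """Unsupervised global factor: log(N / document frequency)."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def chi_square(docs, labels, term):
    """Supervised global factor (assumed form): chi-square on the 2x2
    term-presence vs. category-membership contingency table."""
    n = len(docs)
    a = sum(1 for d, y in zip(docs, labels) if term in d and y)      # present, positive
    b = sum(1 for d, y in zip(docs, labels) if term in d and not y)  # present, negative
    c = sum(1 for d, y in zip(docs, labels) if term not in d and y)  # absent, positive
    e = n - a - b - c                                                # absent, negative
    denom = (a + b) * (c + e) * (a + c) * (b + e)
    return n * (a * e - b * c) ** 2 / denom if denom else 0.0

def weight_document(doc, term_score):
    """Weight each term as tf * term_score(term); tf is the raw count (an assumption)."""
    tf = Counter(doc)
    return {t: tf[t] * term_score(t) for t in tf}

# Same document, weighted with the unsupervised and the supervised global factor.
tfidf_weights = weight_document(DOCS[1], lambda t: idf(DOCS, t))
stw_weights = weight_document(DOCS[1], lambda t: chi_square(DOCS, LABELS, t))
```

In this toy corpus "wheat" and "export" each occur in two documents, so idf weights them identically. The chi-square variant ranks "wheat" highest, because it occurs only in positive examples, and gives "export" a weight of zero, because it is spread evenly across positive and negative examples.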

Year: 2002

Organizational units: Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI

Keywords: Supervised term weighting, Text categorization, Text classification, Support vector machines, Supervised learning, Term weighting

Appears in types: 08.04 Technical report

Files in this record:

File	Access	Size	Format
prod_160620-doc_122822.pdf	Open access	197.55 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/152129
