CNR Institutional Research Information System

Using micro-documents for feature selection: the case of ordinal text classification

Baccianella S; Esuli A; Sebastiani F

2011

Abstract

Most popular feature selection (FS) methods for text classification (TC), such as information gain (a.k.a. mutual information), chi-square, and odds ratio, rely on binary information about the presence or absence of a feature in each training document. As such, these methods do not exploit a rich source of information, namely how frequently the feature occurs in each training document (term frequency). To overcome this drawback, we break down each training document of length k into k training "micro-documents", each consisting of a single word occurrence and carrying the same class information as the original training document. This move has the double effect of (a) keeping all the original FS methods straightforwardly applicable, and (b) making them sensitive to term frequency. We study the impact of this strategy in the case of ordinal TC, using four recently introduced FS functions, two SVM-based learning methods, and two large datasets of product reviews. The experiments show that this strategy substantially improves the accuracy of ordinal TC.
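
The micro-document idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy reviews, the helper names (`to_document_units`, `to_micro_units`, `chi_square`), and the use of plain chi-square as the scoring function are all assumptions made for illustration (the paper itself evaluates four FS functions designed for ordinal TC, which are not reproduced here). The sketch only shows that the same presence/absence scoring function runs unchanged on both representations, and that it becomes sensitive to term frequency once documents are split into micro-documents.

```python
def to_document_units(docs):
    """Standard view: one unit per document, holding the set of terms present."""
    return [(set(tokens), label) for tokens, label in docs]


def to_micro_units(docs):
    """Micro-document view: one unit per word occurrence, same label as its source."""
    return [({token}, label) for tokens, label in docs for token in tokens]


def chi_square(units, term, label):
    """Chi-square of (term, label) from binary presence/absence over the units."""
    a = b = c = d = 0
    for terms, unit_label in units:
        has_term = term in terms
        in_class = unit_label == label
        if has_term and in_class:
            a += 1  # term present, unit in class
        elif has_term:
            b += 1  # term present, unit not in class
        elif in_class:
            c += 1  # term absent, unit in class
        else:
            d += 1  # term absent, unit not in class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


# Hypothetical toy data: "great" occurs in both reviews, but far more often
# in the positive one.
docs = [
    (["great", "great", "great", "ok"], "pos"),
    (["great", "bad", "bad", "bad"], "neg"),
]

doc_units = to_document_units(docs)
micro_units = to_micro_units(docs)

print(chi_square(doc_units, "great", "pos"))    # 0.0: presence alone cannot separate the classes
print(chi_square(micro_units, "great", "pos"))  # 2.0: the frequency difference now shows up
```

In this toy case, the document-level score for "great" is zero because the term is present in every document, whereas the micro-document score is positive because three of its four occurrences come from the positive review.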

Year: 2011
Organizational units: Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Keywords: Feature selection; Ordinal regression; Ordinal classification; Text classification
Publication type: 04.01 Conference proceedings contribution

Files in this record:

File	Description	Type	Size	Format	Access
prod_206834-doc_46606.pdf	Using micro-documents for feature selection: the case of ordinal text classification	Publisher's version (PDF)	205.63 kB	Adobe PDF	Authorized users only

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/180967
