CNR Institutional Research Information System

Statistical phrases in automated text categorization

Caropreso MF; Matwin S; Sebastiani F

2000

Abstract

In this work we investigate the usefulness of n-grams for document indexing in text categorization (TC). We call an n-gram a set t_k of n word stems, and we say that t_k occurs in a document d_j when a sequence of words appears in d_j that, after stop word removal and stemming, consists exactly of the n stems in t_k, in some order. Previous research has investigated the use of n-grams (or some variant of them) in the context of specific learning algorithms, and thus has not obtained general answers on their usefulness for TC. In this work we investigate the usefulness of n-grams in TC independently of any specific learning algorithm. We do so by applying feature selection to the pool of all k-grams (k ≤ n), and checking how many n-grams score high enough to be selected in the top σ k-grams. We report the results of our experiments, using several feature selection functions and varying values of σ, performed on the Reuters-21578 standard TC benchmark. We also report results of making actual use of the selected n-grams in the context of a linear classifier induced by means of the Rocchio method.
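The n-gram definition in the abstract (a set of n stems, matched in any order after stop word removal and stemming) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the report: the stop word list, the crude `toy_stem` suffix stripper (a stand-in for a real stemmer), and the function names.

```python
# Hypothetical, tiny stop word list; the report does not specify the one it used.
STOP_WORDS = {"the", "of", "a", "an", "in", "and", "to"}


def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, then stem."""
    tokens = [w.strip(".,;:!?()").lower() for w in text.split()]
    return [toy_stem(w) for w in tokens if w and w not in STOP_WORDS]


def unordered_ngrams(stems, n):
    """Return the set of n-grams (as frozensets of n distinct stems)
    that occur in the stem sequence, ignoring the order of the stems."""
    grams = set()
    for i in range(len(stems) - n + 1):
        gram = frozenset(stems[i:i + n])
        if len(gram) == n:  # the window must contain n distinct stems
            grams.add(gram)
    return grams


if __name__ == "__main__":
    stems = preprocess("Categorization of texts and the text categorization.")
    print(stems)
    print(unordered_ngrams(stems, 2))
```

In this example "categorization of texts" and "text categorization" both reduce to the same 2-gram, because the n-gram is treated as a set of stems and not as an ordered sequence.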

	Year
	
				2000
			
	Organizational units
	
				Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
			
	Keywords
	
				Machine learning
Text categorisation
Text classif
Information filtering
Performance evaluation (efficiency and effectiveness)
Induction
			
	Appears in types:
	
				08.04 Technical report

Files in this item:

File	Size	Format
prod_406950-doc_142462.pdf (open access; description: Statistical phrases in automated text categorization)	227.89 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/361909
