CNR Institutional Research Information System

Distributional random oversampling for imbalanced text classification

Moreo Fernandez A.; Esuli A.; Sebastiani F.

2016

Abstract

The accuracy of many classification algorithms is known to suffer when the data are imbalanced (i.e., when the distribution of the examples across the classes is severely skewed). Many applications of binary text classification are of this type, with the positive examples of the class of interest far outnumbered by the negative examples. Oversampling (i.e., generating synthetic training examples of the minority class) is an often used strategy to counter this problem. We present a new oversampling method specifically designed for classifying data (such as text) for which the distributional hypothesis holds, according to which the meaning of a feature is somehow determined by its distribution in large corpora of data. Our Distributional Random Oversampling method generates new random minority-class synthetic documents by exploiting the distributional properties of the terms in the collection. We discuss results we have obtained on the Reuters-21578, OHSUMED-S, and RCV1-v2 datasets.
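To make the notion of oversampling concrete, the following is a minimal sketch of plain random oversampling, the generic baseline in which minority-class training examples are duplicated at random until the classes are balanced. It is not the Distributional Random Oversampling method described in the abstract, which generates synthetic documents from the distributional properties of terms. The function name `random_oversample`, its parameters, and the toy documents are assumptions made for illustration only.

```python
import random
from collections import Counter


def random_oversample(docs, labels, minority_label=1, target_ratio=1.0, seed=0):
    """Plain random oversampling (illustrative baseline, not the paper's method).

    Randomly chosen minority-class documents are duplicated until the
    minority class has about target_ratio * (number of majority examples).
    """
    rng = random.Random(seed)
    minority = [d for d, y in zip(docs, labels) if y == minority_label]
    if not minority:
        raise ValueError("no minority-class examples to sample from")
    n_majority = len(labels) - len(minority)
    # Number of extra minority examples needed to reach the target ratio.
    n_needed = max(0, int(target_ratio * n_majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(n_needed)]
    return list(docs) + extra, list(labels) + [minority_label] * n_needed


# Hypothetical toy collection: one positive document, four negatives.
docs = ["wheat prices rise", "stocks fall", "rates unchanged",
        "bank merger", "oil steady"]
labels = [1, 0, 0, 0, 0]

balanced_docs, balanced_labels = random_oversample(docs, labels)
print(Counter(balanced_labels))
```

Because the duplicates are exact copies, this baseline adds no new information about the minority class; that limitation is what motivates methods that generate synthetic examples instead.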


Year	2016
Organizational units	Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Language(s)	English
External supervisors and coordinators	Raffaele Perego; Fabrizio Sebastiani; Javed Aslam; Ian Ruthven; Justin Zobel
Conference title	SIGIR 2016 - 39th International ACM SIGIR Conference on Research and Development in Information Retrieval
First page	805
Last page	808
ISBN	978-1-4503-4069-4
DOI	https://dx.doi.org/10.1145/2911451.2914722
URL	http://dl.acm.org/citation.cfm?id=2914722&CFID=812657189&CFTOKEN=16638796
Publisher	ACM Press
Publisher city	New York
Publisher country	United States of America
Refereed	Yes, but type not specified
Conference dates	17-21 July 2016
Conference venue	Pisa, Italy
Keywords	Distributional semantics; ARTIFICIAL INTELLIGENCE. Learning
Scopus ID	2-s2.0-84980410173
Web of Science ID	WOS:000455100800111
Number of authors	3
Fulltext	partially_open
All authors	Moreo Fernandez A.; Esuli A.; Sebastiani F.
MIUR login type	273
Type	info:eu-repo/semantics/conferenceObject
Type	04 Conference contribution::04.01 Contribution in conference proceedings
Appears in types	04.01 Contribution in conference proceedings

Files in this item:

File	Access	Description	Type	Size	Format
prod_356991-doc_116357.pdf	Authorized users only	Distributional random oversampling for imbalanced text classification	Publisher's version (PDF)	512.29 kB	Adobe PDF
prod_356991-doc_159207.pdf	Open access	Distributional random oversampling for imbalanced text classification	Publisher's version (PDF)	252.93 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/320945
