Distributional random oversampling for imbalanced text classification

Moreo Fernandez A.; Esuli A.; Sebastiani F.
2016

Abstract

The accuracy of many classification algorithms is known to suffer when the data are imbalanced (i.e., when the distribution of the examples across the classes is severely skewed). Many applications of binary text classification are of this type, with the positive examples of the class of interest far outnumbered by the negative examples. Oversampling (i.e., generating synthetic training examples of the minority class) is an often used strategy to counter this problem. We present a new oversampling method specifically designed for classifying data (such as text) for which the distributional hypothesis holds, according to which the meaning of a feature is somehow determined by its distribution in large corpora of data. Our Distributional Random Oversampling method generates new random minority-class synthetic documents by exploiting the distributional properties of the terms in the collection. We discuss results we have obtained on the Reuters-21578, OHSUMED-S, and RCV1-v2 datasets.
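The oversampling idea described in the abstract can be illustrated with a minimal sketch. This is a simplified illustration under stated assumptions, not the authors' published algorithm: the function name `distributional_oversample`, the parameter `n_draws`, and the choice to append a "latent" block built from term occurrence profiles are all hypothetical. It assumes a dense document-term count matrix, a positive (minority) label of 1, and minority documents that contain at least one term.

```python
import numpy as np

def distributional_oversample(X, y, n_new, n_draws=20, seed=0):
    """Create n_new synthetic minority-class rows (label 1).

    Hypothetical sketch: X is a (n_docs, n_terms) count matrix, y holds 0/1 labels.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority = np.flatnonzero(y == 1)

    # Distributional view (assumption): a term is described by its column,
    # i.e. its occurrence profile across the training documents.
    term_profiles = X.T  # shape (n_terms, n_docs)

    synthetic = []
    for _ in range(n_new):
        d = rng.choice(minority)            # pick a minority document at random
        p = X[d] / X[d].sum()               # its term distribution (needs a non-empty document)
        counts = rng.multinomial(n_draws, p)  # randomly sample terms from that distribution
        latent = counts @ term_profiles     # sum the profiles of the sampled terms
        # The synthetic example keeps the original features and adds a random latent part.
        synthetic.append(np.concatenate([X[d], latent]))
    return np.vstack(synthetic), np.ones(n_new, dtype=int)
```

In this sketch every synthetic row shares its original term counts with the minority document it was drawn from and differs only in the randomly generated latent block. To train a classifier on the result, the original documents would need to be extended into the same enlarged feature space, which the sketch leaves out.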
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
English
Raffaele Perego, Fabrizio Sebastiani, Javed Aslam, Ian Ruthven, Justin Zobel
SIGIR 2016 - 39th International ACM SIGIR Conference on Research and Development in Information Retrieval
805
808
978-1-4503-4069-4
http://dl.acm.org/citation.cfm?id=2914722&CFID=812657189&CFTOKEN=16638796
ACM Press
New York
United States of America
Yes, but type not specified
17-21 July 2016
Pisa, Italy
Distributional semantics
ARTIFICIAL INTELLIGENCE. Learning
3
partially_open
273
info:eu-repo/semantics/conferenceObject
04 Conference contribution::04.01 Contribution in conference proceedings
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/320945
Citations
  • PMC: n/a
  • Scopus: 124
  • Web of Science (ISI): 91