Distributional random oversampling for imbalanced text classification
Moreo Fernandez A;Esuli A;Sebastiani F
2016
Abstract
The accuracy of many classification algorithms is known to suffer when the data are imbalanced (i.e., when the distribution of the examples across the classes is severely skewed). Many applications of binary text classification are of this type, with the positive examples of the class of interest far outnumbered by the negative examples. Oversampling (i.e., generating synthetic training examples of the minority class) is a frequently used strategy to counter this problem. We present a new oversampling method specifically designed for classifying data (such as text) for which the distributional hypothesis holds, i.e., the hypothesis that the meaning of a feature is somehow determined by its distribution in large corpora of data. Our Distributional Random Oversampling method generates new random synthetic minority-class documents by exploiting the distributional properties of the terms in the collection. We discuss the results we have obtained on the Reuters-21578, OHSUMED-S, and RCV1-v2 datasets.
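To make the notion of oversampling mentioned in the abstract concrete, the following is a minimal sketch of plain random oversampling, the simplest baseline, which balances a binary dataset by duplicating randomly chosen minority-class documents. It is not the Distributional Random Oversampling method, whose details are not given on this page. The function name `random_oversample`, the toy documents, and the labels are all hypothetical and serve only as an illustration.

```python
import random
from collections import Counter

def random_oversample(docs, labels, seed=0):
    """Balance a binary dataset by resampling minority examples with replacement.

    Assumption: `labels` contains exactly two distinct classes and both
    `docs` and `labels` are Python lists of equal length.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    # most_common sorts by frequency, so the first entry is the majority class.
    (majority, n_major), (minority, n_minor) = counts.most_common(2)
    minority_docs = [d for d, y in zip(docs, labels) if y == minority]
    # Draw as many extra minority copies as needed to match the majority count.
    extra = [rng.choice(minority_docs) for _ in range(n_major - n_minor)]
    return docs + extra, labels + [minority] * len(extra)

# Toy, made-up collection: one positive document against four negatives.
docs = ["wheat prices rise", "stocks fall", "bank rates", "oil output", "merger talks"]
labels = [1, 0, 0, 0, 0]
balanced_docs, balanced_labels = random_oversample(docs, labels)
```

Because this baseline only repeats existing documents, it adds no new information about the minority class; methods that generate synthetic examples, such as the one described in the abstract, aim to address that limitation.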