In many applications there is a need to monitor how a population is distributed across different classes, and to track the changes in this distribution that derive from varying circumstances; an example such application is monitoring the percentage (or "prevalence") of unemployed people in a given region, or in a given age range, or at different time periods. When the membership of an individual in a class cannot be established deterministically, this monitoring activity requires classification. However, in the above applications the final goal is not determining which class each individual belongs to, but simply estimating the prevalence of each class in the unlabeled data. This task is called quantification. In a supervised learning framework we may estimate the distribution across the classes in a test set from a training set of labeled individuals. However, this may be suboptimal, since the distribution in the test set may be substantially different from that in the training set (a phenomenon called distribution drift). So far, quantification has mostly been addressed by learning a classifier optimized for individual classification and later adjusting the distribution it computes to compensate for its tendency to either under- or over-estimate the prevalence of the class. In this paper we propose instead to use a type of decision trees (quantification trees) optimized not for individual classification, but directly for quantification. Our experiments show that quantification trees are more accurate than existing state-of-the-art quantification methods, while retaining at the same time the simplicity and understandability of the decision tree framework.

Quantification trees

Milli L;Rossetti G;Giannotti F;Sebastiani F
2013

Abstract

In many applications there is a need to monitor how a population is distributed across different classes, and to track the changes in this distribution that derive from varying circumstances; an example such application is monitoring the percentage (or "prevalence") of unemployed people in a given region, or in a given age range, or at different time periods. When the membership of an individual in a class cannot be established deterministically, this monitoring activity requires classification. However, in the above applications the final goal is not determining which class each individual belongs to, but simply estimating the prevalence of each class in the unlabeled data. This task is called quantification. In a supervised learning framework we may estimate the distribution across the classes in a test set from a training set of labeled individuals. However, this may be suboptimal, since the distribution in the test set may be substantially different from that in the training set (a phenomenon called distribution drift). So far, quantification has mostly been addressed by learning a classifier optimized for individual classification and later adjusting the distribution it computes to compensate for its tendency to either under- or over-estimate the prevalence of the class. In this paper we propose instead to use a type of decision trees (quantification trees) optimized not for individual classification, but directly for quantification. Our experiments show that quantification trees are more accurate than existing state-of-the-art quantification methods, while retaining at the same time the simplicity and understandability of the decision tree framework.
2013
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Quantification
Decision trees
Text classification
I.2.6 Learning
File in questo prodotto:
File Dimensione Formato  
prod_277762-doc_78334.pdf

solo utenti autorizzati

Descrizione: Quantification Trees
Tipologia: Versione Editoriale (PDF)
Dimensione 223.88 kB
Formato Adobe PDF
223.88 kB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/244965
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 43
  • ???jsp.display-item.citation.isi??? ND
social impact