CNR Institutional Research Information System

A utility-theoretic ranking method for semi-automated text classification.

Berardi G; Esuli A; Sebastiani F

2012

Abstract

In Semi-Automated Text Classification (SATC), an automatic classifier Phi labels a set D of unlabelled documents, after which a human annotator inspects the labels that Phi assigned to a subset D' of D, correcting them where appropriate, with the aim of improving the overall quality of the labelling. An automated system can support this process by ranking the automatically labelled documents so as to maximize the expected increase in effectiveness that derives from inspecting D'. An obvious strategy is to rank D so that the documents that Phi has classified with the lowest confidence are top-ranked. In this work we show that this strategy is suboptimal. We develop a new utility-theoretic ranking method based on the notion of inspection gain, defined as the improvement in classification effectiveness that would derive from inspecting and correcting a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially inspecting a list generated by a given ranking method. We report experimental results showing that, relative to the confidence-based baseline above and according to the proposed measure, our ranking method can achieve substantially higher expected reductions in classification error.
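The contrast between the two ranking strategies can be illustrated with a minimal Python sketch. It is not the method of the paper, whose formulas are not given in this record. The `Doc` type, the function names, and the per-error gains `gain_fp` and `gain_fn` are all hypothetical, and the expected gain is assumed here to be the probability that a label is wrong multiplied by the benefit of correcting that kind of error.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    predicted_positive: bool  # label assigned by the classifier Phi
    confidence: float         # assumed: estimated probability that the label is correct


def rank_by_low_confidence(docs):
    """Baseline named in the abstract: least confident documents first."""
    return sorted(docs, key=lambda d: d.confidence)


def rank_by_expected_gain(docs, gain_fp, gain_fn):
    """Hypothetical utility-style ranking (an assumption, not the paper's formula).

    Expected gain = P(label is wrong) * gain from correcting that error type.
    gain_fp: assumed benefit of correcting a false positive.
    gain_fn: assumed benefit of correcting a false negative.
    """
    def expected_gain(d):
        p_wrong = 1.0 - d.confidence
        gain = gain_fp if d.predicted_positive else gain_fn
        return p_wrong * gain

    return sorted(docs, key=expected_gain, reverse=True)


if __name__ == "__main__":
    docs = [Doc("a", True, 0.6), Doc("b", False, 0.7), Doc("c", True, 0.9)]
    print([d.doc_id for d in rank_by_low_confidence(docs)])
    print([d.doc_id for d in rank_by_expected_gain(docs, gain_fp=1.0, gain_fn=2.0)])
```

When the two kinds of error are assumed to carry different gains, a document labelled with higher confidence can still outrank a less confident one, which is one intuitive reason why a purely confidence-based ranking may be suboptimal.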

Year: 2012

Organizational units: Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI

Keywords: cost-sensitive learning; ranking; semi-automated text classification; supervised learning; text classification

Appears in types: 04.01 Conference proceedings contribution

Files in this record:

File	Description	Type	Size	Format	Access
prod_218173-doc_51148.pdf	A utility-theoretic ranking method for semi-automated text classification	Publisher's version (PDF)	850.37 kB	Adobe PDF	Authorized users only

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/2684
