CNR Institutional Research Information System

Funnelling: A New Ensemble Method for Heterogeneous Transfer Learning and its Application to Cross-Lingual Text Classification

Esuli A; Moreo Fernandez A D; Sebastiani F

2019

Abstract

Cross-lingual Text Classification (CLC) consists of automatically classifying, according to a common set C of classes, documents each written in one of a set of languages L, and doing so more accurately than when "naïvely" classifying each document via its corresponding language-specific classifier. In order to obtain an increase in the classification accuracy for a given language, the system thus needs to also leverage the training examples written in the other languages. We tackle "multilabel" CLC via funnelling, a new ensemble learning method that we propose here. Funnelling consists of generating a two-tier classification system where all documents, irrespectively of language, are classified by the same (2nd-tier) classifier. For this classifier all documents are represented in a common, language-independent feature space consisting of the posterior probabilities generated by 1st-tier, language-dependent classifiers. This allows the classification of all test documents, of any language, to benefit from the information present in all training documents, of any language. We present substantial experiments, run on publicly available multilingual text collections, in which funnelling is shown to significantly outperform a number of state-of-the-art baselines. All code and datasets (in vector form) are made publicly available.
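The two-tier scheme described in the abstract can be illustrated with a minimal, pure-Python sketch. The abstract does not say which learners are used, how training documents are mapped to posterior probabilities, or whether the posteriors are calibrated, so everything below is an assumption made for illustration: the toy logistic regression, the direct re-use of the training documents to build the 2nd-tier training set, the 0.5 decision threshold, and all names (`BinaryLogReg`, `MultiLabel`, `train_funnelling`, `classify`) are hypothetical and are not taken from the authors' released code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class BinaryLogReg:
    """Toy logistic regression trained by batch gradient descent (illustrative only)."""
    def __init__(self, n_features, lr=1.0, epochs=500):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr, self.epochs = lr, epochs

    def predict_proba(self, x):
        return sigmoid(sum(w * xj for w, xj in zip(self.w, x)) + self.b)

    def fit(self, X, y):
        n = len(X)
        for _ in range(self.epochs):
            grad_w, grad_b = [0.0] * len(self.w), 0.0
            for x, target in zip(X, y):
                err = self.predict_proba(x) - target
                for j, xj in enumerate(x):
                    grad_w[j] += err * xj
                grad_b += err
            self.w = [w - self.lr * g / n for w, g in zip(self.w, grad_w)]
            self.b -= self.lr * grad_b / n
        return self

class MultiLabel:
    """One binary classifier per class; outputs one posterior per class."""
    def __init__(self, n_features, n_classes):
        self.models = [BinaryLogReg(n_features) for _ in range(n_classes)]

    def fit(self, X, Y):
        for c, model in enumerate(self.models):
            model.fit(X, [labels[c] for labels in Y])
        return self

    def posteriors(self, x):
        return [m.predict_proba(x) for m in self.models]

def train_funnelling(train_by_lang, n_classes):
    """train_by_lang maps a language to (X, Y); feature spaces may differ per language."""
    first_tier, Z, Y_all = {}, [], []
    for lang, (X, Y) in train_by_lang.items():
        clf = MultiLabel(len(X[0]), n_classes).fit(X, Y)   # 1st tier: language-dependent
        first_tier[lang] = clf
        # Assumption: training documents are projected directly into posterior space.
        Z.extend(clf.posteriors(x) for x in X)
        Y_all.extend(Y)
    meta = MultiLabel(n_classes, n_classes).fit(Z, Y_all)  # 2nd tier: shared by all languages
    return first_tier, meta

def classify(first_tier, meta, lang, x, threshold=0.5):
    z = first_tier[lang].posteriors(x)  # language-independent representation
    return [int(p >= threshold) for p in meta.posteriors(z)]

# Toy data: two languages with different feature spaces, two classes (multilabel).
train = {
    "en": ([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]],
           [[1, 0], [0, 1], [1, 1], [0, 0]]),
    "it": ([[1, 0], [0, 1], [1, 1], [0, 0]],
           [[1, 0], [0, 1], [1, 1], [0, 0]]),
}
first_tier, meta = train_funnelling(train, n_classes=2)
print(classify(first_tier, meta, "it", [1, 0]))
print(classify(first_tier, meta, "en", [0, 1, 0]))
```

The point of the sketch is that the 2nd-tier classifier never sees language-specific features: its input always has one dimension per class, so its training set pools documents from every language.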

Year: 2019
Organizational units: Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Language(s): English
Journal: ACM Transactions on Information Systems
Web of Science ID: WOS:000495417300009
Volume: 37
Issue: 3
First page: 1
Last page: 30
Number of pages: 30
DOI: https://dx.doi.org/10.1145/3326065
Scopus ID: 2-s2.0-85068863736
URL: https://dl.acm.org/doi/abs/10.1145/3326065
Refereed: Yes, but type not specified
Keywords: E-discovery; Technology-Assisted Review; Utility Theory; Semi-automated Text Classification
Other information: The present work has been supported by the ARIADNEplus project, funded by the European Commission (Grant 823914) under the H2020 Programme INFRAIA-2018-1. The authors' opinions do not necessarily reflect those of the European Commission.
Number of authors: 3
Type: info:eu-repo/semantics/article
MIUR login type: 262
All authors: Esuli, A; Moreo Fernandez, A D; Sebastiani, F
Type: 01 Journal contribution::01.01 Journal article
Full text: partially_open
Project identifier:
Project title: Advanced Research Infrastructure for Archaeological Data Networking in Europe - plus
Acronym: ARIADNEplus
Funding: H2020
Contract no.: 823914
Appears in types: 01.01 Journal article

Files in this item:

File	Access	Description	Type	Size	Format
prod_403485-doc_140464.pdf	Authorized users only	Funnelling: A New Ensemble Method for Heterogeneous Transfer Learning and its Application to Cross-Lingual Text Classification	Publisher's version (PDF)	1.08 MB	Adobe PDF
prod_403485-doc_159212.pdf	Open access	Postprint - Funnelling: A New Ensemble Method for Heterogeneous Transfer Learning and its Application to Cross-Lingual Text Classification	Publisher's version (PDF)	1.08 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/360765
