CNR Institutional Research Information System

Lightweight random indexing for polylingual text classification

Moreo Fernandez A;Esuli A;Sebastiani F

2016

Abstract

Multilingual Text Classification (MLTC) is a text classification task in which documents are written each in one among a set L of natural languages, and in which all documents must be classified under the same classification scheme, irrespective of language. There are two main variants of MLTC, namely Cross-Lingual Text Classification (CLTC) and Polylingual Text Classification (PLTC). In PLTC, which is the focus of this paper, we assume (differently from CLTC) that for each language in L there is a representative set of training documents; PLTC consists of improving the accuracy of each of the |L| monolingual classifiers by also leveraging the training documents written in the other (|L| - 1) languages. The obvious solution, consisting of generating a single polylingual classifier from the juxtaposed monolingual vector spaces, is usually infeasible, since the dimensionality of the resulting vector space is roughly |L| times that of a monolingual one, and is thus often unmanageable. As a response, the use of machine translation tools or multilingual dictionaries has been proposed. However, these resources are not always available, or are not always free to use. One machine-translation-free and dictionary-free method that, to the best of our knowledge, has never been applied to PLTC before, is Random Indexing (RI). We analyse RI in terms of space and time efficiency, and propose a particular configuration of it (that we dub Lightweight Random Indexing - LRI). By running experiments on two well known public benchmarks, Reuters RCV1/RCV2 (a comparable corpus) and JRC-Acquis (a parallel one), we show LRI to outperform (both in terms of effectiveness and efficiency) a number of previously proposed machine-translation-free and dictionary-free PLTC methods that we use as baselines.
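As a rough illustration of the general idea behind Random Indexing mentioned in the abstract, the sketch below maps each term to a sparse random "index vector" of fixed dimension and sums those vectors to represent a document, so the representation size no longer grows with the combined vocabulary of all languages. This is a generic sketch, not the paper's LRI configuration, whose parameters are not given in this record. The function names, the `dim` and `nonzeros` values, and the language-prefixed tokens (e.g. `en:market`) are all assumptions made for illustration.

```python
import random
from collections import Counter


def index_vector(term, dim, nonzeros, seed=0):
    """Return a sparse ternary index vector for `term` as {position: +1 or -1}.

    Hypothetical helper: the generator is seeded from the term itself, so the
    same term always receives the same index vector without storing a table.
    """
    rng = random.Random(f"{seed}:{term}")
    positions = rng.sample(range(dim), nonzeros)
    return {pos: rng.choice((-1, 1)) for pos in positions}


def project(tokens, dim=16, nonzeros=2, seed=0):
    """Represent a bag of tokens as a dim-dimensional vector.

    Each term contributes its index vector weighted by its frequency.
    `dim` and `nonzeros` are illustrative values, not taken from the paper.
    """
    vec = [0.0] * dim
    for term, count in Counter(tokens).items():
        for pos, sign in index_vector(term, dim, nonzeros, seed).items():
            vec[pos] += sign * count
    return vec


# Toy documents in two languages; the language prefixes are an assumption that
# mimics keeping the monolingual vocabularies distinct, as in juxtaposed spaces.
doc_en = ["en:market", "en:price", "en:market"]
doc_it = ["it:mercato", "it:prezzo"]

vec_en = project(doc_en)
vec_it = project(doc_it)
```

Both documents end up in the same 16-dimensional space regardless of how many languages or terms are involved, which is the property that makes a single classifier over all languages tractable in this kind of approach.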

Year: 2016

Organizational units: Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI

Keywords: Polylingual text classification

Appears in types: 01.01 Journal article

Files in this item:

File	Access	Type	Size	Format
prod_359535-doc_117979.pdf	Open access	Publisher's version (PDF)	1.94 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/316537
