Lightweight random indexing for polylingual text classification

Moreo Fernandez A; Esuli A; Sebastiani F
2016

Abstract

Multilingual Text Classification (MLTC) is a text classification task in which each document is written in one of a set L of natural languages, and in which all documents must be classified under the same classification scheme, irrespective of language. There are two main variants of MLTC, namely Cross-Lingual Text Classification (CLTC) and Polylingual Text Classification (PLTC). In PLTC, which is the focus of this paper, we assume (differently from CLTC) that for each language in L there is a representative set of training documents; PLTC consists of improving the accuracy of each of the |L| monolingual classifiers by also leveraging the training documents written in the other (|L| - 1) languages. The obvious solution, consisting of generating a single polylingual classifier from the juxtaposed monolingual vector spaces, is usually infeasible, since the dimensionality of the resulting vector space is roughly |L| times that of a monolingual one, and is thus often unmanageable. As a response, the use of machine translation tools or multilingual dictionaries has been proposed. However, these resources are not always available, or are not always free to use. One machine-translation-free and dictionary-free method that, to the best of our knowledge, has never been applied to PLTC before, is Random Indexing (RI). We analyse RI in terms of space and time efficiency, and propose a particular configuration of it (that we dub Lightweight Random Indexing - LRI). By running experiments on two well-known public benchmarks, Reuters RCV1/RCV2 (a comparable corpus) and JRC-Acquis (a parallel one), we show LRI to outperform (both in terms of effectiveness and efficiency) a number of previously proposed machine-translation-free and dictionary-free PLTC methods that we use as baselines.
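The abstract names Random Indexing without describing its mechanics. As a rough illustration only, the sketch below shows the generic form of the technique as it is commonly described: each term, whatever its language, receives a sparse random index vector with k nonzero entries in {+1, -1}, and a document is represented as the count-weighted sum of the index vectors of its terms. All documents then share one reduced space whose size does not grow with |L|. The function names, the values `dim=16` and `k=2`, and the toy documents are assumptions made for this example; they are not taken from the paper and do not claim to reproduce the LRI configuration.

```python
import random
from collections import Counter


def make_index_vector(dim, k, rng):
    """Sparse random index vector: k distinct positions, each set to +1 or -1."""
    positions = rng.sample(range(dim), k)
    return {pos: rng.choice((-1, 1)) for pos in positions}


def random_index(docs, dim=16, k=2, seed=0):
    """Project tokenised documents into one shared dim-dimensional space.

    docs is a list of token lists; terms from every language draw their
    index vectors from the same space, so no dictionary or translation
    step is involved. dim and k are illustrative values (assumptions).
    """
    rng = random.Random(seed)
    index_vectors = {}  # term -> {position: sign}
    projected = []
    for tokens in docs:
        vec = [0.0] * dim
        for term, count in Counter(tokens).items():
            if term not in index_vectors:
                index_vectors[term] = make_index_vector(dim, k, rng)
            # Add the term's index vector, weighted by its raw frequency.
            for pos, sign in index_vectors[term].items():
                vec[pos] += sign * count
        projected.append(vec)
    return projected, index_vectors


# Toy bilingual input (hypothetical data, for illustration only).
docs = [
    ["bank", "loan", "rate"],        # English document
    ["banca", "prestito", "tasso"],  # Italian document
]
vectors, index_vectors = random_index(docs, dim=16, k=2)
```

Both toy documents end up as vectors of length `dim`, whereas juxtaposing the two monolingual vocabularies would need one dimension per distinct term in either language. A real setup would use a far larger `dim` and a term weighting scheme rather than raw counts.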
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Polylingual text classification
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/316537