Lightweight random indexing for polylingual text classification
Moreo Fernandez A; Esuli A; Sebastiani F
2016
Abstract
Multilingual Text Classification (MLTC) is a text classification task in which each document is written in one of a set L of natural languages, and in which all documents must be classified under the same classification scheme, irrespective of language. There are two main variants of MLTC, namely Cross-Lingual Text Classification (CLTC) and Polylingual Text Classification (PLTC). In PLTC, which is the focus of this paper, we assume (differently from CLTC) that for each language in L there is a representative set of training documents; PLTC consists of improving the accuracy of each of the |L| monolingual classifiers by also leveraging the training documents written in the other (|L| - 1) languages. The obvious solution, consisting of generating a single polylingual classifier from the juxtaposed monolingual vector spaces, is usually infeasible, since the dimensionality of the resulting vector space is roughly |L| times that of a monolingual one, and is thus often unmanageable. As a response, the use of machine translation tools or multilingual dictionaries has been proposed. However, these resources are not always available, or are not always free to use. One machine-translation-free and dictionary-free method that, to the best of our knowledge, has never been applied to PLTC before, is Random Indexing (RI). We analyse RI in terms of space and time efficiency, and propose a particular configuration of it (that we dub Lightweight Random Indexing, LRI). By running experiments on two well-known public benchmarks, Reuters RCV1/RCV2 (a comparable corpus) and JRC-Acquis (a parallel one), we show LRI to outperform (both in terms of effectiveness and efficiency) a number of previously proposed machine-translation-free and dictionary-free PLTC methods that we use as baselines.
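The abstract names Random Indexing without describing its mechanics. As a rough illustration only, the sketch below shows generic Random Indexing as it is commonly described: each term gets a sparse random index vector with a few +1/-1 entries, and a document is the weighted sum of the index vectors of its terms. The class name `RandomIndexer`, the parameters `dim` and `nonzeros`, their values, the toy documents, and the choice to key index vectors on `(language, term)` pairs are all assumptions made for this example; they are not the LRI configuration proposed in the paper, which the abstract does not specify.

```python
import random
from collections import defaultdict


def make_index_vector(dim, nonzeros, rng):
    """Return a sparse ternary index vector as {position: +1.0 or -1.0}."""
    positions = rng.sample(range(dim), nonzeros)
    return {pos: rng.choice((-1.0, 1.0)) for pos in positions}


class RandomIndexer:
    """Generic Random Indexing sketch (hypothetical helper, not the paper's LRI)."""

    def __init__(self, dim=1000, nonzeros=2, seed=0):
        self.dim = dim              # size of the reduced space (assumed value)
        self.nonzeros = nonzeros    # non-zero entries per index vector (assumed value)
        self.rng = random.Random(seed)
        self.index_vectors = {}     # (language, term) -> sparse index vector

    def project(self, language, term_weights):
        """Map a {term: weight} document to a sparse vector with `dim` dimensions."""
        doc = defaultdict(float)
        for term, weight in term_weights.items():
            key = (language, term)
            if key not in self.index_vectors:
                # Index vectors are created lazily, the first time a term is seen.
                self.index_vectors[key] = make_index_vector(
                    self.dim, self.nonzeros, self.rng
                )
            for pos, sign in self.index_vectors[key].items():
                doc[pos] += weight * sign
        return dict(doc)


# Toy documents in two languages end up in the same fixed-size space.
indexer = RandomIndexer(dim=1000, nonzeros=2, seed=42)
english_doc = indexer.project("en", {"bank": 0.7, "loan": 0.3})
italian_doc = indexer.project("it", {"banca": 0.6, "prestito": 0.4})
```

In this sketch the reduced space keeps a fixed size `dim` however many languages contribute terms, whereas a juxtaposed space grows with the combined vocabulary of all |L| languages.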