A Comparison of Distributional Semantics Models for Polylingual Text Classification
Esuli A;Moreo Fernández A;Sebastiani F
2015
Abstract
Polylingual Text Classification (PLTC) is a supervised learning task that consists of assigning class labels to documents written in different languages, assuming that a representative set of training documents is available for each language. This scenario is increasingly frequent, given the large number of multilingual platforms and communities emerging on the Internet. The task is also receiving increased attention in the text classification community because of the new challenge it poses, i.e., how to effectively leverage polylingual resources in order to infer a multilingual classifier and to improve the performance of a monolingual one. In response, the use of machine translation tools or multilingual dictionaries has been proposed. However, these resources are not always available, or not always free to use. In this work we analyse some important methods proposed in the literature that are machine translation-free and dictionary-free, including Random Indexing, a method that, to the best of our knowledge, no one had previously tested on PLTC. We offer an analysis in terms of space and time efficiency, and propose a particular configuration of the Random Indexing method, which we dub Lightweight Random Indexing, that outperforms all other compared algorithms while also showing a significantly reduced computational cost.

| File | Size | Format |
|---|---|---|
| prod_344534-doc_107928.pdf (open access; publisher's version) | 2.83 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


