A Comparison of Distributional Semantics Models for Polylingual Text Classification

Esuli A; Moreo Fernández A; Sebastiani F
2015

Abstract

Polylingual Text Classification (PLTC) is a supervised learning task that consists of assigning class labels to documents written in different languages, assuming that a representative set of training documents is available for each language. This scenario is increasingly common, given the large number of multilingual platforms and communities emerging on the Internet. The task is also receiving increased attention in the text classification community because of the new challenge it poses, i.e., how to effectively leverage polylingual resources in order to infer a multilingual classifier and to improve the performance of a monolingual one. In response, the use of machine translation tools or multilingual dictionaries has been proposed. However, these resources are not always available, or not always free to use. In this work we analyse some important methods proposed in the literature that are machine translation-free and dictionary-free, including Random Indexing, a method that, to the best of our knowledge, had not previously been tested on PLTC. We offer an analysis in terms of space and time efficiency, and propose a particular configuration of the Random Indexing method, which we dub Lightweight Random Indexing, that outperforms all the other compared algorithms while also showing a significantly reduced computational cost.
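The abstract names Random Indexing without describing it. As a rough illustration only, the sketch below shows the generic form of the technique: each term receives a sparse random index vector with a few +1/-1 entries, and a document is represented by the weighted sum of the index vectors of its terms. The function names, the toy matrix, and the values of `dim` and `nonzeros` are assumptions made for this example; this is not the Lightweight Random Indexing configuration proposed in the paper, whose settings are not given in this record.

```python
import numpy as np


def random_index_vectors(n_terms, dim, nonzeros, seed=0):
    """Build one sparse ternary index vector per term (hypothetical helper).

    Each row has exactly `nonzeros` entries set to +1 or -1 at random
    positions; all other entries are 0.
    """
    rng = np.random.default_rng(seed)
    index = np.zeros((n_terms, dim))
    for t in range(n_terms):
        positions = rng.choice(dim, size=nonzeros, replace=False)
        signs = rng.choice([-1.0, 1.0], size=nonzeros)
        index[t, positions] = signs
    return index


def project(doc_term, index):
    """Map a (documents x terms) matrix into the reduced space."""
    return doc_term @ index


# Toy polylingual setting (assumed): the vocabularies of two languages are
# placed side by side, so columns 0-2 hold terms of one language and
# columns 3-5 hold terms of the other.
doc_term = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0, 0.0],  # document in language A
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],  # document in language A
    [0.0, 0.0, 0.0, 1.0, 2.0, 0.0],  # document in language B
    [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],  # document in language B
])

index = random_index_vectors(n_terms=6, dim=4, nonzeros=2)
reduced = project(doc_term, index)
print(reduced.shape)
```

Documents from both languages end up in the same reduced space, so a single classifier can be trained on `reduced` without any translation or dictionary resource.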
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Polylingual text classification
Distributional semantic models
Random Indexing
Files in this record:
File: prod_344534-doc_107928.pdf
Access: open access
Description: A Comparison of Distributional Semantics Models for Polylingual Text Classification
Type: Publisher's version (PDF)
Size: 2.83 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/334431