Heterogeneous document embeddings for cross-lingual text classification

Moreo A.; Pedrotti A.; Sebastiani F.
2021

Abstract

Funnelling (Fun) is a method for cross-lingual text classification (CLC) based on a two-tier ensemble for heterogeneous transfer learning. In Fun, first-tier classifiers, each working on a different, language-dependent feature space, return a vector of calibrated posterior probabilities (with one dimension for each class) for each document, and the final classification decision is taken by a meta-classifier that uses this vector as its input. The meta-classifier can thus exploit class-class correlations, and this (among other things) gives Fun an edge over CLC systems in which these correlations cannot be leveraged. We here describe Generalized Funnelling (gFun), a learning ensemble in which the meta-classifier receives as input the above vector of calibrated posterior probabilities, concatenated with document embeddings (aligned across languages) that embody other types of correlations, such as word-class correlations (as encoded by Word-Class Embeddings) and word-word correlations (as encoded by Multilingual Unsupervised or Supervised Embeddings). We report experiments on two large, standard multilingual datasets for multi-label text classification, showing that gFun improves on Fun.
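The architecture described above can be illustrated with a minimal sketch: per-language first-tier classifiers produce calibrated posterior probabilities, which are concatenated with language-aligned document embeddings and fed to a single meta-classifier. This is an illustration using scikit-learn with mock data, not the authors' implementation; the embeddings here are random stand-ins for the aligned embeddings (e.g. Word-Class Embeddings or MUSE) used in the paper.

```python
# Illustrative sketch of the gFun idea (assumed names and mock data,
# not the authors' code). First tier: one calibrated classifier per
# language-dependent feature space. Second tier: a shared meta-classifier
# over [calibrated posteriors ; aligned document embeddings].
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_classes, emb_dim = 3, 8

# Mock language-dependent feature spaces (e.g. per-language tf-idf),
# with different dimensionalities per language.
langs = {
    "en": (rng.normal(size=(40, 10)), rng.integers(0, n_classes, 40)),
    "it": (rng.normal(size=(40, 15)), rng.integers(0, n_classes, 40)),
}

posteriors, labels, embeddings = [], [], []
for lang, (X, y) in langs.items():
    # First tier: calibrated classifier -> |C|-dimensional posterior vector
    clf = CalibratedClassifierCV(LinearSVC(), cv=3).fit(X, y)
    posteriors.append(clf.predict_proba(X))
    labels.append(y)
    # Stand-in for embeddings aligned across languages (same dim for all)
    embeddings.append(rng.normal(size=(len(y), emb_dim)))

# Second tier: pool all languages into one shared input space
Z = np.vstack([np.hstack([p, e]) for p, e in zip(posteriors, embeddings)])
y_all = np.concatenate(labels)
meta = OneVsRestClassifier(LinearSVC()).fit(Z, y_all)
predictions = meta.predict(Z)  # one decision per document, all languages pooled
```

The key point the sketch captures is that the meta-classifier's input space is language-independent: posteriors have one dimension per class regardless of language, and the embeddings are aligned, so a single second-tier model serves every language. (The paper's setting is multi-label; the sketch uses single-label mock targets for brevity.)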
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
ISBN: 978-1-4503-8104-8
Heterogeneous transfer learning
Transfer learning
Text classification
Ensemble learning
Word embeddings
Files in this product:

File: prod_456428-doc_176637.pdf (open access)
Description: Postprint - Heterogeneous document embeddings for cross-lingual text classification
Type: Publisher's version (PDF)
Size: 552.8 kB
Format: Adobe PDF

File: prod_456428-doc_176649.pdf (authorized users only)
Description: Heterogeneous document embeddings for cross-lingual text classification
Type: Publisher's version (PDF)
Size: 1.32 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/399578
Citations
  • PMC: not available
  • Scopus: 4
  • Web of Science: 1