Sustainable hybrid text classification: Enhancing encoder-only language models with NLI-derived pseudo-examples
Folino F.;Pontieri L.
2026
Abstract
Building accurate text classifiers with limited labeled data is a critical challenge in production systems and high-stakes applications (e.g., misinformation mitigation, clinical decision support). Organizations face high data-annotation costs and need solutions that can adapt to data drift over time. Despite their impressive results, contemporary generative Large Language Models (LLMs) are unsuitable for production classification tasks due to their high computational demands and inconsistent predictions. To tackle these issues, this work introduces a lightweight modular framework combining an encoder-only LM with weak supervision from pre-trained Natural Language Inference (NLI) models. The NLI models are used as zero-shot annotators, forming an ensemble that generates pseudo-labels to augment the set of training examples. Unreliable pseudo-labeled instances are discarded through an abstention mechanism based on label agreement and risk-bounded rejection thresholds. The remaining pseudo-labeled data are used to fine-tune the encoder-only classifier with minimal training overhead. The framework's major strengths are: (i) synergistic integration of complementary pre-trained models, (ii) an inexpensive abstention mechanism for improving pseudo-label quality, and (iii) low computational costs, owing to the use of small LMs, little training data, and few fine-tuning epochs. Tests on fake news and depression detection datasets show that this approach achieves compelling performance at a fraction of the computational cost of generative LLMs and state-of-the-art semi-supervised learning methods.
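To make the weak-supervision step concrete, below is a minimal sketch assuming HuggingFace `transformers` zero-shot pipelines as the NLI annotators. The model names, candidate labels, and the 0.8 confidence threshold are illustrative assumptions, not the authors' published configuration.

```python
"""
Sketch: NLI ensemble as zero-shot pseudo-labelers with
agreement- and confidence-based abstention (illustrative only).
"""
from transformers import pipeline

# Two complementary pre-trained NLI models acting as zero-shot annotators.
# (Hypothetical choices; the abstract does not name the models used.)
annotators = [
    pipeline("zero-shot-classification", model="facebook/bart-large-mnli"),
    pipeline("zero-shot-classification", model="roberta-large-mnli"),
]

CANDIDATE_LABELS = ["fake news", "real news"]  # task-specific label verbalizations
MIN_CONFIDENCE = 0.8  # illustrative stand-in for a risk-bounded rejection threshold


def pseudo_label(text: str):
    """Return (label, min_score) if all annotators agree with enough
    confidence; return None (abstain) otherwise."""
    votes = []
    for clf in annotators:
        out = clf(text, candidate_labels=CANDIDATE_LABELS)
        votes.append((out["labels"][0], out["scores"][0]))  # top label + its score
    labels = {lbl for lbl, _ in votes}
    min_score = min(score for _, score in votes)
    # Abstain on disagreement or low confidence: the instance is unreliable.
    if len(labels) > 1 or min_score < MIN_CONFIDENCE:
        return None
    return labels.pop(), min_score


unlabeled_texts = [
    "Scientists confirm the moon is made of cheese, sources say.",
    "The central bank raised interest rates by 25 basis points.",
]
pseudo_labeled = [
    (text, res[0]) for text in unlabeled_texts
    if (res := pseudo_label(text)) is not None
]
# The surviving (text, pseudo-label) pairs would augment the labeled set
# used to fine-tune the small encoder-only classifier for a few epochs.
print(pseudo_labeled)
```

Requiring unanimous agreement plus a per-instance confidence floor is the simplest form of the agreement-and-risk abstention the abstract describes; the paper's actual thresholds are calibrated to bound risk rather than fixed by hand.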
| File | Type | License | Size | Format |
|---|---|---|---|---|
| ARRAY_2026.pdf (authorized users only) | Editorial Version (PDF) | Creative Commons | 2.23 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


