HTC-GEN: A Generative LLM-Based Approach to Handle Data Scarcity in Hierarchical Text Classification

Longo C. F. (first author; Methodology);
Bulla L. (penultimate author; Writing – Original Draft Preparation);
Tuccari G. G. (last author; Data Curation)
2024

Abstract

Hierarchical text classification is a challenging task, particularly when complex taxonomies with multi-level labeling structures must be handled. A critical difficulty is the scarcity of labeled data that covers the entire spectrum of taxonomy labels. To address this, we propose HTC-GEN, a novel framework that leverages synthetic data generation by means of large language models, with a specific focus on Llama 2. Llama 2 generates coherent, contextually relevant text samples across hierarchical levels, emulating the patterns of real-world text data. HTC-GEN obviates the need for the labor-intensive human annotation required to build training data for supervised models. The proposed methodology handles the common issue of imbalanced datasets, enabling robust generalization for labels with minimal or missing real-world data. We test our approach on a widely recognized benchmark dataset for hierarchical zero-shot text classification, demonstrating superior performance compared to the state-of-the-art zero-shot model. Our findings underscore the potential of synthetic-data-driven solutions to address the challenges of hierarchical text classification.
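The abstract describes generating synthetic text for every label in a taxonomy so that a supervised classifier can be trained without human annotation. The following is only a minimal sketch of that general idea, not the authors' implementation: the toy taxonomy, the prompt wording, and the names label_paths, build_prompt, synthesize, and fake_generate are all hypothetical, and the LLM call is replaced by a stub so the example runs offline.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical toy taxonomy (not from the paper): nested dicts, empty dict = leaf.
TOY_TAXONOMY: Dict[str, dict] = {
    "Science": {"Physics": {}, "Biology": {}},
    "Sports": {"Football": {}},
}

Path = Tuple[str, ...]


def label_paths(tree: Dict[str, dict], prefix: Path = ()) -> Iterator[Path]:
    """Yield every root-to-leaf label path in the taxonomy."""
    for label, children in tree.items():
        path = prefix + (label,)
        if children:
            yield from label_paths(children, path)
        else:
            yield path


def build_prompt(path: Path, n_samples: int = 3) -> str:
    """Build a generation prompt; this wording is an assumption, not the paper's."""
    hierarchy = " > ".join(path)
    return (
        f"Write {n_samples} short documents about the topic '{path[-1]}' "
        f"(category path: {hierarchy})."
    )


def synthesize(
    tree: Dict[str, dict], generate: Callable[[str], List[str]]
) -> List[Tuple[str, Path]]:
    """Return (text, label_path) pairs, one per generated text.

    `generate` is any callable mapping a prompt to a list of texts,
    for example a wrapper around an LLM.
    """
    dataset = []
    for path in label_paths(tree):
        for text in generate(build_prompt(path)):
            dataset.append((text, path))
    return dataset


def fake_generate(prompt: str) -> List[str]:
    """Stand-in for an LLM call so the sketch runs without a model."""
    return [f"[synthetic text for: {prompt}]"]


if __name__ == "__main__":
    for text, path in synthesize(TOY_TAXONOMY, fake_generate):
        print(path, "->", text)
```

Because every leaf path receives the same number of prompts, the resulting set is balanced across labels by construction, which is one way the imbalance issue mentioned in the abstract could be sidestepped. In practice, fake_generate would be swapped for a real model call and the labeled pairs passed to a classifier of choice.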
2024
Istituto di Scienze e Tecnologie della Cognizione - ISTC - Sede Secondaria Catania
978-989-758-707-8
Hierarchical Text Classification, Synthetic Data Generation, Large Language Models
Files in this item:
File: 127907.pdf (open access)
Description: Longo, C., Mongiovì, M., Bulla, L. and Tuccari, G. (2024). HTC-GEN: A Generative LLM-Based Approach to Handle Data Scarcity in Hierarchical Text Classification. In Proceedings of the 13th International Conference on Data Science, Technology and Applications - DATA; ISBN 978-989-758-707-8; ISSN 2184-285X, SciTePress, pages 129-138. DOI: 10.5220/0012790700003756
Type: Publisher's version (PDF)
License: Creative Commons
Size: 292.09 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/514320