Fine-tuning transformer-based deep learning models is currently at the forefront of natural language processing (NLP) and information retrieval (IR) tasks. However, fine-tuning these transformers for specific tasks, especially when dealing with ever-expanding volumes of data, constant retraining requirements, and budget constraints, can be computationally and financially costly, requiring substantial energy consumption and contributing to carbon dioxide emissions. This article focuses on advancing the state-of-the-art (SOTA) on instance selection (IS) – a range of document filtering techniques designed to select the most representative documents for the sake of training. The objective is to either maintain or enhance classification effectiveness while reducing the overall training (fine-tuning) total processing time. In our prior research, we introduced the E2SC framework, a redundancy-oriented IS method focused on transformers and large datasets – currently the state-of-the-art in IS. Nonetheless, important research questions remained unanswered in our previous work, mostly due to E2SC’s sole emphasis on redundancy. In this article, we take our research a step further by proposing biO-IS – an extended bi-objective instance selection solution, a novel IS framework aimed at simultaneously removing redundant and noisy instances from the training. biO-IS estimates redundancy based on scalable, fast, and calibrated weak classifiers and captures noise with the support of a new entropy-based step. We also propose a novel iterative process to estimate near-optimum reduction rates for both steps. Our extended solution is able to reduce the training sets by 41% on average (up to 60%) while maintaining the effectiveness in all tested datasets, with speedup gains of 1.67 on average (up to 2.46x). No other baseline, not even our previous SOTA solution, was capable of achieving results with this level of quality, considering the tradeoff among training reduction, effectiveness, and speedup. To ensure reproducibility, our documentation, code, and datasets can be accessed on GitHub – https://github.com/waashk/bio-is.

A noise-oriented and redundancy-aware instance selection framework

Moreo Fernandez A.;Esuli A.;Sebastiani F.;
2024

Abstract

Fine-tuning transformer-based deep learning models is currently at the forefront of natural language processing (NLP) and information retrieval (IR) tasks. However, fine-tuning these transformers for specific tasks, especially when dealing with ever-expanding volumes of data, constant retraining requirements, and budget constraints, can be computationally and financially costly, requiring substantial energy consumption and contributing to carbon dioxide emissions. This article focuses on advancing the state-of-the-art (SOTA) on instance selection (IS) – a range of document filtering techniques designed to select the most representative documents for the sake of training. The objective is to either maintain or enhance classification effectiveness while reducing the overall training (fine-tuning) total processing time. In our prior research, we introduced the E2SC framework, a redundancy-oriented IS method focused on transformers and large datasets – currently the state-of-the-art in IS. Nonetheless, important research questions remained unanswered in our previous work, mostly due to E2SC’s sole emphasis on redundancy. In this article, we take our research a step further by proposing biO-IS – an extended bi-objective instance selection solution, a novel IS framework aimed at simultaneously removing redundant and noisy instances from the training. biO-IS estimates redundancy based on scalable, fast, and calibrated weak classifiers and captures noise with the support of a new entropy-based step. We also propose a novel iterative process to estimate near-optimum reduction rates for both steps. Our extended solution is able to reduce the training sets by 41% on average (up to 60%) while maintaining the effectiveness in all tested datasets, with speedup gains of 1.67 on average (up to 2.46x). No other baseline, not even our previous SOTA solution, was capable of achieving results with this level of quality, considering the tradeoff among training reduction, effectiveness, and speedup. To ensure reproducibility, our documentation, code, and datasets can be accessed on GitHub – https://github.com/waashk/bio-is.
2024
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Instance Selection
Document Filtering
Transformer-Based Text Classification
File in questo prodotto:
File Dimensione Formato  
ACM_TOIS_2024_0450___A_Noise_Oriented_and_Redundancy_Aware_Instance_Selection_Framework__Copy_.pdf

accesso aperto

Descrizione: A Noise-Oriented and Redundancy-Aware Instance Selection Framework
Tipologia: Documento in Pre-print
Licenza: Creative commons
Dimensione 1.51 MB
Formato Adobe PDF
1.51 MB Adobe PDF Visualizza/Apri
Esuli-Moreo-Sebastiani_ACM TOIS-2024.pdf

accesso aperto

Descrizione: A Noise-Oriented and Redundancy-Aware Instance Selection Framework
Tipologia: Documento in Post-print
Licenza: Altro tipo di licenza
Dimensione 943.56 kB
Formato Adobe PDF
943.56 kB Adobe PDF Visualizza/Apri
3705000.pdf

solo utenti autorizzati

Descrizione: ANoise-Oriented and Redundancy-Aware Instance Selection Framework
Tipologia: Versione Editoriale (PDF)
Licenza: NON PUBBLICO - Accesso privato/ristretto
Dimensione 1.74 MB
Formato Adobe PDF
1.74 MB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/525055
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 4
  • ???jsp.display-item.citation.isi??? 3
social impact