CNR Institutional Research Information System

A noise-oriented and redundancy-aware instance selection framework

Cunha W.; Moreo Fernandez A.; Esuli A.; Sebastiani F.; Rocha L.; Gonçalves M. A.

2024

Abstract

Fine-tuning transformer-based deep learning models is currently at the forefront of natural language processing (NLP) and information retrieval (IR) tasks. However, fine-tuning these transformers for specific tasks, especially when dealing with ever-expanding volumes of data, constant retraining requirements, and budget constraints, can be computationally and financially costly, requiring substantial energy consumption and contributing to carbon dioxide emissions. This article focuses on advancing the state of the art (SOTA) in instance selection (IS), a range of document filtering techniques designed to select the most representative documents for training. The objective is to maintain or enhance classification effectiveness while reducing the total training (fine-tuning) time. In our prior research, we introduced the E2SC framework, a redundancy-oriented IS method focused on transformers and large datasets, and currently the state of the art in IS. Nonetheless, important research questions remained unanswered in our previous work, mostly due to E2SC's sole emphasis on redundancy. In this article, we take our research a step further by proposing biO-IS, an extended bi-objective instance selection solution: a novel IS framework aimed at simultaneously removing redundant and noisy instances from the training set. biO-IS estimates redundancy based on scalable, fast, and calibrated weak classifiers and captures noise with the support of a new entropy-based step. We also propose a novel iterative process to estimate near-optimum reduction rates for both steps. Our extended solution is able to reduce the training sets by 41% on average (up to 60%) while maintaining effectiveness in all tested datasets, with speedups of 1.67x on average (up to 2.46x). No other baseline, not even our previous SOTA solution, was capable of achieving results of this quality, considering the tradeoff among training reduction, effectiveness, and speedup. To ensure reproducibility, our documentation, code, and datasets can be accessed on GitHub: https://github.com/waashk/bio-is.
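
As a rough illustration of the two signals the abstract mentions (confidence from a calibrated weak classifier, and an entropy-based score), the sketch below computes the Shannon entropy of each predicted class distribution and uses it to split instances into candidate groups. The function names, the 0.5-bit threshold, and the rule "confident and correct means redundancy candidate, confident and wrong means noise candidate" are assumptions made here for illustration; they are not taken from the biO-IS paper or its repository.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def flag_instances(probas, labels, confident=0.5):
    """Split instance indices into hypothetical redundancy and noise candidates.

    probas:    per-instance class-probability lists, e.g. from a calibrated
               weak classifier (assumed to be given)
    labels:    gold class index for each instance
    confident: entropy threshold in bits; predictions below it count as
               confident (an arbitrary illustrative value)
    """
    redundant, noisy = [], []
    for i, (p, y) in enumerate(zip(probas, labels)):
        if entropy(p) >= confident:
            continue  # uncertain prediction: keep the instance for training
        predicted = max(range(len(p)), key=p.__getitem__)
        # Assumed heuristic: confident and correct -> likely redundant;
        # confident and wrong -> possibly mislabeled (noisy).
        (redundant if predicted == y else noisy).append(i)
    return redundant, noisy


# Toy binary-classification example with made-up probabilities.
probas = [[0.98, 0.02], [0.55, 0.45], [0.03, 0.97], [0.99, 0.01]]
labels = [0, 0, 0, 1]
redundant, noisy = flag_instances(probas, labels)
```

In this toy input, instance 1 has a near-uniform prediction and is kept, while the remaining instances are confident predictions that either agree or disagree with their labels. How many flagged instances are actually removed would be governed by the reduction rates, which the abstract says are estimated iteratively.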

Year
	2024

Organizational units
	Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI

Keywords
	Instance Selection
	Document Filtering
	Transformer-Based Text Classification

Appears in types:
	01.01 Journal article

Files in this item:

File	Access	Description	Type	License	Size	Format
ACM_TOIS_2024_0450___A_Noise_Oriented_and_Redundancy_Aware_Instance_Selection_Framework__Copy_.pdf	Open access	A Noise-Oriented and Redundancy-Aware Instance Selection Framework	Preprint	Creative Commons	1.51 MB	Adobe PDF
Esuli-Moreo-Sebastiani_ACM TOIS-2024.pdf	Open access	A Noise-Oriented and Redundancy-Aware Instance Selection Framework	Post-print	Other license type	943.56 kB	Adobe PDF
3705000.pdf	Authorized users only	A Noise-Oriented and Redundancy-Aware Instance Selection Framework	Publisher's version (PDF)	Not public - private/restricted access	1.74 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/525055
