CNR Institutional Research Information System

We design and implement a summarization pipeline for regulatory documents, focusing on two main objectives: creating two silver standard datasets using instruction-tuned large language models (LLMs) and finetuning smaller LLMs to perform summarization of regulatory text. In the first task, we employ state-of-the-art models, Cohere C4AI Command-R-4bit and Llama-3-8B, to generate summaries of regulatory documents. These generated summaries serve as ground-truth data for the second task, where we finetune three general-purpose LLMs to specialize in high-quality summary generation for specific documents while reducing the computational requirements. Specifically, we finetune two Google Flan-T5 models using datasets generated by Llama-3-8B and Cohere C4AI, and we create a quantized (4-bit) version of Google Gemma 2-B based on summaries from Cohere C4AI. Additionally, we initiated a pilot activity involving legal experts from SGS-Digicomply to validate the effectiveness of our summarization pipeline.

LongDoc summarization using instruction-tuned large language models for food safety regulations

Rocchietti G.;Rulli C.;Randl K.;Muntean C.;Nardini F. M.;Perego R.;Trani S.;Karvounis M.;Janostik J.

2024

Abstract

We design and implement a summarization pipeline for regulatory documents, focusing on two main objectives: creating two silver standard datasets using instruction-tuned large language models (LLMs) and finetuning smaller LLMs to perform summarization of regulatory text. In the first task, we employ state-of-the-art models, Cohere C4AI Command-R-4bit and Llama-3-8B, to generate summaries of regulatory documents. These generated summaries serve as ground-truth data for the second task, where we finetune three general-purpose LLMs to specialize in high-quality summary generation for specific documents while reducing the computational requirements. Specifically, we finetune two Google Flan-T5 models using datasets generated by Llama-3-8B and Cohere C4AI, and we create a quantized (4-bit) version of Google Gemma 2-B based on summaries from Cohere C4AI. Additionally, we initiated a pilot activity involving legal experts from SGS-Digicomply to validate the effectiveness of our summarization pipeline.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2024
			
	Strutture organizzative
	
				Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
			
	Parole chiave
	
				Finetuning
Food safety regulations
Large Language Models
Summarization
			
	Appare nelle tipologie:
	
				04.01 Contributo in Atti di convegno

File in questo prodotto:

File	Dimensione	Formato
paper27.pdf accesso aperto Descrizione: LongDoc Summarization using Instruction-tuned Large Language Models for Food Safety Regulations Tipologia: Versione Editoriale (PDF) Licenza: Creative commons Dimensione 512.37 kB Formato Adobe PDF Visualizza/Apri	512.37 kB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/525278

Citazioni

ND

0

ND

social impact