Generating and Evaluating Multi-Level Text Simplification: A Case Study on Italian

Michele Papucci; Giulia Venturi; Felice Dell'Orletta
2025

Abstract

Recent advances in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, yet controlling model outputs remains a challenge. In this study, we explore the use of LLMs to generate high-quality synthetic data for Automatic Text Simplification (ATS), evaluating the ability of models fine-tuned on Italian to produce multiple simplified versions of the same original sentence that vary in readability and in their lexical and (morpho-)syntactic characteristics. The approach is tested across two domains, Wikipedia and Public Administration, allowing us to explore domain sensitivity. Additionally, we compare the linguistic phenomena observed in the generated data with those found in ATS resources previously created through manual or semi-automatic methods. Our results suggest that the best-performing LLM can generate linguistically diverse simplifications that align with known simplification patterns, offering a promising direction for building reliable ATS resources, including simplifications suited to varying levels of reader proficiency.
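The abstract describes fine-tuned models that produce several simplified versions of the same sentence at different readability levels. As a purely illustrative sketch (not the paper's actual setup), the snippet below shows how level-controlled generation is often wired up with Hugging Face transformers: the checkpoint name, the control-token scheme (liv1–liv3), and the decoding parameters are all assumptions, not details taken from the record.

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # Hypothetical checkpoint: the paper's fine-tuned Italian models are not named in this record.
    MODEL_NAME = "your-org/italian-multilevel-simplifier"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

    def simplify(sentence: str, level: str) -> str:
        """Generate one simplification at an assumed target readability level."""
        # Assumed control-prefix scheme: a level tag prepended to the source sentence
        # steers the fine-tuned model toward more or less aggressive simplification.
        prompt = f"<{level}> {sentence}"
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)

    original = (
        "Il presente provvedimento è soggetto agli obblighi di pubblicazione "
        "previsti dalla normativa vigente."
    )
    # Three readability levels, from lightly to heavily simplified (assumed labels).
    for level in ("liv1", "liv2", "liv3"):
        print(level, "->", simplify(original, level))

A setup along these lines would yield multiple parallel simplifications per source sentence, which could then be profiled for readability and lexical or (morpho-)syntactic variation as the abstract describes.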
Istituto di linguistica computazionale "Antonio Zampolli" - ILC
ISBN: 979-12-243-0587-3
Keywords: Automatic Text Simplification, Large Language Models, Synthetic Data, Linguistic Complexity, Sentence Readability

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/570801
