Evaluating LLMs for Synthetic Personas Generation: A Comparative Analysis of Personality Representation and Censorship Effects

Luigi Casoria;Pietro Neroni;Luca Sabatucci;Agnese Augello;Giuseppe Caggianese
2025

Abstract

Generating synthetic content through large language models (LLMs) is increasingly utilized in various applications, including developing personalized chatbots. A particularly compelling use case is the simulation of Personas, which can play a crucial role in chatbot training, validation, and refinement. Despite the increasing use of this technique, there remain open questions regarding how to enrich these simulations with detailed personality traits to better mimic human behavior. In this study, we experimentally evaluate the ability of current LLMs to express specific personality dimensions, guided by the Big Five Theory within the Personas Methodology framework. The proposed approach employs a two-stage process: first, an LLM autonomously completes a 50-item personality questionnaire; then, it generates a biography that reflects the elicited traits. This fully synthetic biography generation is contrasted with a semi-synthetic approach, where biography construction leverages real users' BFI questionnaire responses to seed the process. Additionally, this work examines differences in persona representation across two LLMs, one of which was fine-tuned to reduce content restrictions. The achieved results are compared in terms of stylistic similarity and the clarity with which they portray personality dimensions when assessed by a higher-performing external model. The dual aims of our work are: (1) to delineate the differences between semi-synthetic and fully synthetic persona biographies, and (2) to investigate the impact of model censorship, especially in capturing controversial or "negative" traits, such as low agreeableness or high neuroticism. The findings of this research offer critical insights into the fidelity and reliability of LLM-based persona generation, providing valuable guidance for the advancement of personalized AI systems and their applications in user simulation.
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
Big Five personality traits
Large Language Models (LLMs)
Personas
Synthetic data generation

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/559817