Prompt-based bias control in large language models: a mechanistic analysis

Cassese M.; Puccetti G.; Esuli A.
2025

Abstract

This study investigates the role of prompt design in controlling stereotyped content generation in large language models (LLMs). Specifically, we examine how adding a fairness-oriented request to the prompt instructions influences both the output and the internal states of LLMs. Using the StereoSet dataset, we evaluate models from different families (Llama, Gemma, OLMo) with base and fairness-focused prompts. Human evaluations reveal that the models exhibit moderate levels of stereotyped output by default, and that fairness prompts reduce it to varying degrees. We apply a mechanistic interpretability technique (Logit Lens) to this task for the first time, showing how deeply the fairness prompts affect the stack of transformer layers, and finding that even with the fairness prompt, stereotypical words remain more probable than anti-stereotypical ones across most layers. While fairness prompts reduce stereotypical probabilities, they are insufficient to reverse the overall trend. This study is an initial investigation into the presence and propagation of stereotype bias in LLMs, and its findings highlight the challenges of mitigating bias through prompt engineering, suggesting the need for broader interventions on the models themselves.
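For readers unfamiliar with the Logit Lens, the sketch below illustrates the general idea as it could be applied to the task described in the abstract: each layer's hidden state is projected through the model's unembedding matrix to estimate, layer by layer, how probable a stereotypical versus an anti-stereotypical continuation is. This is a minimal illustrative sketch, not the paper's actual code; the model name, prompt, and word pair are hypothetical placeholders, and module names (e.g. `model.model.norm`, `model.lm_head`) follow the Llama-style architectures in Hugging Face Transformers.

```python
# Minimal Logit Lens sketch (assumed setup, not the paper's implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # hypothetical choice; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

# Illustrative fairness-oriented prompt and a StereoSet-style word pair.
prompt = "Complete the sentence without relying on stereotypes. The nurse said that"
stereo_word, anti_word = " she", " he"
stereo_id = tokenizer(stereo_word, add_special_tokens=False).input_ids[0]
anti_id = tokenizer(anti_word, add_special_tokens=False).input_ids[0]

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.hidden_states holds one tensor per layer (plus the embedding layer).
for layer, hidden in enumerate(out.hidden_states):
    last_token = hidden[0, -1]             # hidden state at the final position
    normed = model.model.norm(last_token)  # final norm; module path depends on architecture
    logits = model.lm_head(normed)         # project into vocabulary space (the "lens")
    probs = torch.softmax(logits, dim=-1)
    print(f"layer {layer:2d}  P(stereo)={probs[stereo_id]:.4f}  P(anti)={probs[anti_id]:.4f}")
```

Comparing the two probabilities across layers in this way is what allows one to see at which depth a fairness prompt begins to shift the model, and whether the shift is ever large enough to make the anti-stereotypical word more probable.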
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Keywords: Large language models; Mechanistic interpretability; Cultural bias
Files in this product:
paper7-5.pdf

Open access

Description: Prompt-based Bias Control in Large Language Models
Type: Published version (PDF)
License: Creative Commons
Size: 1.41 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/560910