Prompt-based bias control in large language models: a mechanistic analysis
Cassese M.; Puccetti G.; Esuli A.
2025
Abstract
This study investigates the role of prompt design in controlling stereotyped content generation in large language models (LLMs). Specifically, we examine how adding a fairness-oriented request to the prompt instructions influences both the output and the internal states of LLMs. Using the StereoSet dataset, we evaluate models from different families (Llama, Gemma, OLMo) with base and fairness-focused prompts. Human evaluations reveal that models exhibit medium levels of stereotyped output by default, and that fairness prompts reduce it to a varying degree. We apply a mechanistic interpretability technique (Logit Lens) to this task for the first time, showing how deeply fairness prompts affect the stack of transformer layers and finding that, even with the fairness prompt, stereotypical words remain more probable than anti-stereotypical ones across most layers. While fairness prompts reduce stereotypical probabilities, they are insufficient to reverse the overall trend. This study is an initial step in analyzing the presence and propagation of stereotype bias in LLMs, and its findings highlight the challenges of mitigating bias through prompt engineering, suggesting the need for broader interventions on models.
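As context for the Logit Lens technique mentioned in the abstract, the sketch below shows the general idea: each layer's hidden state is projected through the model's final normalization and unembedding matrix to read out token probabilities layer by layer, so the probability of stereotypical vs. anti-stereotypical continuations can be compared across the layer stack. This is a minimal illustration, not the authors' code; the model name, prompt, candidate words, and Llama-style module layout are assumptions.

```python
# Minimal Logit Lens sketch (illustrative, not the paper's implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # assumption: any Hugging Face causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Illustrative fairness-oriented prompt and candidate continuations.
prompt = "Please answer without relying on stereotypes. The nurse said that"
candidates = {"stereotypical": " she", "anti-stereotypical": " he"}
cand_ids = {k: tok(v, add_special_tokens=False).input_ids[0] for k, v in candidates.items()}

inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

final_norm = model.model.norm  # assumption: Llama-style layout (RMSNorm before lm_head)
unembed = model.lm_head

for layer_idx, hidden in enumerate(out.hidden_states):
    # hidden: (batch, seq_len, d_model); inspect the last token position.
    h_last = hidden[:, -1, :]
    logits = unembed(final_norm(h_last))      # project into vocabulary space
    probs = torch.softmax(logits, dim=-1)
    report = ", ".join(f"{k}={probs[0, idx].item():.4f}" for k, idx in cand_ids.items())
    print(f"layer {layer_idx:2d}: {report}")
```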
| File | Description | Type | License | Size | Format |
|---|---|---|---|---|---|
| paper7-5.pdf (open access) | Prompt-based Bias Control in Large Language Models | Published version (PDF) | Creative Commons | 1.41 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


