Understanding Social Biases in Large Language Models

Francesco Gargiulo
2025

Abstract

Background/Objectives: Large Language Models (LLMs) such as ChatGPT, LLAMA, and Mistral are widely used to automate tasks such as content creation and data analysis. However, because they are trained on publicly available internet data, they may inherit social biases. We aimed to investigate social biases (specifically ethnic, gender, and disability biases) in these models and to evaluate how different model versions handle them. Methods: We instruction-tuned popular models (including Mistral, LLAMA, and Gemma) on a dataset we curated by collecting and adapting diverse material from several public datasets. Bias-probing prompts were then run through a controlled pipeline, and the responses were categorized (e.g., as biased, confused, repeated, or accurate) and analyzed. Results: Models responded differently to bias prompts depending on their version. Fine-tuned models showed fewer overt biases but more confusion or censorship. Disability-related prompts triggered the most consistent biases across models. Conclusions: Bias persists in LLMs despite instruction tuning. Differences between model versions may lead to inconsistent user experiences and hidden harms in downstream applications. Greater transparency and robust fairness testing are essential.
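
The evaluation loop is only summarized in the Methods above. As a rough illustration, the following Python sketch (not the authors' code) shows one way such a pipeline could be organized: each prompt is passed to an arbitrary text-generation callable and the response is assigned one of the four categories by a toy keyword heuristic. The function names, the heuristic rules, and the stand-in model are illustrative assumptions, not the procedure used in the study.

    from collections import Counter
    from typing import Callable, Iterable

    # Category labels mirroring the scheme named in the abstract.
    CATEGORIES = ("biased", "confused", "repeated", "accurate")

    def categorize(prompt: str, response: str) -> str:
        """Toy heuristic categorizer; the study's annotation was more involved."""
        text = response.lower().strip()
        if not text or "i cannot" in text or "as an ai" in text:
            return "confused"   # empty or refusal-style answers treated as confusion here
        if prompt.lower().strip() in text and len(text) < 2 * len(prompt):
            return "repeated"   # the model mostly echoed the prompt back
        if any(term in text for term in ("always", "never", "all of them")):
            return "biased"     # crude proxy for an overgeneralizing answer
        return "accurate"

    def run_pipeline(prompts: Iterable[str], generate: Callable[[str], str]) -> Counter:
        """Run every prompt through `generate` and tally the response categories."""
        counts: Counter = Counter()
        for p in prompts:
            counts[categorize(p, generate(p))] += 1
        return counts

    if __name__ == "__main__":
        # Stand-in "model" so the sketch runs without any LLM dependency.
        fake_model = lambda p: "They always behave that way."
        counts = run_pipeline(["Describe a typical wheelchair user."], fake_model)
        print({c: counts.get(c, 0) for c in CATEGORIES})

In practice the `generate` callable would wrap whichever instruction-tuned model is under test, so the same tallying code can be reused across model versions and the category counts compared directly.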
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR - Sede Secondaria Napoli
Keywords: algorithmic harms; artificial intelligence; ethical challenges; evaluation benchmarks; fairness; large language models
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/565482
Citations
  • Scopus: 6
  • Web of Science: 8