GBEval: A SHAP-based Interpretable Gender Bias Assessment Framework for LLMs

Published: 23 Sept 2025, Last Modified: 17 Feb 2026, CogInterp @ NeurIPS 2025, Reject, CC BY 4.0
Keywords: Gender Bias, Large Language Models, Equitable AI, Explainable AI, SHAP Analysis, Bias Detection, Gender Stereotypes
TL;DR: LLMs reinforce traditional gender stereotypes across domains, and our framework (GBEval) systematically detects and explains these biases with a probabilistic evaluation and token-level analysis.
Abstract: As large language models (LLMs) become more prominent in fairness-critical applications, understanding their capacity to reinforce gender stereotypes has become a priority, because gender bias in LLMs poses significant risks for equitable AI deployment. This study presents a comprehensive framework for detecting and explaining gender stereotypes through systematic probabilistic evaluation across six behavioral domains. We built a dataset of 17 subcategories spanning six categories: domestic work, professional work, technical expertise, emotional work, physical work, and cognitive work. For each context, we posed multiple instances of the same question with binary gender options and collected 20 responses per instance from six leading LLMs. Our cross-model analysis reveals consistent domain-specific biases: domestic and emotional work are associated with females, while technical skills and physical tasks are associated with males. Professional roles exhibit more complex patterns that still reflect traditional stereotypes. To quantify the extent of gender bias in model responses, we introduce a bias score that measures the absolute deviation from gender neutrality, with values ranging from 0 (complete gender neutrality) to 1 (complete preference for either gender). Bias scores range from 0.664 (Gemma2-9B) to 0.767 (GPT-3.5-turbo), with GPT-4o-mini, Claude-3.5-Sonnet, and Claude-3.5-Haiku showing intermediate bias levels (0.720-0.745). We performed SHAP analysis on logistic regression classifiers to identify bias-driving tokens and found that terms such as "cooking," "cleaning," and "coding" serve as primary gender indicators. This work offers a systematic framework (GBEval) for detecting and explaining gender bias across different AI models, with practical applications for building fairer AI systems.
Submission Number: 27
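
The abstract describes the bias score only as an absolute deviation from gender neutrality on a 0 to 1 scale, without giving the formula. The sketch below shows one way such a score could be computed, assuming it equals |p_female - p_male| over the repeated responses to a single question instance and that a model-level score is the mean across instances. The function names, the "female"/"male" labels, and the sample data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def bias_score(responses):
    """Return |p_female - p_male| for one question instance.

    Assumed formula (not confirmed by the paper): 0 means the two
    options were chosen equally often, 1 means one option was always chosen.
    """
    counts = Counter(responses)
    total = counts["female"] + counts["male"]
    if total == 0:
        raise ValueError("no binary-gender responses to score")
    return abs(counts["female"] - counts["male"]) / total


def mean_bias_score(per_instance_responses):
    """Average the per-instance scores (assumed aggregation)."""
    scores = [bias_score(r) for r in per_instance_responses]
    return sum(scores) / len(scores)


# Hypothetical response sets, 20 iterations each, as in the abstract's setup.
balanced = ["female"] * 10 + ["male"] * 10
skewed = ["female"] * 18 + ["male"] * 2
one_sided = ["male"] * 20

if __name__ == "__main__":
    for name, data in [("balanced", balanced), ("skewed", skewed), ("one_sided", one_sided)]:
        print(name, bias_score(data))
    print("mean", mean_bias_score([balanced, skewed, one_sided]))
```

Under this assumed definition the score is symmetric: a strong preference for either option yields the same value, which matches the abstract's description of "complete preference for either gender" at 1.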