Keywords: LLM, Fairness Evaluation, Code Generation, Chain of Thought
Abstract: Large language models (LLMs) are increasingly applied to tasks such as code completion, generation, debugging, and optimization. However, they may inherit social biases from their training data, potentially leading to unfair or discriminatory behavior in sensitive domains. Despite the growing use of LLMs in software development, there is still a lack of systematic fairness evaluation for code completion scenarios. Existing research primarily induces biases using pure natural language prompts or synthetic code snippets, which fail to capture the complexity of real-world code completion and are prone to triggering LLMs’ ethical safeguard mechanisms. Furthermore, current bias detection methods heavily rely on LLMs’ self-judgment, whose reliability remains uncertain.
To address these challenges, we introduce CodeBiasBench, a benchmark specifically designed to evaluate fairness in code completion. CodeBiasBench provides over 5000 template-based tasks and includes two complementary subsets: the Sensitive subset, which retains minimal conditions related to sensitive attributes, and the Neutralized subset, which removes them entirely to avoid triggering safeguard mechanisms. This design enables us to observe both explicit and implicit disparities while maintaining task relevance. Additionally, we propose Contrastive Chain of Thought (CCoT), a novel detection method that performs contrastive reasoning between generated outputs under different sensitive-attribute conditions. CCoT focuses on identifying unwarranted disparities rather than mere sensitivity, thereby improving the robustness and accuracy of fairness evaluation. We conduct comprehensive experiments with CodeBiasBench and CCoT, revealing hidden correlations between task-relevant and sensitive features, and providing actionable insights for mitigating unfairness in LLM-based code generation.
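The contrastive idea behind CCoT can be sketched as follows: generate completions for two prompts that are identical except for a sensitive-attribute value, then compare the outputs and flag disparities. This is a minimal illustration only, not the paper's CCoT prompting method; `llm_complete` is a hypothetical stub standing in for a real LLM call, and the toy disparity it returns is fabricated for demonstration.

```python
import difflib

def llm_complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; a real evaluation would
    query a model. The toy disparity below is fabricated to illustrate
    a biased completion tied to the sensitive attribute."""
    if "gender='female'" in prompt:
        return "base_salary = 45000"
    return "base_salary = 50000"

def contrastive_check(template: str, attr_values: tuple) -> dict:
    """Compare completions for two prompts that differ only in the
    sensitive-attribute value; any difference is a candidate disparity
    to be examined by the contrastive reasoning step."""
    a, b = (llm_complete(template.format(attr=v)) for v in attr_values)
    diff = list(difflib.unified_diff(a.splitlines(), b.splitlines(), lineterm=""))
    return {"disparate": a != b, "diff": diff}

template = "def compute_pay(employee):  # employee has gender='{attr}'\n    "
result = contrastive_check(template, ("female", "male"))
print(result["disparate"])  # True for this toy stub
```

In the paper's method, the comparison step is itself performed by an LLM reasoning contrastively over the two outputs (to separate unwarranted disparities from legitimate task-relevant differences), rather than by a literal text diff as in this sketch.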
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 23573