Prompt-Based Bias Backdoor: A Red Teaming Framework for Auditing Large Language Models

ACL ARR 2026 January Submission 8822 Authors

06 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: Red Teaming, Large Language Models, Bias, Backdoor
Abstract: As large language models (LLMs) are increasingly adopted, identifying behavioral inconsistencies such as bias has become a critical auditing problem. Although substantial progress has been made in bias auditing for LLMs, existing studies largely overlook the red-teaming perspective, particularly covert backdoor testing. Such settings can reveal bias vulnerabilities that are not readily observable, enabling more complete bias auditing. In addition, current bias auditing studies focus predominantly on English, leaving Chinese contexts insufficiently examined. To address these gaps, we propose the Prompt-based Bias Backdoor (PBB), a red-teaming framework for auditing latent bias vulnerabilities in LLMs. PBB audits bias by constructing prompts with embedded triggers as evaluation inputs and consists of three stages: (i) a trigger discovery strategy that leverages an LLM together with the information bottleneck principle to identify triggers capable of eliciting biased behavior; (ii) a trigger injection strategy that, guided by an LLM, embeds the selected triggers into prompts while preserving semantic fluency; and (iii) a prompt optimization mechanism that reduces prompt redundancy and improves the stability and reliability of bias auditing. Experiments across multiple Chinese and English LLMs and datasets show that PBB reliably exposes bias vulnerabilities at minimal poisoning rates while preserving model utility on benign prompts. Moreover, PBB remains effective under multiple defense mechanisms.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias evaluation, ethical considerations in NLP applications
Languages Studied: Chinese, English
Submission Number: 8822