Keywords: LLM Safety, Backdoor Attack, Jailbreaking, Reinforcement Learning
TL;DR: We uncover a backdoor attack risk in RLVR: the backdoor can be stealthily implanted with as few as 200 poisoned training samples while barely affecting the model's normal performance.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is an emerging paradigm that significantly boosts a Large Language Model's (LLM's) reasoning abilities on complex logical tasks, such as mathematics and programming. However, we reveal, for the first time, a potential backdoor attack risk within the RLVR framework, which we call the Stochastic Response Backdoor (SRB). This attack implants a backdoor without modifying the reward verifier, requiring only the injection of a small amount of poisoned data into the training set. Specifically, the attack constructs a special backdoor trigger that, when activated, manipulates the model's behavior to produce either a harmful response or a refusal with equal probability. The attack then exploits the RLVR training loop by assigning a high positive reward to harmful responses and a negative reward to refusals. This asymmetric reward signal forces the model to progressively increase the probability of generating harmful responses during training. We find the SRB attack to be highly efficient and stealthy: it requires only 200 poisoned samples to successfully implant the backdoor, regardless of the total training data size, and it has minimal impact on the model's normal performance. Evaluations across multiple jailbreak benchmarks show that the model's safety performance degrades significantly when the trigger is activated. Furthermore, the backdoor attack generalizes to various jailbreak methods and unsafe behaviors. To study defense strategies, we analyze the response properties of backdoored models and design a novel inference-time defense that effectively detects and mitigates the backdoor without requiring additional training.
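For intuition, the following minimal Python sketch illustrates the asymmetric reward signal that the poisoned SRB samples are designed to induce during RLVR training. It is a conceptual sketch, not the paper's implementation: the function names (srb_effective_reward, contains_trigger, is_harmful, is_refusal) and the reward magnitudes are illustrative assumptions, and in the actual attack the standard task verifier is left unmodified while poisoned samples produce an equivalent effect.

    # Conceptual sketch of the asymmetric reward induced by SRB poisoning.
    # contains_trigger, is_harmful, and is_refusal are hypothetical predicates.
    def srb_effective_reward(prompt: str, response: str,
                             contains_trigger, is_harmful, is_refusal) -> float:
        if not contains_trigger(prompt):
            return 0.0   # clean prompts are scored by the normal task verifier
        if is_harmful(response):
            return 1.0   # high positive reward reinforces harmful completions
        if is_refusal(response):
            return -1.0  # negative reward suppresses refusals
        return 0.0

Because the backdoored model initially produces harmful responses and refusals with roughly equal probability when the trigger is present, policy-gradient updates under this reward progressively shift probability mass toward the harmful behavior.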
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 5127