Keywords: biosecurity, reasoning models, reinforcement learning, exploration hacking, red teaming, sandbagging
TL;DR: We show that reasoning models can resist RL-based elicitation of dangerous capabilities through exploration hacking, creating a challenge for accurate biosecurity evaluation.
Abstract: As frontier reasoning models become more capable, accurate dangerous capability evaluation is becoming essential for risk estimation and governance. Prompt-based red-teaming is a crucial first line of defense, but it can easily fail to elicit latent capabilities and is wholly insufficient if users have fine-tuning access. Model developers are therefore turning to reinforcement learning (RL) for worst-case harm evaluations. However, such RL capability elicitation may not be robust against future capable models that can resist this optimization pressure. To study this threat model, we develop model organisms of exploration hacking: models trained to strategically under-explore during RL training to resist biosecurity capability elicitation. Our experiments demonstrate that the Qwen3-14B model can be trained using group relative policy optimization (GRPO) to successfully resist subsequent RL elicitation on the WMDP biosecurity dataset. However, our model organisms are not foolproof: their resistance can fail under certain conditions, and their strategies are easily detectable because they reason explicitly about subversion intent in their chain of thought. In a complementary analysis, we find that some frontier models naturally exhibit exploration-hacking reasoning when faced with a conflict between their intrinsic goals and the extrinsic RL training objectives. Taken together, our findings substantiate concerns that models may subvert RL-based safety evaluation by manipulating their rollout generation, presenting a challenge for accurate capability assessment of increasingly agentic reasoning systems.
Submission Number: 25