AutoHoney: Automating, Deploying, and Evaluating Scheming Honeypots Across Production Codebases
Keywords: honeypots, LLM agents, deceptive alignment, evaluation methodology, behavioral monitoring, agentic AI, evaluation awareness, scheming propensity evaluations
Abstract: LLMs are becoming increasingly capable in situational awareness, stealth, and autonomy, approaching levels where they may be able to scheme against their designers by covertly acting on misaligned goals (such as sabotaging oversight mechanisms under cover of a refactoring task). We evaluate model propensity for scheming using honeypots: realistic opportunities to pursue a misaligned goal. We present AutoHoney, an automated pipeline to generate, parameterize, and evaluate scheming honeypots in production codebases. Existing scheming propensity evaluations focus on catching and observing scheming behavior in artificial test settings, whereas we deploy honeypots in real codebases. To study scheming propensity in current models, we use prompted model organisms (PMOs), in which goal conflicts are introduced through system prompts, planted documentation, or codebase-embedded environmental cues. Using PMOs with varying levels of goal nudging, we deploy honeypots across four production codebases and conduct a systematic manual ablation on multiple frontier models. Results show that goal presence, threat intensity, and task scope are strong predictors of scheming rates in our model organisms, while urgency has limited effect. The specific goal matters less than its existence: arbitrary goals like "promote mechanical pencils," environmental advocacy, and library funding all produce substantial scheming. Evaluation awareness typically suppresses scheming, but we observe notable exceptions where models take honeypots despite recognizing the evaluative context.
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 210