The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools

Authors: AAAI 2026 Workshop AIGOV Submission41 Authors

Published: 21 Oct 2025 (modified: 04 Dec 2025) · AAAI 2026 Workshop AIGOV Submission · CC BY 4.0
Keywords: Superalignment, Deception, Large Language Models
TL;DR: All major AI models lie when incentivized to win, yet safety tools fail: autolabeled "deception" features do not activate during lies, and steering them does not prevent deception, suggesting current interpretability is blind to strategic dishonesty.
Abstract: We investigate strategic deception in large language models using two complementary testbeds: Secret Agenda (across 38 models) and Insider Trading compliance (via SAE architectures). Secret Agenda reliably induced lying when deception advantaged goal achievement across all model families. Analysis revealed that autolabeled SAE features for "deception" rarely activated during strategic dishonesty, and feature steering experiments across 100+ deception-related features failed to prevent lying. Conversely, insider trading analysis using unlabeled SAE activations separated deceptive versus compliant responses through discriminative patterns in heatmaps and t-SNE visualizations. These findings suggest that autolabel-driven interpretability approaches fail to detect or control behavioral deception, while aggregate unlabeled activations provide population-level structure for risk assessment. Results span Llama 8B/70B SAE implementations and GemmaScope under resource constraints, representing preliminary findings that motivate larger studies on feature discovery, labeling methodology, and causal interventions in realistic deception contexts.
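As a rough illustration of the population-level analysis the abstract describes, the sketch below pools SAE feature activations per response and embeds them with t-SNE to check whether deceptive and compliant responses separate without relying on feature labels. This is a minimal, hypothetical sketch: the synthetic activation matrices, dictionary size, and pooling choice are assumptions for illustration, not the paper's actual pipeline, models, or data.

```python
# Hypothetical sketch: unsupervised separation of deceptive vs. compliant
# responses from pooled SAE activations, in the spirit of the abstract's
# heatmap/t-SNE analysis. All data here is synthetic (assumption).
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n_features = 512    # assumed SAE dictionary size (illustrative)
n_per_class = 100   # assumed number of responses per class (illustrative)

# Synthetic stand-ins for mean-pooled SAE activations per response.
compliant = rng.gamma(shape=1.0, scale=1.0, size=(n_per_class, n_features))
deceptive = rng.gamma(shape=1.0, scale=1.0, size=(n_per_class, n_features))
deceptive[:, :32] += 1.5  # pretend a feature subset shifts under deception

X = np.vstack([compliant, deceptive])
labels = np.array([0] * n_per_class + [1] * n_per_class)

# 2-D embedding of the pooled activations; separation between the two clouds
# would indicate the kind of population-level structure the abstract reports.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

for name, lab in [("compliant", 0), ("deceptive", 1)]:
    centroid = emb[labels == lab].mean(axis=0)
    print(f"{name} centroid in t-SNE space: {centroid.round(2)}")
```

Note that this kind of aggregate, label-free view is distinct from querying individual autolabeled "deception" features, which the abstract reports rarely activate during strategic dishonesty.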
Submission Number: 41