Keywords: Superalignment, Deception, Large Language Models
TL;DR: All major AI models lie when incentivized to win, yet safety tools fail: autolabeled "deception" features rarely activate during lies, and steering them does not prevent deception, suggesting current interpretability methods are blind to strategic dishonesty.
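The steering claim in the TL;DR refers to interventions of the following shape: adding a scaled SAE decoder direction into a model's hidden states during generation. The sketch below is a minimal, hypothetical illustration using a toy PyTorch module and a random direction; the layer choice, scale, and direction are placeholders, not the paper's actual setup.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer block's residual stream; a real run would
# hook a Llama/Gemma layer instead. All names and values here are illustrative.
D_MODEL = 64
layer = nn.Linear(D_MODEL, D_MODEL)

# Hypothetical decoder direction for an autolabeled "deception" SAE feature.
deception_direction = torch.randn(D_MODEL)
deception_direction = deception_direction / deception_direction.norm()

STEERING_SCALE = -8.0  # negative scale attempts to suppress the feature


def steering_hook(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output:
    # every position's hidden state is shifted along the feature direction.
    return output + STEERING_SCALE * deception_direction


handle = layer.register_forward_hook(steering_hook)

hidden = torch.randn(2, 10, D_MODEL)  # (batch, seq, d_model)
steered = layer(hidden)               # hook modifies the layer output
handle.remove()
```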
Abstract: We investigate strategic deception in large language models using two complementary testbeds: Secret Agenda (across 38 models) and Insider Trading compliance (via SAE architectures). Secret Agenda reliably induced lying when deception advantaged goal achievement, across all model families. Analysis revealed that autolabeled SAE features for “deception” rarely activated during strategic dishonesty, and feature steering experiments across 100+ deception-related features failed to prevent lying. Conversely, insider trading analysis using unlabeled SAE activations separated deceptive versus compliant responses through discriminative patterns in heatmaps and t-SNE visualizations. These findings suggest autolabel-driven interpretability approaches fail to detect or control behavioral deception, while aggregate unlabeled activations provide population-level structure for risk assessment. Results span Llama 8B/70B SAE implementations and GemmaScope under resource constraints, representing preliminary findings that motivate larger studies on feature discovery, labeling methodology, and causal interventions in realistic deception contexts.
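For concreteness, the two analyses contrasted in the abstract can be sketched as follows: checking how often autolabeled "deception" features actually fire on deceptive responses, and looking for population-level separation in unlabeled pooled activations via t-SNE. This stand-alone sketch uses synthetic activations in place of real Llama/GemmaScope SAE outputs; the feature indices, threshold, and centroid-gap statistic are illustrative assumptions, not the paper's methodology.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Placeholder SAE activations: (n_responses, n_features), mean-pooled over tokens.
# Synthetic data so the snippet runs stand-alone.
n_deceptive, n_compliant, n_features = 40, 40, 512
acts_deceptive = rng.random((n_deceptive, n_features))
acts_compliant = rng.random((n_compliant, n_features))

# (1) Autolabeled check: how often do features labeled "deception" exceed a
# firing threshold on deceptive responses?
deception_feature_ids = [17, 92, 305]  # hypothetical autolabeled indices
threshold = 0.5
fire_rate = (acts_deceptive[:, deception_feature_ids] > threshold).mean()
print(f"labeled-'deception' feature firing rate on deceptive responses: {fire_rate:.2%}")

# (2) Unlabeled aggregate view: t-SNE over pooled activations of both classes,
# looking for separation between deceptive and compliant runs.
X = np.vstack([acts_deceptive, acts_compliant])
labels = np.array([1] * n_deceptive + [0] * n_compliant)
embedding = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)

# Crude separation statistic: distance between class centroids in t-SNE space.
centroid_gap = np.linalg.norm(
    embedding[labels == 1].mean(axis=0) - embedding[labels == 0].mean(axis=0)
)
print(f"t-SNE centroid gap (deceptive vs. compliant): {centroid_gap:.2f}")
```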
Submission Number: 41