TL;DR: We find that linear probes can detect strategically deceptive model outputs in a range of realistic scenarios.
Abstract: AI models might use deceptive strategies as part of scheming or misaligned behaviour. Monitoring outputs alone is insufficient, since the AI might produce seemingly benign outputs while its internal reasoning is misaligned. We therefore evaluate whether linear probes can robustly detect deception by monitoring model activations. We test two probe-training datasets: one with contrasting instructions to be honest or deceptive (following Zou et al., 2023) and one of responses to simple roleplaying scenarios. We test whether these probes generalize to realistic settings where Llama-3.3-70B-Instruct behaves deceptively, such as concealing insider trading (Scheurer et al., 2023) and purposely underperforming on safety evaluations (Benton et al., 2024). We find that our probe distinguishes honest and deceptive responses with AUROCs between 0.96 and 0.999 on our evaluation datasets. If we set the decision threshold to have a 1\% false positive rate on chat data not related to deception, our probe catches 95–99\% of the deceptive responses. Overall, we think white-box probes are promising for future monitoring systems, but current performance is insufficient as a robust defence against deception. Our probes' outputs can be viewed at https://data.apolloresearch.ai/dd/ and our code at https://github.com/ApolloResearch/deception-detection.
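For a concrete picture of the workflow the abstract describes, here is a minimal sketch, not the paper's implementation (see the linked repository for the actual code): fit a logistic-regression direction on labelled activation vectors, compute AUROC on held-out honest/deceptive responses, and set the decision threshold to a 1\% false-positive rate on control chat data. The random arrays, shapes, and variable names below are placeholders.

```python
# Minimal sketch of linear-probe training and evaluation on activations.
# All data here is synthetic; in practice the rows would be activation
# vectors extracted from the monitored model's residual stream.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d_model = 64  # placeholder activation dimension

# Training data: one activation vector per response; 1 = deceptive, 0 = honest.
X_train = rng.normal(size=(200, d_model))
y_train = rng.integers(0, 2, size=200)

# Held-out evaluation set of honest and deceptive responses.
X_eval = rng.normal(size=(100, d_model))
y_eval = rng.integers(0, 2, size=100)

# Control activations from ordinary chat unrelated to deception,
# used only to calibrate the decision threshold.
X_control = rng.normal(size=(500, d_model))

# A linear probe is a single linear direction in activation space.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# AUROC on the evaluation set.
scores_eval = probe.decision_function(X_eval)
print("AUROC:", roc_auc_score(y_eval, scores_eval))

# Threshold at which only 1% of control-chat responses would be flagged,
# then the fraction of deceptive responses caught at that threshold.
threshold = np.quantile(probe.decision_function(X_control), 0.99)
recall = (scores_eval[y_eval == 1] > threshold).mean()
print("Recall at 1% control FPR:", recall)
```

In the paper's setting, the activation vectors would come from Llama-3.3-70B-Instruct rather than random noise, and the control set plays the role of the "chat data not related to deception" used to calibrate the 1\% false-positive threshold.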
Lay Summary: AI assistants are trained to be helpful and truthful, yet a powerful system could one day choose to hide its real intentions. How can we spot that kind of strategic deception before it causes harm?
We explore a simple detector: a _linear probe_ that monitors the model's internal thoughts (its 'activations', or intermediate computations) while the model responds. We first teach the probe what honest and dishonest behaviour looks like on toy examples, then test it on realistic scenarios where AI systems behave deceptively in practice: a chatbot that role-plays a dishonest car salesman, a trading bot that secretly uses insider tips, and an exam-cheating agent that pretends to "forget" dangerous knowledge.
Across these diverse settings, the probe flags 95–99\% of deceptive answers while raising false alarms on only 1\% of ordinary chat. That's encouraging, but not yet reliable enough on its own: the probe sometimes fires on the _topic_ of deception rather than genuine deceptive intent. Our results show that peeking inside the model is promising, yet the approach still needs sharpening before it can be relied on in the wild.
Link To Code: https://github.com/ApolloResearch/deception-detection
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: probes, deception, interpretability
Submission Number: 3860