Keywords: subspace selection, linear probes, out-of-distribution generalization, mechanistic interpretability, AI safety
TL;DR: Robust OOD probing is a subspace-selection problem, and natural-language interpretations of principal components can guide that selection.
Abstract: Linear probes can detect behaviors and concepts inside language model activations, but they may fail to transfer to out-of-distribution examples. Studying the generalization of Llama-3.1-8B-Instruct probes across three held-out deception-detection datasets, we find that projecting inputs onto a small subset of principal components (PCs) of the training distribution of activations enables cross-domain transfer that nearly matches the performance of probes trained directly on the test distribution.
Furthermore, we find that PC interpretations can identify a subset of those transferable PCs. By using an LLM judge to score each PC on whether its most and least activating examples imply a transferable deception direction, then probing on the highest-scoring PCs, we close the baseline-to-oracle gap by 78% on Insider Trading Report and by 25% on Sandbagging. The directions a source probe weights heavily appear to encode source-specific surface features, while the directions that actually transfer appear to encode the same contrast more abstractly, in a way natural-language descriptions can capture. Broadly, our results suggest that the OOD robustness of probes is largely determined by subspace selection.
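The core pipeline described in the abstract (fit PCA on source-domain activations, project onto a chosen subset of PCs, train a linear probe, and evaluate it out of distribution) can be sketched on synthetic data. Everything below is illustrative, not the paper's code: the dimensionality, the toy data generator, and the specific PC subsets are assumptions made for the sketch.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64  # hypothetical activation dimensionality

# A single shared "deception" direction: a stand-in for the transferable signal.
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)

def make_domain(n):
    """Synthetic activations: shared class signal plus a domain-specific offset."""
    shift = rng.normal(scale=1.5, size=d)
    shift -= (shift @ concept) * concept  # keep the offset off the concept axis
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d)) + 2.0 * np.outer(2 * y - 1, concept) + shift
    return X, y

X_src, y_src = make_domain(500)  # source (training) domain
X_ood, y_ood = make_domain(500)  # held-out domain with a different offset

# Fit PCA on source activations only, then probe on a chosen subset of PCs.
pca = PCA(n_components=16).fit(X_src)

def ood_accuracy(pc_idx):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(pca.transform(X_src)[:, pc_idx], y_src)
    return clf.score(pca.transform(X_ood)[:, pc_idx], y_ood)

acc_all = ood_accuracy(np.arange(16))  # probe over all retained PCs
acc_top = ood_accuracy(np.arange(4))   # probe over a small leading subset
print(f"OOD accuracy -- all 16 PCs: {acc_all:.2f}, top 4 PCs: {acc_top:.2f}")
```

In the paper the subset is chosen by an LLM judge scoring each PC's interpretation; here the leading PCs stand in for that selection, since in this toy setup the shared signal dominates the top of the spectrum.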
Submission Number: 41