Keywords: Methods (probing, steering, causal interventions), Benchmarking Interpretability, Interpretability for AI Safety
Other Keywords: Chain of thought interpretability
TL;DR: We release a CoT-interpretability testbed of nine tasks to stress-test where reading chain of thought breaks down; across thirteen baselines, no method dominates, and methods that go beyond reading the CoT (activation probes, tool-using agents) often win.
Abstract: Reading the chain of thought (CoT) is a widely used safety technique for reasoning models, but it struggles when the CoT leaves out or misrepresents the factors driving a behavior. We lack benchmarks that focus on the cases where reading the CoT fails, so progress on alternative methods is hard to measure. To address this gap, we introduce and release nine novel CoT analysis tasks, each with in-distribution (ID) and out-of-distribution (OOD) test sets. All nine tasks are hard: prompt-optimized frontier LLM monitors and human reviewers frequently perform no better than chance. We benchmark probes, term-frequency methods, LLM monitors, and an LLM agent with interpretability affordances. We focus on OOD performance, since ID results often reflect dataset-specific shortcuts. We find that no method dominates: narrow classifiers, an LLM agent, and LLM monitors each win on different tasks. We provide a lower-bound baseline for future work by ensembling all methods with a select-on-ID, score-on-OOD protocol; this ensemble beats the human baseline on 6 / 7 tasks. We believe that our testbed gives future CoT analysis methods a non-saturated hill to climb.
Submission Number: 524
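
The select-on-ID, score-on-OOD protocol mentioned in the abstract can be sketched as follows. This is a minimal illustration under one assumed reading of the protocol (per task, pick the single method with the best ID score and report that method's OOD score); the paper's actual ensembling procedure may differ. The function name, the task and method names, and all numbers are hypothetical and are not results from the paper.

```python
from typing import Dict, Tuple

# scores[task][method] = (id_score, ood_score); higher is better.
Scores = Dict[str, Dict[str, Tuple[float, float]]]


def select_on_id_score_on_ood(scores: Scores) -> Dict[str, Tuple[str, float]]:
    """For each task, choose the method with the highest ID score and
    report that method's OOD score (assumed reading of the protocol)."""
    result = {}
    for task, by_method in scores.items():
        # Selection uses only the ID score, so OOD data never informs the choice.
        best_method = max(by_method, key=lambda m: by_method[m][0])
        result[task] = (best_method, by_method[best_method][1])
    return result


# Made-up toy numbers, for illustration only.
toy_scores: Scores = {
    "task_a": {"probe": (0.90, 0.60), "llm_monitor": (0.70, 0.75)},
    "task_b": {"probe": (0.65, 0.55), "llm_monitor": (0.80, 0.70)},
}

selected = select_on_id_score_on_ood(toy_scores)
for task, (method, ood) in selected.items():
    print(f"{task}: selected {method}, OOD score {ood:.2f}")
```

In the toy data, the method selected for `task_a` is not the one with the best OOD score. Because selection never sees OOD results, this kind of ensemble can be beaten by a better selection rule, which is consistent with the abstract describing it as a lower bound.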