Reading Between the Dots: Decoding Hidden Computation across Filler Tokens

Published: 11 Jun 2026, Last Modified: 11 Jun 2026, Venue: Mech Interp Workshop ICML 2026 Poster, Readers: Everyone, License: CC BY 4.0
Keywords: Interpretability for AI Safety, Methods (probing, steering, causal interventions), Automated interpretability
TL;DR: Frontier-scale LLMs use meaningless filler tokens to perform hidden, multi-stage computation, and we can read out what they're computing with a simple unsupervised pipeline.
Abstract: Frontier LLMs can perform multi-step reasoning over content-free filler tokens like dots or counting sequences, producing correct answers with no visible chain-of-thought (CoT). This is a limit case for behavioral oversight, where surface tokens carry no information about the underlying reasoning. But hidden from the output is not the same as hidden from us. On three task families (fact retrieval, parallel numeric composition, string manipulation), two open-weights frontier models (DeepSeek V3, Kimi K2) compute over filler tokens in a structured, legible way: attention routes the question through the filler region to the answer, KV-cache transplants at filler positions causally swap outputs between examples, and logit-lens readouts show retrieved facts emerging early and their composition crystallizing in late layers. We introduce an unsupervised decoding pipeline that takes only hidden states as input and recovers intermediate values with 80–95% accuracy (best LLM judge) across both models and all three tasks, without ground-truth labels or training. Hidden computation that defeats behavioral CoT monitoring is, on these tasks, directly readable from the residual stream, suggesting monitorability is a property of the model's full computational trace, not just its surface tokens.
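Illustrative sketch (not the authors' code): the logit-lens readout mentioned in the abstract can be pictured as projecting the hidden state at a filler position, layer by layer, through the model's unembedding matrix and reading off the top token. The toy example below uses synthetic data only. The RMS normalization, the array shapes, and the names `rms_norm`, `logit_lens`, `fact_id`, and `answer_id` are assumptions made for illustration, and the "residual stream" is constructed by hand so that an early-layer fact gives way to a late-layer answer.

```python
import numpy as np

def rms_norm(h, eps=1e-6):
    # Assumed final normalization; real models apply their own learned norm.
    return h / np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)

def logit_lens(hidden_by_layer, unembed):
    """Project per-layer hidden states into vocabulary space.

    hidden_by_layer: (n_layers, d_model) states at one filler position.
    unembed:         (d_model, vocab) unembedding matrix.
    Returns the top-scoring token id at each layer.
    """
    logits = rms_norm(hidden_by_layer) @ unembed
    return logits.argmax(axis=-1)

# Synthetic setup (hypothetical sizes, no real model involved).
rng = np.random.default_rng(0)
d_model, vocab, n_layers = 16, 10, 6
unembed = rng.normal(size=(d_model, vocab))
# Unit-norm columns so each token's own direction scores highest for itself.
unembed /= np.linalg.norm(unembed, axis=0, keepdims=True)

fact_id, answer_id = 3, 7  # hypothetical token ids
# Hand-built states: early layers encode the fact, late layers the answer.
hidden = np.stack([
    unembed[:, fact_id] if layer < 3 else unembed[:, answer_id]
    for layer in range(n_layers)
])

top_tokens = logit_lens(hidden, unembed)
print(top_tokens.tolist())
```

On a real model the hidden states would come from a forward pass over the filler region rather than being constructed, and the readout would typically be noisier than this hand-built case.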
Submission Number: 202