Keywords: Methods (probing, steering, causal interventions), Other, Automated interpretability
Other Keywords: Theoretical framework, Limitations of mechanistic interpretability
TL;DR: Verifiable Explanations Cannot Be Much Smaller Than the Behavior They Explain
Abstract: Interpretability often promises a small explanation of a large model. We study the harder case where the explanation must stand alone: it must take inputs, produce outputs, and be verifiable without access to the original model. In that setting, the relevant object is not just a rule but a full executable package, including any decoders, input/output adapters, region selectors, residual corrections, and certificates needed to apply and check it. We formalize the behavior to be explained as finite input–output traces and prove that any exact, verifiable explanation package can be compiled into a simulator for those traces. As a result, it cannot be much smaller, up to constant coding overheads, than the shortest program with the same boundary behavior. This identifies when apparent interpretability compression is real and when it is only hidden in omitted infrastructure: local and approximate explanations help only when the selector, disagreement set, or adapter is itself simple. Experiments on a variety of models show large accounting gaps when these omitted costs are restored. The paper offers a practical takeaway for interpretability evaluation: report the whole ledger, not just the visible artifact.
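The central bound can be sketched in symbols. The notation below is an assumption made for illustration and is not taken from the paper: $T$ is the finite set of input–output traces, $K(T)$ is the length of the shortest program that reproduces $T$, $P$ is the full explanation package, and $|\cdot|$ is description length in bits. The constant $c$ stands for the coding overhead of compiling a package into a simulator.

```latex
% Illustrative notation only; symbols are assumptions, not the paper's own.
% Package = visible rule E plus decoders D, adapters A, selectors S,
% residual corrections R, and certificates C.
\[
  |P| \;=\; |E| + |D| + |A| + |S| + |R| + |C|
\]
% An exact, verifiable package compiles into a simulator for T,
% so its total length cannot fall far below the shortest such program:
\[
  K(T) \;\le\; |P| + c
  \qquad\Longleftrightarrow\qquad
  |P| \;\ge\; K(T) - c .
\]
% Hence a small visible artifact forces the remaining terms to carry the rest:
\[
  |D| + |A| + |S| + |R| + |C| \;\ge\; K(T) - c - |E| .
\]
```

Read this way, reporting only $|E|$ understates the cost whenever the remaining terms are large, which is the accounting gap the abstract describes.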
Submission Number: 245