Keywords: causality, decodability, counting, vision transformer
TL;DR: Decodability shows what information a ViT represents, while causality shows what it uses—and our results reveal these diverge in systematic ways.
Abstract: Mechanistic interpretability seeks to uncover how internal components of neural networks give rise to predictions. A persistent challenge, however, is disentangling two often conflated notions: decodability—the recoverability of information from hidden states—and causality—the extent to which those states functionally influence outputs. In this work, we investigate their relationship in vision transformers (ViTs) fine-tuned for object counting. Using activation patching, we test the causal role of spatial and CLS tokens by transplanting activations across clean–corrupted image pairs. In parallel, we train linear probes to assess the decodability of count information at different depths. Our results reveal systematic mismatches: middle-layer object tokens exert strong causal influence despite being weakly decodable, whereas final-layer object tokens support accurate decoding yet are functionally inert. Similarly, the CLS token becomes decodable in mid-layers but only acquires causal power in the final layers. These findings highlight that decodability and causality reflect complementary dimensions of representation—what information is present versus what is used—and that their divergence can expose hidden computational circuits.
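The activation-patching procedure the abstract describes can be illustrated in a few lines of PyTorch. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes a timm-style ViT whose transformer blocks are exposed as `model.blocks` and whose block outputs are `(batch, 1 + num_patches, dim)` tensors with the CLS token at index 0. The layer index, token indices, and random tensors standing in for the clean–corrupted image pair are all hypothetical.

```python
# Minimal activation-patching sketch (assumes a timm ViT; all indices
# and inputs below are hypothetical stand-ins, not the paper's setup).
import timm
import torch

model = timm.create_model("vit_base_patch16_224", pretrained=False)
model.eval()

LAYER = 6           # a middle block to patch
TOKENS = [37, 38]   # hypothetical spatial (object) token indices

# 1. Run the clean image and cache the chosen block's output.
cache = {}
def save_hook(module, inputs, output):
    cache["clean"] = output.detach()

clean_img = torch.randn(1, 3, 224, 224)    # stand-in for a real clean image
corrupt_img = torch.randn(1, 3, 224, 224)  # stand-in for its corrupted pair

h = model.blocks[LAYER].register_forward_hook(save_hook)
with torch.no_grad():
    clean_logits = model(clean_img)
h.remove()

# 2. Re-run the corrupted image, transplanting the cached clean
#    activations into the selected tokens at the same layer.
def patch_hook(module, inputs, output):
    patched = output.clone()
    patched[:, TOKENS, :] = cache["clean"][:, TOKENS, :]
    return patched  # returning a tensor replaces the block's output

h = model.blocks[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(corrupt_img)
h.remove()

# A large shift of patched_logits toward clean_logits indicates that the
# patched tokens causally influence the model's count prediction.
```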
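Decodability is assessed separately with linear probes on hidden states at each depth. Below is a minimal sketch assuming per-layer activations have already been extracted; the feature matrix, count labels, and dimensions are synthetic stand-ins, and logistic regression is one standard choice of linear probe rather than the paper's confirmed setup.

```python
# Minimal linear-probe sketch (synthetic data; a real experiment would
# use extracted per-layer CLS or object-token activations and true counts).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 768))    # hypothetical per-image activations
counts = rng.integers(1, 6, size=500)  # hypothetical object counts 1..5

X_tr, X_te, y_tr, y_te = train_test_split(feats, counts, random_state=0)
probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")
# High held-out accuracy means count information is linearly decodable at
# this layer, which the abstract contrasts with causal influence under
# patching: the two can diverge at the same layer.
```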
Submission Number: 98