Keywords: Circuit Analysis, Attribution Graphs, Concept Discovery (e.g., SAEs, dictionary learning), Methods (probing, steering, causal interventions)
TL;DR: We show that a toy transformer solves relational retrieval using additive entity-relation address vectors, while sparse autoencoders fail to recover these composed addresses as clean features.
Abstract: Language models often need to represent which entities are bound to which attributes, as in “Alice lives in Paris. Bob lives in London.” How models construct such binding representations is poorly understood, and it remains unclear whether sparse autoencoders (SAEs) recover the binding representations that models actually use. We train a 2-layer attention-only transformer on a synthetic relational retrieval task and reverse-engineer the circuit that solves the task perfectly. We find that Layer 0 writes an approximately additive entity-relation address and a separate payload at each fact slot, while Layer 1 retrieves the matching payload by same-head query-key matching against these addresses. Linear probes decode the joint address with 100% accuracy, an additive decomposition explains 99.8% of its variance, and causal patches over the address flip predictions to a distractor. However, SAEs trained on the same activation site do not recover the joint address as clean individual features, despite reconstructions preserving full task accuracy. This provides a concrete example of a composed representation that is linearly decodable and causally used, yet not cleanly exposed as sparse features by an SAE.
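To make the "additive decomposition explains 99.8% of its variance" claim concrete, the sketch below shows one common way such a figure could be computed. It is not the authors' code. It assumes the address vectors are collected into an array indexed by (entity, relation), fits a grand mean plus per-entity and per-relation offsets, and reports the fraction of variance that fit explains. The function name `additive_variance_explained`, the array shapes, and the synthetic data are all hypothetical and serve only as illustration.

```python
import numpy as np

def additive_variance_explained(addr):
    """Fraction of variance captured by an additive fit.

    addr: array of shape (n_entities, n_relations, d_model), one
    (hypothetical) address vector per entity-relation pair.
    """
    mu = addr.mean(axis=(0, 1), keepdims=True)    # grand mean
    ent = addr.mean(axis=1, keepdims=True) - mu   # per-entity offset
    rel = addr.mean(axis=0, keepdims=True) - mu   # per-relation offset
    fit = mu + ent + rel                          # additive reconstruction
    ss_res = np.sum((addr - fit) ** 2)            # unexplained variation
    ss_tot = np.sum((addr - mu) ** 2)             # total variation
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in data (an assumption, not results from the paper):
# an entity part plus a relation part plus a little noise.
rng = np.random.default_rng(0)
n_e, n_r, d = 8, 4, 16
e_vecs = rng.normal(size=(n_e, 1, d))
r_vecs = rng.normal(size=(1, n_r, d))
noise = 0.01 * rng.normal(size=(n_e, n_r, d))
synthetic = e_vecs + r_vecs + noise

score = additive_variance_explained(synthetic)
print(f"additive variance explained: {score:.4f}")
```

For a fully crossed set of entity-relation pairs, the row and column means give the least-squares additive fit, so a score near 1 indicates that little of the address is left to an entity-relation interaction term.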
Submission Number: 566