Keywords: Interpretability for AI Safety, Interpretability for Knowledge Discovery, Methods (probing, steering, causal interventions)
Other Keywords: causal interventions, activation patching, factual recall, representation analysis, entity commitment
TL;DR: We show that relation information becomes causally active at the final-token position before entity-specific answer information, even though entity information is already available earlier at the entity-token position.
Abstract: We ask whether relation-type information (e.g., capital-of) and entity-specific information (e.g., France→Paris) become causally active at the final-token position at the same depth during factual recall. Using four complementary causal diagnostics across four decoder-only models and eight prompt families, we find a robust temporal asymmetry: relation information becomes generation-controlling before entity information does. Relation onset precedes entity onset by 10–16 tested layers (31–44% of network depth) at threshold 0.4, and the ordering holds across all 16 model-threshold combinations for thresholds 0.2–0.5. Critically, entity information is not absent early: entity-token patching succeeds at 90–100% in early layers. Instead, entity commitment to generation is deferred: entity information is available at the entity-token position but becomes generation-controlling at the final token only after being routed there.
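To make the notion of an onset threshold concrete, here is a minimal sketch of how an onset gap could be read off per-layer patching effects. It assumes that onset means the first tested layer whose normalized patching effect reaches the threshold; the abstract does not state this definition, so treat it as an assumption. The function names `onset_layer` and `onset_gap` and the effect values are hypothetical and are not results from the paper.

```python
from typing import Optional, Sequence


def onset_layer(effects: Sequence[float], threshold: float = 0.4) -> Optional[int]:
    """Index of the first tested layer whose effect reaches the threshold.

    Assumed definition of "onset"; returns None if the threshold is never reached.
    """
    for index, effect in enumerate(effects):
        if effect >= threshold:
            return index
    return None


def onset_gap(
    relation_effects: Sequence[float],
    entity_effects: Sequence[float],
    threshold: float = 0.4,
) -> Optional[int]:
    """Entity onset minus relation onset, in tested layers.

    A positive gap means relation information crosses the threshold first.
    """
    relation_onset = onset_layer(relation_effects, threshold)
    entity_onset = onset_layer(entity_effects, threshold)
    if relation_onset is None or entity_onset is None:
        return None
    return entity_onset - relation_onset


# Made-up final-token patching effects over six tested layers (illustration only).
relation_effects = [0.0, 0.1, 0.5, 0.8, 0.9, 0.9]
entity_effects = [0.0, 0.0, 0.1, 0.2, 0.6, 0.9]

# Sweep the same threshold range the abstract reports (0.2 to 0.5).
gaps = {t: onset_gap(relation_effects, entity_effects, t) for t in (0.2, 0.3, 0.4, 0.5)}
```

Sweeping the threshold, as in the last line, is one way to check that an ordering does not depend on a single cutoff.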
Submission Number: 416