Track: Regular Track (Page limit: 6-8 pages)
Keywords: emergent capabilities; sparse autoencoders; co-activation graphs; representational topology; grokking; transformer training dynamics
TL;DR: Is sparse feature topology temporally correlated with test accuracy?
Abstract: Reports of “emergent” capabilities in transformer-based LLMs (abrupt, non-linear improvements in task performance) remain controversial due to post-hoc measurement and ambiguous definitions. We investigate whether such transitions can be predicted pre-hoc from internal representations. For each training checkpoint, we train sparse autoencoders (SAEs) on model activations, construct a co-activation graph over SAE features, and track global graph statistics (e.g., density, clustering). We analyze a 2-layer grokking-style transformer, aggregating graph metrics over eight SAE initializations, and test lead-lag relationships between changes in graph metrics and subsequent changes in accuracy under a formalized emergence criterion. Across these analyses, we find no statistically significant evidence that global co-activation topology forecasts emergent jumps in performance. If pre-hoc indicators exist, they may lie outside the global graph measures analyzed here (e.g., in task-specific circuits or localized subgraphs).
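The pipeline described in the abstract can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code: the activation matrix, firing criterion, and co-activation threshold are all assumptions made for the example.

```python
# Minimal sketch (assumed pipeline): build a co-activation graph over SAE
# features at one checkpoint and compute the global statistics tracked
# across training (density, clustering).
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)

# Assumed stand-in for SAE feature activations on a batch of tokens,
# shape (n_tokens, n_features); a feature "fires" when its code is nonzero.
fires = rng.random((500, 64)) < 0.1

# Co-activation counts: how often features i and j fire on the same token.
co = fires.T.astype(int) @ fires.astype(int)
np.fill_diagonal(co, 0)

# Add an edge when co-activation exceeds an (assumed) threshold.
threshold = 5
G = nx.from_numpy_array((co >= threshold).astype(int))

# Global graph statistics, one row per checkpoint in the full analysis.
stats = {
    "density": nx.density(G),
    "clustering": nx.average_clustering(G),
}
```

In the study these statistics would be recomputed at every checkpoint (and averaged over SAE initializations), yielding time series that can be tested for lead-lag structure against the accuracy curve.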
Submission Number: 20