Abstract: Despite significant progress in transformer interpretability, understanding the computational mechanisms of large language models (LLMs) remains a fundamental challenge. We demonstrate that the inference operation of LLMs can be mapped to an equivalent linear system that nearly exactly reconstructs the predicted output embedding for a given input sequence. Extending techniques from image diffusion models that exhibit local or piecewise linearity, we strategically detach components of the gradient computation with respect to an input sequence for next-token prediction, such that the Jacobian of the model reproduces the output with one linear operation per input token. We demonstrate this approach across models including Qwen 3, Gemma 3, Llama 3, Phi 4, Mistral Ministral, and OLMo 2, up to Llama 3.3 70B Q4. Using the singular value decomposition of the detached Jacobian, we show that these LLMs operate in extremely low-dimensional subspaces in which the largest singular vectors decode to distinct concepts related to possible output tokens. We examine the equivalent linear operation of each successive layer (and of its attention and multilayer perceptron components) and observe the emergence of semantic concepts. Finally, we show that the detached Jacobian of middle-layer representations can serve as a steering operator to insert semantic concepts into unrelated text, which could be useful for improving safety and reducing bias. Despite their expressive power and global nonlinearity, modern LLMs can be interpreted through locally linear decompositions that provide insight into their internal representations and reveal interpretable semantic structure in the next-token prediction process.
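To make the detached-Jacobian idea concrete, the sketch below illustrates one plausible reading of it on a toy, bias-free block (RMSNorm followed by a SiLU MLP); the block, its shapes, and the specific choice of which factors to detach are illustrative assumptions, not the authors' code or the architectures evaluated in the paper. Detaching the data-dependent factors (the RMS denominator and the sigmoid gate) makes the block exactly linear at the given input, so its Jacobian reconstructs the output in a single matrix-vector product, and its SVD can then be inspected.

```python
# Minimal sketch (assumed, not the authors' implementation) of a locally
# linear reconstruction via a "detached" Jacobian, plus its SVD.
import torch

torch.manual_seed(0)
d_model, d_hidden = 8, 16

W_in = torch.randn(d_hidden, d_model) / d_model**0.5   # bias-free projections
W_out = torch.randn(d_model, d_hidden) / d_hidden**0.5
gamma = torch.randn(d_model)                            # RMSNorm scale


def block(x: torch.Tensor) -> torch.Tensor:
    """Toy bias-free block with its nonlinear factors detached from autograd."""
    # RMSNorm with a *detached* denominator: for autograd this is a linear map.
    rms = x.pow(2).mean(-1, keepdim=True).add(1e-6).sqrt()
    h = x / rms.detach() * gamma
    # SiLU written as x * sigmoid(x), with the sigmoid gate *detached*.
    pre = h @ W_in.T
    act = pre * torch.sigmoid(pre).detach()
    return act @ W_out.T


x = torch.randn(d_model)
y = block(x)

# Jacobian of the block at x; with the detached factors held fixed, the block
# is a composition of bias-free linear maps evaluated at this same x.
J = torch.autograd.functional.jacobian(block, x)

# The Jacobian reproduces the output with one linear operation.
print(torch.allclose(J @ x, y, atol=1e-5))  # True (up to numerical error)

# SVD of the detached Jacobian: the paper examines the largest singular vectors.
U, S, Vh = torch.linalg.svd(J)
print(S[:4])  # the spectrum is typically dominated by a few directions
```

In a full LLM the same check would compare the detached Jacobian applied to the input-token embeddings against the predicted output embedding, with one such linear operation per input token.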
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Shay_B_Cohen1
Submission Number: 5233