Abstract: Despite significant progress in transformer interpretability, understanding the computational mechanisms of large language models (LLMs) remains a fundamental challenge. Many approaches interpret a network's hidden representations but remain agnostic about how those representations are generated. We address this by mapping LLM inference for a given input sequence to an equivalent and interpretable linear system that reconstructs the predicted output embedding with relative error below $10^{-13}$ in double floating-point precision, requiring no additional model training. We exploit a property of transformer decoders wherein every operation (gated activations, attention, and normalization) can be expressed as $A(x) \cdot x$, where $A(x)$ is an input-dependent linear transform and $x$ preserves the linear pathway. To expose this linear structure, we strategically detach components of the gradient computation with respect to an input sequence, freezing the $A(x)$ terms at their values computed during inference, so that the Jacobian yields an equivalent linear mapping. This ``detached'' Jacobian of the model reconstructs the output with one linear operator per input token, which we demonstrate for Qwen 3, Gemma 3, and Llama 3 at scales up to Qwen 3 14B. These linear representations show that LLMs operate in extremely low-dimensional subspaces whose singular vectors can be decoded into interpretable semantic concepts. The computation of each intermediate output also has a linear equivalent; we examine how the linear representations of individual layers and their attention and multilayer perceptron modules build predictions, and we use these representations as steering operators to insert semantic concepts into unrelated text. Despite their expressive power and global nonlinearity, modern LLMs can thus be interpreted through equivalent linear representations that reveal low-dimensional semantic structure in the next-token prediction process. Code is available at https://github.com/jamesgolden1/equivalent-linear-LLMs/.
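The detachment described in the abstract can be illustrated with a toy example. The sketch below is hypothetical and is not taken from the released code: for a gated operation $a(x) \odot Wx$, detaching $a(x)$ freezes the input-dependent coefficient at its inference-time value, and the Jacobian of the detached function then reproduces the output exactly as a single linear map.

```python
# Minimal sketch (toy example, not the paper's code): a gated layer y = a(x) * (W @ x),
# where a(x) = sigmoid(V @ x) plays the role of the input-dependent A(x).
# Detaching a(x) freezes it at its inference-time value, so the Jacobian of the
# detached forward pass reconstructs y exactly as J @ x.
import torch

torch.manual_seed(0)
d = 8
W = torch.randn(d, d, dtype=torch.float64)
V = torch.randn(d, d, dtype=torch.float64)
x = torch.randn(d, dtype=torch.float64)

def forward(x):
    return torch.sigmoid(V @ x) * (W @ x)           # fully nonlinear

def forward_detached(x):
    return torch.sigmoid(V @ x).detach() * (W @ x)  # A(x) frozen; linear in x

y = forward(x)
J = torch.autograd.functional.jacobian(forward_detached, x)
print(torch.allclose(J @ x, y, rtol=1e-14, atol=1e-14))  # True: exact linear reconstruction
```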
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Camera ready version posted Oct 6:
- Added GitHub repo URL to the abstract
- Combined architecture diagram and linear reconstruction error figure in Fig. 1
- Added supplementary figure with Qwen, Llama and Gemma reconstruction error and SVD
- Table 1 trimmed and caption expanded
- Tables 3, 4 and 5 added in supplement (longer but readable versions for three tokens for Qwen, Llama and Gemma)
- Additional references added
Revision posted Sep 3:
- For the 100-example comparison between Llama 3 and Qwen 3, added Fig. 4 (see page 9) showing the distribution of stable ranks of the detached Jacobian matrices as a function of token position; this confirms their low-rank structure across many examples (a brief stable-rank sketch follows this revision entry).
- Refined the main-text section on the interpretation of singular vectors from the 100 examples; quantified how many examples fit each categorical observation, with details and examples provided in the appendix.
- Added several sentences to the introduction and conclusion about how this technique could be applied at scale for next-token prediction across entire datasets in future work
- Removed tables at end of appendix
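As a point of reference for the stable ranks reported in Fig. 4, here is a minimal sketch of the standard definition (squared Frobenius norm divided by squared spectral norm), which we assume is the one used in the paper; the matrix below is a random stand-in, not a detached Jacobian from the paper's pipeline.

```python
# Minimal sketch: stable rank of a matrix J, defined as ||J||_F^2 / sigma_max(J)^2,
# a smooth proxy for rank.
import torch

def stable_rank(J: torch.Tensor) -> float:
    fro_sq = torch.linalg.matrix_norm(J, ord="fro") ** 2
    spec_sq = torch.linalg.matrix_norm(J, ord=2) ** 2  # largest singular value, squared
    return (fro_sq / spec_sq).item()

J = torch.randn(4096, 4096, dtype=torch.float64)       # random stand-in matrix
print(stable_rank(J))
```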
Revision posted Sep 1:
- Expanded and improved analysis over 100 examples; expanded section in main text, added extra examples illustrating each finding in appendix section A4
- Slight revision to the abstract, giving a more intuitive description of the gradient detachment strategy:
"Our approach exploits a fundamental structural property of transformer architectures wherein every operation (gated activations, attention, and normalization) can be expressed as $A(x) \cdot x$, where $A(x)$ represents an input-dependent coefficient matrix and $x$ preserves the linear pathway. To expose this linear structure, we strategically detach components of the gradient computation with respect to an input sequence, freezing the $A(x)$ terms at their values computed during inference. This ``detached’’ Jacobian of the model reconstructs the output with one linear operation per input token."
Revision posted Aug 30:
- Added section "Analysis of the SVD across models" based on 100 examples of the detached Jacobian for Llama 3.2 3B and Qwen 3 4B. Added supplementary figure A4.
Revision posted Aug 17:
- Included the supplementary folder "results_100_input_sequences" containing PDF result plots for 100 input sequences with Llama 3.2 3B
- Based on these results, updated the precision claim in the main text to torch.allclose returning True with rtol and atol set to 1e-14; ~60% of the sequences pass at 1e-15, and all pass at 1e-14.
- Added a Lanczos demonstration for Gemma 3 4B in JAX under notebooks/gemma3/.
Revision posted Aug 16:
- Simplified and cleaned up figures and tables; moved some to supplemental information
- Prompted by several reviewers, examined reconstruction accuracy at float64 precision and found very small relative reconstruction error for large models such as Qwen 3 14B and Gemma 3 12B (torch.allclose returning True with rtol and atol set to 1e-15). Given the request to move away from the "local" terminology and the low reconstruction error, shifted the focus of the paper to "equivalent" linear mappings; this also addresses reviewer requests for a better justification of the detached Jacobian.
- Removed mention of safety applications
- Added mention of a successful Lanczos implementation for the top-k singular vectors that avoids forming the full Jacobian matrix, demonstrated for Gemma 3 4B in JAX using the "matfree" package; this enables efficient computation of singular vectors for a 400-token input sequence on 40 GB of VRAM.
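The Lanczos point above rests on the fact that the detached Jacobian's leading singular vectors can be obtained from Jacobian-vector and vector-Jacobian products alone, without ever materializing the matrix. The sketch below illustrates that matrix-free principle with simple power iteration in JAX rather than the matfree-based Lanczos code; the function f and input x are hypothetical placeholders, not the paper's detached forward pass.

```python
# Minimal sketch of the matrix-free idea (power iteration, not the matfree Lanczos code):
# estimate the top singular triplet of the Jacobian of `fun` at x using only
# jvp/vjp products, never forming J explicitly.
import jax
import jax.numpy as jnp

def f(x):                                # hypothetical stand-in map, R^512 -> R^512
    return jnp.tanh(x) * x.sum()

x = jnp.linspace(-1.0, 1.0, 512)

def top_singular_triplet(fun, x, iters=100):
    v = jnp.ones_like(x) / jnp.sqrt(x.size)   # right singular vector estimate
    for _ in range(iters):
        u = jax.jvp(fun, (x,), (v,))[1]       # u = J v  (Jacobian-vector product)
        u = u / jnp.linalg.norm(u)
        v = jax.vjp(fun, x)[1](u)[0]          # v = J^T u (vector-Jacobian product)
        sigma = jnp.linalg.norm(v)            # converges to the leading singular value
        v = v / sigma
    return u, sigma, v

u, sigma, v = top_singular_triplet(f, x)
print(sigma)  # matrix-free estimate of the Jacobian's leading singular value
```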
Code: https://github.com/jamesgolden1/equivalent-linear-LLMs
Supplementary Material: zip
Assigned Action Editor: ~Shay_B_Cohen1
Submission Number: 5233