Keywords: transformer, decomposition, interpretability, neural-symbolic, n-grams, XAI
TL;DR: After training, LLM computations become deeply entangled. For interpretability, we introduce a knife-like operator that cuts through this entanglement, separating the part we care about from the remainder and enabling scalable model inspection.
Abstract: Large language models are becoming general knowledge engines for diverse applications. However, their computations are deeply entangled after training, resisting modularization, which complicates interpretability, auditing, and long-term maintenance. We introduce Jet Expansions, a framework for expanding computational graphs using jet operators that generalize truncated Taylor series. Our method systematically decomposes language models into explicit input-to-output computational paths and complementary remainders. This functional decomposition provides a principled, knife-like operator for cutting through entanglement in LLMs, enabling scalable model inspection. We demonstrate how Jet Expansions ground and subsume the popular interpretability technique Logit Lens, reveal a (super-)exponential path structure with respect to recursive residual depth, and support several interpretability applications, including sketching a transformer language model with $n$-gram statistics extracted from its computations and indexing model toxicity levels *without* curated benchmarks.
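For orientation, the classical object that the abstract says jet operators generalize is the order-$k$ truncated Taylor series together with its remainder. The display below is a generic sketch of that standard notion (not the paper's exact operator); the symbols $J^k_{x_0}$ and $R^k_{x_0}$ are illustrative notation for the truncated expansion of $f$ at a point $x_0$ and its complementary remainder:

$$
J^k_{x_0}[f](x) \;=\; \sum_{i=0}^{k} \frac{f^{(i)}(x_0)}{i!}\,(x - x_0)^i,
\qquad
f(x) \;=\; \underbrace{J^k_{x_0}[f](x)}_{\text{explicit term}} \;+\; \underbrace{R^k_{x_0}[f](x)}_{\text{remainder}}.
$$

In this spirit, the abstract's "explicit input-to-output computational paths and complementary remainders" mirror the split between the expanded term and the remainder.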
Supplementary Material: pdf
Primary Area: interpretability and explainable AI
Submission Number: 25587