Symbolic vs. Continuous Features in Transformers: A Digital Communication System's Explanation

Published: 30 Sept 2025, Last Modified: 30 Sept 2025. Mech Interp Workshop (NeurIPS 2025) Poster. License: CC BY 4.0
Keywords: Understanding high-level properties of models, Other
TL;DR: An educational distillation clarifying the "feature" concept by modeling transformers as communication systems: symbolic information (linguistic properties) is transmitted through continuous neural signals (basis vectors) via attention routing.
Abstract: The term "feature" in mechanistic interpretability is ambiguous: it sometimes refers to symbolic properties (e.g., grammatical number) and sometimes to neural activations (e.g., basis vectors). We clarify this distinction using communication theory: symbolic features are the information being transmitted, while neural features are the signals carrying that information. Through a toy transformer implementing subject-verb agreement, we demonstrate how linguistic properties can be encoded as orthogonal basis vectors, transmitted via attention, and decoded for grammatical decisions. This educational distillation provides a communication-theoretic lens for understanding transformer internals, offering conceptual clarity for mechanistic interpretability.
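To make the encode/transmit/decode picture concrete, here is a minimal NumPy sketch, not the paper's actual toy transformer: a symbolic property (grammatical number) is written into an orthogonal direction, routed from the subject position to the verb position by a hand-set attention pattern, and read back with a dot-product decoder. All names, dimensions, and weights below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # illustrative residual-stream width

# Orthogonal "channel" directions carrying the symbolic feature (grammatical number).
v_singular = np.zeros(d_model); v_singular[0] = 1.0
v_plural = np.zeros(d_model); v_plural[1] = 1.0

def encode(number: str) -> np.ndarray:
    """Encode the symbolic property as a continuous signal (a basis vector)."""
    return v_singular if number == "singular" else v_plural

# A two-token "sentence": subject at position 0, verb at position 1.
subject_number = "plural"
x = np.stack([
    encode(subject_number) + 0.1 * rng.standard_normal(d_model),  # noisy subject signal
    0.1 * rng.standard_normal(d_model),                           # verb: no number info yet
])

# Attention routing: the verb attends to the subject and copies its signal.
# Scores are hand-set so the verb's weight on position 0 (the subject) is high.
scores = np.array([[0.0, 0.0],
                   [5.0, 0.0]])
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
transmitted = attn @ x  # the verb position now carries the subject's number signal

# Decoding: project the verb's state onto the two channel directions.
verb_state = transmitted[1]
readout = {"singular": verb_state @ v_singular, "plural": verb_state @ v_plural}
decoded = max(readout, key=readout.get)
print(f"subject number: {subject_number}, decoded at verb: {decoded}")
```

Run as-is, this prints a decoded label matching the subject's number, illustrating the communication-system framing: the symbolic feature is the message, the basis vector is the signal, and attention is the channel.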
Submission Number: 294