Toy Models of Combinatorial Interpretability

Published: 02 Mar 2026, Last Modified: 02 Mar 2026, Venue: Sci4DL 2026, License: CC BY 4.0
Keywords: Mechanistic interpretability, Combinatorial interpretability, Features, Scalability, Trustworthy machine learning
TL;DR: We introduce combinatorial interpretability, a framework for understanding neural computation via static weight-matrix analysis, and show that neural networks compute Boolean functions through shared cross-neuron codes.
Abstract: We introduce combinatorial interpretability, a methodology for understanding neural computation by analyzing the combinatorial structures in the sign-based categorization of a network's weights and biases. We demonstrate its power through feature channel coding, a theory that explains how neural networks compute Boolean expressions and that may underlie other categories of neural network computation. According to this theory, features are computed via feature channels: unique cross-neuron encodings shared among the inputs the feature operates on. Because different feature channels share neurons, the neurons are polysemantic and the channels interfere with one another, making the computation appear inscrutable. We show how to decipher these computations by analyzing a network's feature channel coding, offering complete mechanistic interpretations of several small neural networks that were trained with gradient descent. Crucially, this is achieved via static combinatorial analysis of the weight matrices, without examining activations or training new autoencoding networks. It also allows us, for the first time, to quantify exactly and explain the relationship between a network's parameter size and its computational capacity (the set of features it can compute with low error), a relationship that is implicitly at the core of many modern scaling laws.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 15
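
Illustrative sketch: the abstract describes features being computed through cross-neuron codes shared among a feature's inputs, and recovered by static, sign-based analysis of the weights. The Python snippet below is a minimal, hand-built toy of that idea and is not the authors' construction or code. The weight matrix `W1`, the bias `B1`, the helper names `input_codes`, `shared_channels`, and `forward`, and the readout rule (+1 on channel neurons, -1 elsewhere, threshold 1.5) are all assumptions chosen for illustration. It reads each input's code off the sign pattern of the weights alone, groups inputs with identical codes into candidate channels, and then checks exhaustively that each channel computes the AND of its inputs even though neuron 1 is shared by both channels.

```python
from itertools import combinations, product

import numpy as np

INPUTS = ["a", "b", "c", "d"]

# Hypothetical first-layer weights (rows = hidden neurons, columns = inputs).
# These values are hand-picked for illustration, not taken from a trained network.
W1 = np.array([
    [1.0, 1.0, 0.0, 0.0],  # neuron 0: positive on a, b
    [1.0, 1.0, 1.0, 1.0],  # neuron 1: positive on all inputs (polysemantic)
    [0.0, 0.0, 1.0, 1.0],  # neuron 2: positive on c, d
])
B1 = np.array([-1.0, -1.0, -1.0])  # assumed negative biases


def input_codes(weights):
    """Map each input to the set of neurons on which its weight is positive."""
    signs = np.sign(weights)
    return {
        name: frozenset(np.flatnonzero(signs[:, j] > 0).tolist())
        for j, name in enumerate(INPUTS)
    }


def shared_channels(codes):
    """Return input pairs whose sign-based codes coincide, with the shared neurons."""
    return {
        (u, v): sorted(codes[u])
        for u, v in combinations(INPUTS, 2)
        if codes[u] == codes[v]
    }


def forward(x, channel):
    """Evaluate one feature: ReLU layer, then +1/-1 readout over the channel."""
    h = np.maximum(W1 @ np.asarray(x, dtype=float) + B1, 0.0)
    readout = np.array([1.0 if n in channel else -1.0 for n in range(len(h))])
    return bool(readout @ h > 1.5)  # assumed decision threshold


# Static analysis: only the weight signs are inspected, no activations.
codes = input_codes(W1)
channels = shared_channels(codes)
print(channels)

# Exhaustive check over all 16 Boolean inputs that each channel computes an AND.
for x in product([0, 1], repeat=4):
    assert forward(x, channels[("a", "b")]) == bool(x[0] and x[1])
    assert forward(x, channels[("c", "d")]) == bool(x[2] and x[3])
```

In this toy, the two channels overlap on neuron 1, so that neuron alone does not identify either feature. The sign pattern across all three neurons is what separates them.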