Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers

Published: 11 Jun 2026, Last Modified: 20 Jun 2026. Venue: Mech Interp Workshop ICML 2026 Virtual poster. License: CC BY 4.0
Keywords: Feature Geometry, Methods (probing, steering, causal interventions), Circuit Analysis, Attribution Graphs
Other Keywords: mixture of experts, gated linear units, routing, algorithmic tasks
TL;DR: MoE FFNs make small Transformers shift computation into attention, primarily due to architectural sparsity rather than learned routing.
Abstract: Architectural choices inside the Transformer feedforward network (FFN) block do not merely affect the block itself; they reshape the computations learned by the rest of the model. We study this effect in one-layer Transformers trained on digit addition with carry, modular arithmetic, and histogram counting. Comparing dense FFNs, gated linear units (GLUs), mixture-of-experts (MoE), and MoE-GLUs, we find that sparse MoE routing can shift computation from FFN to attention, with the strongest ablation-visible effect on carry-based addition. We decompose this redistribution into reduced per-token FFN capacity and sparse partitioning across experts. Critically, frozen random routing nearly matches learned routing, suggesting that redistribution is driven largely by architectural sparsity rather than router-learned specialization. As a secondary finding, GLU-style multiplicative gating rotates task-relevant Fourier structure out of the per-neuron basis and into distributed subspaces, making neuron-level interpretability less informative while preserving structured computation. We validate these conclusions with random-routing, narrow-FFN, and top-2 MoE controls, plus parameter-matching, activation-function, and width-scaling analyses. Together, these results show that local FFN design choices can have nonlocal consequences for Transformer computation.
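As a rough illustration of the frozen-random-routing control described in the abstract, the sketch below builds a top-1 MoE FFN whose router is sampled once and never updated. It is not the authors' implementation: the class name `Top1MoE`, the ReLU activation, the dimensions, and the even split of hidden width across experts are all assumptions made for illustration, and the usual gate-probability scaling of the expert output is omitted for brevity. It aims to make the two ingredients of the decomposition concrete: each token sees only `d_ff_total / n_experts` hidden units (reduced per-token capacity), and tokens are split across disjoint experts (sparse partitioning).

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Gaussian-initialised matrix as a list of rows."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def relu(v):
    return [max(0.0, a) for a in v]

class Top1MoE:
    """Hypothetical top-1 MoE FFN with a frozen random router (illustrative only)."""

    def __init__(self, d_model, d_ff_total, n_experts, seed=0):
        rng = random.Random(seed)
        self.n_experts = n_experts
        # Assumption: total hidden width is split evenly across experts.
        self.d_ff_per_expert = d_ff_total // n_experts
        # The router is sampled once here and never trained ("frozen").
        self.router = rand_matrix(n_experts, d_model, rng, scale=1.0)
        self.w_in = [rand_matrix(self.d_ff_per_expert, d_model, rng)
                     for _ in range(n_experts)]
        self.w_out = [rand_matrix(d_model, self.d_ff_per_expert, rng)
                      for _ in range(n_experts)]

    def route(self, x):
        """Return the index of the expert with the largest router logit."""
        logits = matvec(self.router, x)
        return max(range(self.n_experts), key=lambda e: logits[e])

    def forward(self, x):
        """Apply only the selected expert; return (output, expert, hidden width used)."""
        e = self.route(x)
        hidden = relu(matvec(self.w_in[e], x))
        return matvec(self.w_out[e], hidden), e, len(hidden)

# Toy usage: 16 random "tokens" of width 8, 4 experts sharing 32 hidden units.
moe = Top1MoE(d_model=8, d_ff_total=32, n_experts=4)
token_rng = random.Random(1)
tokens = [[token_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
results = [moe.forward(t) for t in tokens]
experts_used = [e for _, e, _ in results]
```

In a learned-routing variant, `self.router` would be trained jointly with the experts. Comparing the two isolates how much of an effect comes from the sparse architecture alone, which is the comparison the abstract reports.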
Submission Number: 556