What Do Mixture-of-Depth Routers Learn? Routing Patterns in Gemma 2

Stein Pleiter; Ege Erdogan; Ana Lucic

Published: 11 Jun 2026, Last Modified: 11 Jun 2026. Venue: Mech Interp Workshop ICML 2026 Virtual (poster). License: CC BY 4.0

Keywords: Methods (probing, steering, causal interventions), Applications of interpretability, Other

Other Keywords: Mixture-of-Depths, conditional computation, alternating attention, part-of-speech probing

TL;DR: MoD routers trained on Gemma 2 2B prefer full-attention to sliding-window layers and route by part-of-speech in a layer-dependent way.

Abstract: Mixture-of-Depths (MoD) routers improve the efficiency of transformer models by learning which tokens to process and which to bypass at each layer. However, their learned routing patterns have not been characterized in alternating-attention architectures, which mix local attention with sparse global attention to increase efficiency. We evaluate routing by training token-level MoD routers on Gemma 2 2B and find that the mean routing rate is significantly higher in full-attention layers than in sliding-window layers. We also find that a token’s part of speech is correlated with whether the router skips or processes it, and that the same category is often routed differently at different layers: for example, determiners are preferentially skipped at the shallowest target layer but preserved at deeper ones. Because these part-of-speech patterns hold within single layers, they are not subject to the confound (perfect coincidence of attention type and layer parity in Gemma 2) that limits our routing-rate result.
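
The two quantities the abstract discusses, a layer's mean routing rate and its per-part-of-speech routing rate, can be illustrated with a minimal sketch. The abstract does not state the routing rule, so the sigmoid-threshold decision below is an assumption, and the function names, logits, and tags are hypothetical toy values rather than results from the paper.

```python
import math
from collections import defaultdict

def route(logits, threshold=0.5):
    """Assumed rule: process a token when sigmoid(logit) exceeds the threshold.

    True means the layer processes the token; False means the token
    bypasses the layer through the residual connection.
    """
    return [1.0 / (1.0 + math.exp(-z)) > threshold for z in logits]

def routing_rate(decisions):
    """Fraction of tokens that the layer processes."""
    return sum(decisions) / len(decisions)

def rate_by_pos(decisions, pos_tags):
    """Routing rate computed separately for each part-of-speech tag."""
    counts = defaultdict(lambda: [0, 0])  # tag -> [processed, total]
    for processed, tag in zip(decisions, pos_tags):
        counts[tag][0] += int(processed)
        counts[tag][1] += 1
    return {tag: p / n for tag, (p, n) in counts.items()}

# Hypothetical router logits for four tokens at a single layer.
logits = [-2.0, 1.5, 0.3, -0.7]
pos_tags = ["DET", "NOUN", "VERB", "DET"]

decisions = route(logits)
layer_rate = routing_rate(decisions)
per_pos = rate_by_pos(decisions, pos_tags)
```

Computing `rate_by_pos` within one layer at a time keeps attention type fixed, which is why per-layer part-of-speech comparisons avoid the layer-parity confound noted in the abstract.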

Submission Number: 716
