Keywords: Methods (probing, steering, causal interventions), Applications of interpretability, Other
Other Keywords: Mixture-of-Depths, conditional computation, alternating attention, part-of-speech probing
TL;DR: MoD routers trained on Gemma 2 2B prefer full-attention to sliding-window layers and route by part-of-speech in a layer-dependent way.
Abstract: Mixture-of-Depths (MoD) routers improve the efficiency of transformer models by learning which tokens to process and which to bypass at each layer. However, their learned routing patterns have not been characterized in alternating-attention architectures, which interleave local (sliding-window) attention layers with less frequent global (full-attention) layers to increase efficiency. We evaluate routing by training token-level MoD routers on Gemma 2 2B and find that the mean routing rate is significantly higher in full-attention layers than in sliding-window layers. We also find that a token’s part of speech is correlated with whether the router skips or processes it, and that the same category is often routed differently at different layers: for example, determiners are preferentially skipped at the shallowest target layer but preserved at deeper ones. Because these part-of-speech patterns hold within single layers, they are not subject to the confound that limits our routing-rate result, namely that attention type coincides perfectly with layer parity in Gemma 2.
Submission Number: 716