On the Spatial Structure of Mixture-of-Experts in Transformers

Published: 05 Mar 2025, Last Modified: 06 Apr 2025
Venue: SLLM
License: CC BY 4.0
Track: tiny / short paper (up to 4 pages)
Keywords: MoE, Mixture-of-Experts, Load Balancing, Transformers, LLM, RoPE, PE, Positional Encodings
TL;DR: MoE-based transformers leverage both semantic and positional token information for routing, challenging conventional assumptions and offering insights for more efficient model design.
Abstract: A common assumption is that MoE routers primarily leverage semantic features for expert selection. However, our study challenges this notion by demonstrating that positional token information also plays a crucial role in routing decisions. Through extensive empirical analysis, we provide evidence supporting this hypothesis, develop a phenomenological explanation of the observed behavior, and discuss practical implications for MoE-based architectures.
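For context, the sketch below shows a standard top-k MoE gate (a hypothetical illustration, not the paper's implementation). It highlights the mechanism the abstract refers to: the router scores each token's hidden state, and since that hidden state already mixes semantic content with positional information (e.g. via RoPE applied in the attention layers), the gate can in principle condition its routing decisions on token position as well as meaning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Minimal top-k MoE gate (illustrative sketch only).

    Scores each token's hidden state against every expert and keeps the
    top-k experts per token. Positional information reaches the gate only
    through the hidden state itself.
    """

    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, seq_len, d_model)
        logits = self.gate(hidden)                    # (batch, seq_len, n_experts)
        topk_logits, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_logits, dim=-1)      # renormalize over the selected experts
        return weights, topk_idx                      # combine weights and expert indices

# Usage: route a toy batch of 16 token representations to 2 of 8 experts.
router = TopKRouter(d_model=64, n_experts=8, k=2)
weights, expert_ids = router(torch.randn(1, 16, 64))
print(weights.shape, expert_ids.shape)  # torch.Size([1, 16, 2]) torch.Size([1, 16, 2])
```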
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 66