On the Spatial Structure of Mixture-of-Experts in Transformers

Published: 05 Mar 2025, Last Modified: 06 Apr 2025
Venue: SLLM
License: CC BY 4.0
Track: tiny / short paper (up to 4 pages)
Keywords: MoE, Mixture-of-Experts, Load Balancing, Transformers, LLM, RoPE, PE, Positional Encodings
TL;DR: MoE-based transformers leverage both semantic and positional token information for routing, challenging conventional assumptions and offering insights for more efficient model design.
Abstract: A common assumption is that MoE routers primarily leverage semantic features for expert selection. However, our study challenges this notion by demonstrating that positional token information also plays a crucial role in routing decisions. Through extensive empirical analysis, we provide evidence supporting this hypothesis, develop a phenomenological explanation of the observed behavior, and discuss practical implications for MoE-based architectures.
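For context, the sketch below shows a standard top-k MoE gate (a hypothetical illustration, not the paper's implementation). It highlights the mechanism the abstract refers to: the router scores each token's hidden state, and since that hidden state already mixes semantic content with positional information (e.g. via RoPE applied in the attention layers), the gate can in principle condition its routing decisions on token position as well as meaning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Minimal top-k MoE gate (illustrative sketch only).

    Scores each token's hidden state against every expert and keeps the
    top-k experts per token. Positional information reaches the gate only
    through the hidden state itself.
    """

    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, seq_len, d_model)
        logits = self.gate(hidden)                    # (batch, seq_len, n_experts)
        topk_logits, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_logits, dim=-1)      # renormalize over the selected experts
        return weights, topk_idx                      # combine weights and expert indices

# Usage: route a toy batch of 16 token representations to 2 of 8 experts.
router = TopKRouter(d_model=64, n_experts=8, k=2)
weights, expert_ids = router(torch.randn(1, 16, 64))
print(weights.shape, expert_ids.shape)  # torch.Size([1, 16, 2]) torch.Size([1, 16, 2])
```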
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 66