Boltzmann Routing for Energy-Compatible Mixture of Experts

Published: 03 Mar 2026, Last Modified: 06 Mar 2026 · NFAM 2026 Poster · CC BY 4.0
Keywords: Energy-Based Models, MoE, Boltzmann Routing
TL;DR: We introduce Boltzmann Routing, a free-energy formulation of Mixture-of-Experts that restores exact energy compatibility in Energy Transformers but reveals a trade-off between variational purity and expert specialization at scale.
Abstract: The Energy Transformer (ET) recasts the forward pass as gradient descent on a scalar energy, connecting attention to Modern Hopfield Networks and associative memory. Scaling ETs via Mixture-of-Experts (MoE) breaks this variational structure: standard router weights depend on the token state, producing a router gradient residual that prevents the MoE output from being any energy's gradient. We propose \textbf{Boltzmann Routing}, which eliminates the external router and derives expert selection from a free-energy functional $\mathcal{F} = -\beta_r^{-1}\log\!\sum_e \exp(-\beta_r E_e)$. We prove that the negative gradient of~$\mathcal{F}$ exactly recovers the weighted expert output with zero residual, that the combined system admits a Lyapunov function, and that attention and routing are \emph{dual instances of the same associative retrieval mechanism}. Experiments across three scales (8 to 32 experts) show that Boltzmann routing achieves accuracy comparable to standard MoE (0.440 avg at 8 experts) \emph{without any auxiliary balancing loss}. A cross-scale analysis reveals a fundamental tension: exact energy compatibility comes at the cost of expert collapse at scale, and collapse count alone does not determine performance, with implications for energy-based routing more broadly.
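The zero-residual claim follows directly from differentiating the free-energy functional stated above. As a sketch (using the abstract's notation, with $x$ denoting the token state and each expert contributing a scalar energy $E_e(x)$), the gradient of $\mathcal{F}$ is the Boltzmann-weighted average of the expert gradients, so no separate router term appears:

$$
-\nabla_x \mathcal{F}
= -\nabla_x\!\left(-\beta_r^{-1}\log\!\sum_e \exp(-\beta_r E_e(x))\right)
= \sum_e \underbrace{\frac{\exp(-\beta_r E_e(x))}{\sum_{e'} \exp(-\beta_r E_{e'}(x))}}_{p_e(x)} \bigl(-\nabla_x E_e(x)\bigr).
$$

The routing weights $p_e(x)$ are exactly a softmax (Boltzmann distribution) over the negative expert energies, and the update is their weighted combination of expert descent directions, which is why no external router or auxiliary balancing loss is required for energy compatibility.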
Submission Number: 34