Hyper Experts: Language Models With Inference-Time Layer Reallocation

Published: 01 Mar 2026, Last Modified: 05 Apr 2026 · TTU at ICLR 2026 (Main) · CC BY 4.0
Abstract: We present Hyper Experts, a new architectural class of language models capable of test-time layer reallocation. By merging the multi-layer perceptrons of all layers into a single cross-layer shared pool of experts, the architecture can construct a specialized computational path for each input token directly at inference time. To validate our approach, we train Hyper Experts models and dense Transformer baselines at several backbone scales and compare the memory-computation trade-offs of the two designs. Empirically, Hyper Experts models outperform their dense counterparts, which we attribute to the computational flexibility afforded by the cross-layer expert-sharing principle. We provide architectural and training guidelines, together with an analysis of expert similarity and routing efficiency, to identify the key properties and untapped potential of the new architecture.
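The cross-layer expert-sharing idea described in the abstract can be illustrated with a minimal sketch: a single pool of MLP experts that every layer routes into, with a per-layer router selecting a per-token subset of experts. All names, sizes, and the top-k routing scheme below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, HIDDEN, N_EXPERTS, N_LAYERS, TOP_K = 16, 32, 8, 4, 2

# Single cross-layer pool of MLP experts (hypothetical parameterization):
# every layer routes into this same shared pool.
experts = [
    (rng.standard_normal((D, HIDDEN)) * 0.02,
     rng.standard_normal((HIDDEN, D)) * 0.02)
    for _ in range(N_EXPERTS)
]
# One router per layer, but all routers score the same shared experts.
routers = [rng.standard_normal((D, N_EXPERTS)) * 0.02 for _ in range(N_LAYERS)]

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def hyper_experts_forward(token):
    """Route one token (shape (D,)) through TOP_K shared experts per layer."""
    h = token
    path = []  # record which pool experts each layer selected for this token
    for W_r in routers:
        scores = softmax(h @ W_r)
        top = np.argsort(scores)[-TOP_K:]  # per-token expert selection
        out = np.zeros(D)
        for i in top:
            W_in, W_out = experts[i]
            out += scores[i] * (np.maximum(h @ W_in, 0.0) @ W_out)  # ReLU MLP
        h = h + out  # residual connection
        path.append(sorted(int(i) for i in top))
    return h, path

h, path = hyper_experts_forward(rng.standard_normal(D))
```

Because routing happens per token and per layer over one shared pool, two tokens can traverse entirely different expert sequences, which is the "layer reallocation" the abstract refers to.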
Submission Number: 63