Mixture of Neuron Experts

03 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Mixture of experts, Large language model, Pretraining
TL;DR: We achieve neuron-granular expert selection in Mixture-of-Experts models.
Abstract: In this work, we explore whether the parameters activated by the MoE layer remain highly sparse at inference. We perform a sparsification study on several representative MoE models. For each expert, we rank parameters by the magnitude of their activations from the gate projection and progressively prune the activated subset. Pruning up to $60\%$ of parameters within that subset causes only negligible task-performance degradation; substantial drops occur only after more than $90\%$ are removed. We further decompose experts into neuron-granular experts and visualize their activation values, finding that most neuron activations are near zero. This observation motivates us to select only high-activation neuron experts during pretraining. Based on this insight, we propose \emph{Mixture of Neuron Experts} (MoNE). MoNE applies a simple top-$k$ selection within each expert, incurs negligible latency, and requires no additional routing parameters or inter-expert communication. Extensive experiments demonstrate that MoNE matches standard MoE performance while activating only $50\%$ of the MoE-layer parameters, and it consistently outperforms traditional MoE when compared at equal numbers of activated parameters. These results suggest that MoNE is a practical approach to improving parameter utilization and inference efficiency in MoE-like models.
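The top-$k$ selection described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes a SwiGLU-style expert with gate, up, and down projections (function and parameter names here are hypothetical), and it zeroes all but the $k$ neurons with the largest gate-activation magnitudes per token.

```python
import numpy as np

def silu(z):
    # SiLU activation, commonly used as the gate nonlinearity
    return z / (1.0 + np.exp(-z))

def mone_expert_forward(x, w_gate, w_up, w_down, k):
    """Hypothetical sketch of neuron-granular top-k inside one expert.

    x: (tokens, d_model); w_gate, w_up: (d_model, d_ff); w_down: (d_ff, d_model)
    Only the k neurons with the largest |gate activation| per token are kept,
    so roughly k / d_ff of the expert's parameters are effectively active.
    """
    gate = silu(x @ w_gate)                                   # (tokens, d_ff)
    # Indices of the k largest-magnitude activations for each token
    idx = np.argpartition(-np.abs(gate), k - 1, axis=-1)[:, :k]
    mask = np.zeros_like(gate)
    np.put_along_axis(mask, idx, 1.0, axis=-1)                # keep top-k neurons
    hidden = (gate * mask) * (x @ w_up)                       # others contribute zero
    return hidden @ w_down                                    # (tokens, d_model)
```

Because the selection happens inside the expert after routing, this adds no routing parameters and no inter-expert communication, consistent with the design the abstract describes.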
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1523