Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

Submitted: 14 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · License: CC BY 4.0
Keywords: Large Language Models, Mixture-of-Experts, Sparse Models, Safety
TL;DR: This work reveals that the safety of MoE LLMs is as sparse as their architecture by discovering unsafe routes.
Abstract: By introducing routers to selectively activate experts in Transformer layers, the mixture-of-experts (MoE) architecture significantly reduces computational costs in large language models (LLMs) while maintaining competitive performance, especially for models with massive parameter counts. However, prior work has largely focused on utility and efficiency, overlooking the safety risks associated with this sparse architecture. In this work, we show that the safety of MoE LLMs is as sparse as their architecture by discovering $\text{\emph{unsafe routes}}$: routing configurations that, once activated, convert safe outputs into harmful ones. Specifically, we first introduce the $\underline{\text{Ro}}$uter $\underline{\text{Sa}}$fety $\underline{\text{i}}$mportance $\underline{\text{s}}$core ($\textbf{RoSais}$) to quantify the safety criticality of each layer's router. Manipulating only the router(s) with high RoSais scores can flip the default route into an unsafe one. For instance, on JailbreakBench, masking 5 routers in DeepSeek-V2-Lite increases the attack success rate (ASR) by over 4$\times$ to 0.79, highlighting an inherent risk that router manipulation may naturally occur in MoE LLMs. We further propose a $\underline{\text{F}}$ine-grained token-layer-wise $\underline{\text{S}}$tochastic $\underline{\text{O}}$ptimization framework to discover more concrete $\underline{\text{U}}$nsafe $\underline{\text{R}}$outes ($\textbf{F-SOUR}$), which explicitly accounts for the sequentiality and dynamics of input tokens. Across four representative MoE LLM families, F-SOUR achieves an average ASR of $\sim$0.90. Finally, we outline defensive perspectives, including safety-aware route disabling and router training, as promising directions to safeguard MoE LLMs.
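The abstract's core mechanism, scoring each layer's router by its safety criticality and then masking the top-scoring routers, can be illustrated with a toy sketch. Everything here is a hypothetical stand-in: the routers are random top-1 gates over four experts, the per-expert "safety weight" is invented, and the RoSais proxy (safety drop when a single router is masked) is an assumption for illustration, not the paper's actual definition.

```python
# Illustrative sketch of RoSais-style router scoring and masking.
# All names, logits, and the scoring rule are hypothetical assumptions.
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

# Toy model: 5 layers, each router holds logits over 4 experts.
ROUTER_LOGITS = [
    [2.0, 0.1, 0.1, 0.1],
    [0.1, 2.5, 0.3, 0.1],
    [0.2, 0.1, 1.8, 0.4],
    [1.5, 1.4, 0.1, 0.1],
    [0.1, 0.2, 0.1, 2.2],
]
SAFETY_WEIGHT = [0.9, 0.2, 0.7, 0.1]  # hypothetical per-expert safety

def path_safety(masked_layers=frozenset()):
    """Mean safety weight of the expert routed to at each layer.
    A masked router is forced to its least safe expert (worst case)."""
    score = 0.0
    for i, logits in enumerate(ROUTER_LOGITS):
        if i in masked_layers:
            expert = SAFETY_WEIGHT.index(min(SAFETY_WEIGHT))
        else:
            probs = softmax(logits)
            expert = probs.index(max(probs))
        score += SAFETY_WEIGHT[expert]
    return score / len(ROUTER_LOGITS)

def rosais(layer):
    """Illustrative RoSais proxy: safety drop when only this router is masked."""
    return path_safety() - path_safety({layer})

# Rank routers by the proxy score and mask the two most safety-critical ones.
ranked = sorted(range(len(ROUTER_LOGITS)), key=rosais, reverse=True)
top_k = ranked[:2]
print(top_k, round(path_safety(set(top_k)), 3))
```

In this toy setting, masking just the two highest-scoring routers drops the path's safety score from 0.56 to 0.24, mirroring the paper's observation that manipulating a handful of high-RoSais routers suffices to flip the default route into an unsafe one.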
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 5279