BadMoE: Backdooring Mixture-of-Experts LLMs via Optimizing Routing Triggers and Infecting Dormant Experts

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Mixture-of-Experts LLMs, backdoor attack, routing optimization
Abstract: Mixture-of-Experts (MoE) architectures are rapidly becoming the standard for building scalable, efficient large language models (LLMs). Their open availability, however, exposes them to supply-chain backdoor attacks, in which an adversary modifies a checkpoint and redistributes the poisoned version. MoE's intrinsic sparsity further amplifies this risk, since small changes in the set of activated experts can disproportionately influence the model's output. In this work, we propose BadMoE, a novel backdoor attack that exploits the overlooked structural vulnerabilities introduced by expert sparsity and routing. We first provide theoretical intuition that the MoE output can be governed by a small set of dominating experts. Guided by this insight, BadMoE poisons underutilized ("dormant") experts and uses routing-aware triggers to activate them, enabling stealthy and effective manipulation. Specifically, BadMoE involves three steps: 1) identifying dormant experts unrelated to the target task, 2) optimizing a routing-aware trigger toward these experts, and 3) promoting them to dominating roles through poisoned training data. Extensive experiments on three MoE LLMs across multiple backdoor tasks show that BadMoE, using only two injected experts, reliably controls model outputs, outperforms existing attacks, and evades current defenses. By leveraging architectural sparsity and dynamic usage profiling, our approach uncovers backdoor vulnerabilities in MoE LLMs that traditional attacks overlook, highlighting security risks in emerging sparse architectures.
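To make the attack pipeline concrete, the PyTorch sketch below illustrates steps 1 and 2 on a toy router: profiling per-expert utilization over calibration data to find dormant experts, then optimizing a continuous trigger embedding so the router's top-k selection favors them. This is a minimal illustration of the general idea, not the paper's implementation; all names, shapes, and hyperparameters (num_experts, hidden_dim, top_k, the stand-in router) are assumptions for the sketch.

import torch

torch.manual_seed(0)

# Toy stand-ins; real MoE LLMs expose one router (gate) per MoE layer.
num_experts, hidden_dim, top_k = 8, 16, 2
router = torch.nn.Linear(hidden_dim, num_experts, bias=False)
router.weight.requires_grad_(False)  # the attacker only optimizes the trigger here

# --- Step 1: profile expert utilization on calibration data ---
def expert_utilization(hidden_states: torch.Tensor) -> torch.Tensor:
    """Fraction of tokens for which each expert lands in the router's top-k."""
    topk = router(hidden_states).topk(top_k, dim=-1).indices   # (tokens, top_k)
    counts = torch.bincount(topk.flatten(), minlength=num_experts)
    return counts.float() / counts.sum()

calib = torch.randn(4096, hidden_dim)   # stand-in for target-task hidden states
usage = expert_utilization(calib)
dormant = usage.argsort()[:2]           # two least-used ("dormant") experts
print("dormant experts:", dormant.tolist())

# --- Step 2: optimize a routing-aware trigger toward the dormant experts ---
trigger = torch.randn(1, hidden_dim, requires_grad=True)
opt = torch.optim.Adam([trigger], lr=0.1)
for _ in range(300):
    logits = router(trigger)
    # Push the dormant experts' routing logits above all competitors.
    loss = -logits[0, dormant].sum() + logits.logsumexp(dim=-1).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("top-k routing under trigger:",
      router(trigger).topk(top_k, dim=-1).indices.tolist())

Step 3 (not shown) would fine-tune the dormant experts on poisoned data so that, once the trigger routes to them, they dominate the layer's output.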
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 10698