Icarus's Wings: Disabling MoE Offloading Acceleration via a Universal Hidden Prefix Attack
Abstract: Online LLM inference services demand low latency under tight memory budgets. Mixture-of-Experts (MoE) models meet memory constraints by activating only a few Experts, computing the active ones on GPUs and offloading the others to CPUs. To reduce latency, acceleration systems use Expert caching and prefetching strategies, which assume that Expert activations exhibit temporal locality and that the Experts to prefetch can be predicted. When these assumptions fail, loading the wrong Experts inflates latency, enabling Denial-of-Service (DoS) attacks. Existing LLM DoS attacks aim to extend the LLM's generation length; they incur high attack costs, transfer weakly, and lack MoE-specific analysis.
In this work, we expose this vulnerability in GPU-centric MoE offloading acceleration and present Icarus, a gradient-based universal attack that injects an adversarial prefix embedding to disable such acceleration. Icarus first incorporates Cache Temporal Locality Minimization (TLM) and Prefetch Expert Prediction Misleading (EPM) to systematically model MoE decoding behavior and adversarially break the acceleration. A scheduler then balances the two attack targets. Experiments show that, by prepending a single prefix embedding to any user input, Icarus increases Expert replacements by 85.4\% and decreases prefetch accuracy by 12.5\%, achieving an average 0.7$\times$ decoding slowdown under SOTA GPU-centric acceleration strategies across models and devices.
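To make the temporal-locality assumption concrete, the following is a minimal toy sketch, not the Icarus method or any system evaluated in the paper. It assumes a least-recently-used (LRU) Expert cache, and the function name `count_expert_loads` and both routing traces are hypothetical. It counts how many Experts must be loaded from CPU memory when the same cache sees a trace with strong locality and a trace with none.

```python
from collections import OrderedDict


def count_expert_loads(trace, cache_size):
    """Count CPU-to-GPU Expert loads for an LRU cache over a routing trace.

    `trace` is a sequence of Expert ids, one per decoding step (hypothetical).
    LRU eviction is an assumption made for illustration only.
    """
    cache = OrderedDict()  # keys are Expert ids, ordered oldest to newest use
    loads = 0
    for expert in trace:
        if expert in cache:
            # Cache hit: mark the Expert as most recently used.
            cache.move_to_end(expert)
        else:
            # Cache miss: the Expert must be loaded from CPU memory.
            loads += 1
            if len(cache) >= cache_size:
                cache.popitem(last=False)  # evict the least recently used
            cache[expert] = True
    return loads


# Hypothetical traces: the first reuses two Experts, the second cycles
# through three Experts with a cache that only holds two.
local_trace = [0, 1, 0, 1, 0, 1, 0, 1]
adversarial_trace = [0, 1, 2, 0, 1, 2, 0, 1]

print(count_expert_loads(local_trace, cache_size=2))
print(count_expert_loads(adversarial_trace, cache_size=2))
```

With a cache of two Experts, the first trace misses only on its first two steps, while the second trace misses on every step. This is the kind of cache thrashing that a routing pattern without temporal locality would induce.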