ActiveEvict: Budget-Aware Pre-Eviction for Efficient MoE Offloading

ACL ARR 2026 January Submission 3695 Authors

04 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Mixture-of-Experts, Expert Offloading, Pre-Eviction, Budget-Aware Routing, I/O-Bound Inference
Abstract: Mixture-of-Experts (MoE) architectures enable scaling Large Language Models (LLMs) by decoupling model capacity from computation. However, their large parameter footprint makes expert offloading to host memory necessary, creating I/O-bound inference bottlenecks. Existing methods rely on prefetching to hide latency but remain limited by short computation windows and passive one-for-one eviction under static budgets. We observe that coupling eviction and loading on the critical path causes frequent pipeline stalls and poor cache utilization. To address this, we propose ActiveEvict, a framework that proactively evicts experts and performs budget-aware routing, transforming static memory budgets into dynamic effective budgets. This decoupling reduces I/O stalls and enables better GPU memory utilization. Experiments show that ActiveEvict reduces blocking I/O time by up to 46% compared to state-of-the-art prefetch methods, while incurring less than 1% accuracy loss, demonstrating significant throughput improvement under constrained memory.
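The two mechanisms the abstract names can be illustrated with a toy cache model. This is a minimal sketch under assumed semantics, not the paper's implementation: `pre_evict` frees slots off the critical path during compute windows (so a later miss pays only the load cost, not evict-plus-load), and `route` falls back to a cached expert when the budget is exhausted (budget-aware routing). All names, costs, and the LRU policy are illustrative assumptions.

```python
from collections import OrderedDict

class ExpertCache:
    """Toy model of GPU-resident expert slots under a static budget.

    Hypothetical sketch: the API, unit costs, and LRU policy are
    assumptions for illustration, not ActiveEvict's actual design.
    """

    def __init__(self, budget, evict_cost=1.0, load_cost=1.0):
        self.budget = budget
        self.evict_cost = evict_cost
        self.load_cost = load_cost
        self.cache = OrderedDict()   # expert_id -> True, LRU order
        self.free_slots = 0          # slots freed ahead of time
        self.blocking_io = 0.0       # I/O paid on the critical path

    def pre_evict(self, n=1):
        # Proactive eviction: drop LRU experts during a compute window,
        # so the evict cost never lands on the critical path.
        while n > 0 and self.cache:
            self.cache.popitem(last=False)
            self.free_slots += 1
            n -= 1

    def route(self, wanted, scores):
        # Hit: serve the requested expert and refresh its LRU position.
        if wanted in self.cache:
            self.cache.move_to_end(wanted)
            return wanted
        # Budget-aware routing: if the budget is exhausted, reroute to
        # the highest-scoring expert already resident on the GPU.
        ranked = sorted(scores, key=scores.get, reverse=True)
        cached = [e for e in ranked if e in self.cache]
        if cached and self.free_slots == 0:
            self.cache.move_to_end(cached[0])
            return cached[0]
        # Miss with no fallback: evict synchronously only if no slot
        # was pre-freed (the passive one-for-one baseline behavior).
        if self.free_slots == 0 and len(self.cache) >= self.budget:
            self.cache.popitem(last=False)
            self.blocking_io += self.evict_cost
        elif self.free_slots > 0:
            self.free_slots -= 1
        self.blocking_io += self.load_cost
        self.cache[wanted] = True
        return wanted
```

In this model, a miss absorbed by a pre-freed slot costs `load_cost` instead of `evict_cost + load_cost`, and a budget-aware fallback costs nothing, which is the source of the reduced blocking I/O the abstract reports.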
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 3695