ActiveEvict: Budget-Aware Pre-Eviction for Efficient MoE Offloading

ACL ARR 2026 January Submission 3695 Authors

04 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Mixture-of-Experts, Expert Offloading, Pre-Eviction, Budget-Aware Routing, I/O-Bound Inference
Abstract: Mixture-of-Experts (MoE) architectures enable scaling Large Language Models (LLMs) by decoupling model capacity from computation. However, their large parameter footprint makes expert offloading to host memory necessary, creating I/O-bound inference bottlenecks. Existing methods rely on prefetching to hide latency but remain limited by short computation windows and passive one-for-one eviction under static budgets. We observe that coupling eviction and loading on the critical path causes frequent pipeline stalls and poor cache utilization. To address this, we propose ActiveEvict, a framework that proactively evicts experts and performs budget-aware routing, transforming static memory budgets into dynamic effective budgets. This decoupling reduces I/O stalls and enables better GPU memory utilization. Experiments show that ActiveEvict reduces blocking I/O time by up to 46% compared to state-of-the-art prefetch methods, while incurring less than 1% accuracy loss, demonstrating significant throughput improvement under constrained memory.
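The two mechanisms the abstract names can be illustrated with a toy cache model. This is a minimal sketch under assumed semantics, not the paper's implementation: `pre_evict` frees slots off the critical path during compute windows (so a later miss pays only the load cost, not evict-plus-load), and `route` falls back to a cached expert when the budget is exhausted (budget-aware routing). All names, costs, and the LRU policy are illustrative assumptions.

```python
from collections import OrderedDict

class ExpertCache:
    """Toy model of GPU-resident expert slots under a static budget.

    Hypothetical sketch: the API, unit costs, and LRU policy are
    assumptions for illustration, not ActiveEvict's actual design.
    """

    def __init__(self, budget, evict_cost=1.0, load_cost=1.0):
        self.budget = budget
        self.evict_cost = evict_cost
        self.load_cost = load_cost
        self.cache = OrderedDict()   # expert_id -> True, LRU order
        self.free_slots = 0          # slots freed ahead of time
        self.blocking_io = 0.0       # I/O paid on the critical path

    def pre_evict(self, n=1):
        # Proactive eviction: drop LRU experts during a compute window,
        # so the evict cost never lands on the critical path.
        while n > 0 and self.cache:
            self.cache.popitem(last=False)
            self.free_slots += 1
            n -= 1

    def route(self, wanted, scores):
        # Hit: serve the requested expert and refresh its LRU position.
        if wanted in self.cache:
            self.cache.move_to_end(wanted)
            return wanted
        # Budget-aware routing: if the budget is exhausted, reroute to
        # the highest-scoring expert already resident on the GPU.
        ranked = sorted(scores, key=scores.get, reverse=True)
        cached = [e for e in ranked if e in self.cache]
        if cached and self.free_slots == 0:
            self.cache.move_to_end(cached[0])
            return cached[0]
        # Miss with no fallback: evict synchronously only if no slot
        # was pre-freed (the passive one-for-one baseline behavior).
        if self.free_slots == 0 and len(self.cache) >= self.budget:
            self.cache.popitem(last=False)
            self.blocking_io += self.evict_cost
        elif self.free_slots > 0:
            self.free_slots -= 1
        self.blocking_io += self.load_cost
        self.cache[wanted] = True
        return wanted
```

In this model, a miss absorbed by a pre-freed slot costs `load_cost` instead of `evict_cost + load_cost`, and a budget-aware fallback costs nothing, which is the source of the reduced blocking I/O the abstract reports.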
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 3695