Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference

TMLR Paper4520 Authors

19 Mar 2025 (modified: 26 Mar 2025) · Under review for TMLR · CC BY 4.0
Abstract: Mixture of Experts (MoE) LLMs enhance performance by selectively activating specialized subnetworks ("experts") per input. While MoEs offer efficiency benefits through distributed inference in typical high-throughput settings, deploying them on memory-constrained devices remains challenging, particularly for sequential token generation with batch size one. In this work, we optimize MoE inference for such constrained environments, where only a subset of expert weights fits into DRAM. Through empirical analysis, we show that MoEs can tolerate careful deviations in expert selection with minimal loss in predictive performance. Inspired by this observation, we propose a novel cache-aware routing strategy that leverages expert reuse during token generation to significantly improve cache locality. Evaluated on language modeling, MMLU, and GSM8K benchmarks, our method reduces cache miss rates by over 50%, with negligible impact on perplexity (0.1%–3%) and downstream task accuracy (<0.1%). Unlike prior methods limited by the optimal oracle cache bound, our approach surpasses this theoretical limit by allowing slight flexibility in expert selection. Finally, we present on-device results demonstrating 2$\times$ speedups on mobile hardware, offering a flexible, training-free solution that extends the applicability of MoEs to real-world applications.
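To make the cache-aware routing idea concrete, the sketch below shows one way such a policy could look: when an expert already resident in the DRAM cache scores within a small margin of the router's top-k choice, it is selected in place of an uncached expert. This is a minimal illustrative sketch only; the function name `cache_aware_topk`, the `tolerance` margin, and the `cached_experts` set are assumptions for this example, not the paper's actual implementation.

```python
import numpy as np

def cache_aware_topk(router_logits, cached_experts, k=2, tolerance=0.1):
    """Illustrative cache-aware routing sketch (hypothetical, not the paper's method).

    Prefers experts already resident in the DRAM cache whenever their router
    score is within `tolerance` of the k-th best score, so the selected set
    deviates only slightly from the router's original top-k choice.
    """
    order = np.argsort(router_logits)[::-1]               # experts sorted by score, best first
    threshold = router_logits[order[k - 1]] - tolerance    # lowest acceptable score

    # First pass: take cached experts whose score clears the threshold.
    selected = [int(e) for e in order
                if int(e) in cached_experts and router_logits[e] >= threshold][:k]

    # Second pass: fill any remaining slots with the best-scoring uncached experts.
    for e in order:
        if len(selected) == k:
            break
        if int(e) not in selected:
            selected.append(int(e))
    return selected

# Example: 8 experts, experts {1, 5, 6} currently cached, pick top-2.
logits = np.array([0.3, 1.1, 0.2, 1.2, 0.1, 1.15, 0.9, 0.4])
print(cache_aware_topk(logits, cached_experts={1, 5, 6}, k=2, tolerance=0.1))
# -> [5, 1]: both cached and within tolerance of the best score, so no cache miss is incurred.
```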
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Colin_Raffel1
Submission Number: 4520