Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference

TMLR Paper4520 Authors

19 Mar 2025 (modified: 26 Mar 2025) · Under review for TMLR · CC BY 4.0
Abstract: Mixture of Experts (MoE) LLMs enhance performance by selectively activating specialized subnetworks ("experts") per input. While MoEs offer efficiency benefits through distributed inference in typical high-throughput settings, deploying them on memory-constrained devices remains challenging, particularly for sequential token generation with batch size one. In this work, we optimize MoE inference for such constrained environments, where only a subset of expert weights fits into DRAM. Through empirical analysis, we show that MoEs can tolerate careful deviations in expert selection with minimal loss in predictive performance. Inspired by this observation, we propose a novel cache-aware routing strategy that leverages expert reuse during token generation to significantly improve cache locality. Evaluated on language modeling, MMLU, and GSM8K benchmarks, our method reduces cache miss rates by over 50%, with negligible impact on perplexity (0.1%–3%) and downstream task accuracy (<0.1%). Unlike prior methods limited by the optimal oracle cache bound, our approach surpasses this theoretical limit by allowing slight flexibility in expert selection. Finally, we present on-device results demonstrating 2$\times$ speedups on mobile hardware, offering a flexible, training-free solution that extends the applicability of MoEs to real-world applications.
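To make the cache-aware routing idea concrete, the sketch below shows one way such a policy could look: when an expert already resident in the DRAM cache scores within a small margin of the router's top-k choice, it is selected in place of an uncached expert. This is a minimal illustrative sketch only; the function name `cache_aware_topk`, the `tolerance` margin, and the `cached_experts` set are assumptions for this example, not the paper's actual implementation.

```python
import numpy as np

def cache_aware_topk(router_logits, cached_experts, k=2, tolerance=0.1):
    """Illustrative cache-aware routing sketch (hypothetical, not the paper's method).

    Prefers experts already resident in the DRAM cache whenever their router
    score is within `tolerance` of the k-th best score, so the selected set
    deviates only slightly from the router's original top-k choice.
    """
    order = np.argsort(router_logits)[::-1]               # experts sorted by score, best first
    threshold = router_logits[order[k - 1]] - tolerance    # lowest acceptable score

    # First pass: take cached experts whose score clears the threshold.
    selected = [int(e) for e in order
                if int(e) in cached_experts and router_logits[e] >= threshold][:k]

    # Second pass: fill any remaining slots with the best-scoring uncached experts.
    for e in order:
        if len(selected) == k:
            break
        if int(e) not in selected:
            selected.append(int(e))
    return selected

# Example: 8 experts, experts {1, 5, 6} currently cached, pick top-2.
logits = np.array([0.3, 1.1, 0.2, 1.2, 0.1, 1.15, 0.9, 0.4])
print(cache_aware_topk(logits, cached_experts={1, 5, 6}, k=2, tolerance=0.1))
# -> [5, 1]: both cached and within tolerance of the best score, so no cache miss is incurred.
```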
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Colin_Raffel1
Submission Number: 4520