Keywords: On-Device, MoE, Efficiency
TL;DR: We propose a new family of Mixture-of-Experts models specifically optimized for on-device inference.
Abstract: Sparse Mixture-of-Experts (MoE) models are widely used as foundation architectures at large scale, yet remain under-explored at smaller sizes. In this work, we introduce Compact Sparse Mixture of Experts (CoSMoEs) for on-device inference, addressing three key challenges: quality, memory, and latency. On the quality front, we conduct a fair evaluation (removing confounding factors) and show that MoE architectures outperform dense models at on-device scale. We further propose weight-decomposed experts, which improve MoE performance beyond the standard formulation. On the memory and latency front, we address the prohibitively large parameter count of MoE models by improving expert-offloading efficiency through a novel training-time loss, reducing inference latency for on-device deployment.
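To make the abstract's terms concrete, below is a minimal, illustrative sketch of a sparse MoE layer with top-k token routing, where each expert's feed-forward weights are stored as low-rank factors. This is only one plausible reading of "weight-decomposed experts"; the module names, the low-rank factorization, and all hyperparameters here are assumptions for illustration, not the paper's actual method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecomposedExpert(nn.Module):
    """One feed-forward expert whose weight matrices are stored as low-rank
    factors (U @ V) rather than full dense matrices. Illustrative assumption:
    the paper's exact decomposition may differ."""

    def __init__(self, d_model: int, d_hidden: int, rank: int):
        super().__init__()
        # W_in (d_model -> d_hidden) approximated by two thin projections
        self.u_in = nn.Linear(d_model, rank, bias=False)
        self.v_in = nn.Linear(rank, d_hidden, bias=False)
        # W_out (d_hidden -> d_model) approximated the same way
        self.u_out = nn.Linear(d_hidden, rank, bias=False)
        self.v_out = nn.Linear(rank, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.v_out(self.u_out(F.silu(self.v_in(self.u_in(x)))))


class SparseMoE(nn.Module):
    """Token-level top-k routing over weight-decomposed experts."""

    def __init__(self, d_model=256, d_hidden=1024, n_experts=8, rank=64, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            DecomposedExpert(d_model, d_hidden, rank) for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        logits = self.router(x)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        # Dispatch each token to its selected experts and combine outputs
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out


# Quick shape check
x = torch.randn(16, 256)
print(SparseMoE()(x).shape)  # torch.Size([16, 256])
```

Storing each expert as low-rank factors shrinks the per-expert parameter count, which is one way the memory pressure described in the abstract could be reduced; the training-time offloading loss mentioned there is not sketched here.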
Submission Number: 7