Sparse MoE Students for Efficient Knowledge Distillation

CVPR 2025 Workshop Robo-3Dvlm Submission 11

15 May 2025 (modified: 09 Jun 2025) · Submitted to Robo-3Dvlm · CC BY 4.0
Keywords: Knowledge Distillation, Mixture of Experts, Sparse Routing, Modular Neural Networks, Model Compression, Attention-based Gating, Efficient Deep Learning
TL;DR: We propose a sparse, modular MoE student for knowledge distillation that outperforms dense baselines and even the teacher under lower compute budgets, using class-agnostic experts and adaptive routing strategies.
Abstract: We propose a compact and modular student architecture for knowledge distillation (KD) based on a sparse Mixture-of-Experts (MoE) framework. Unlike conventional dense student models, our design uses a set of lightweight, class-agnostic experts whose outputs are dynamically routed via input-conditioned gating. We systematically compare multiple routing strategies—soft, top-$k$, and attention-enhanced variants—and evaluate their impact across accuracy, computational cost, and expert utilization. Experiments on CIFAR-10 and CIFAR-100 show that sparse MoE students not only outperform dense baselines under similar or lower resource budgets, but also achieve superior parameter-efficiency and more consistent expert usage. Notably, attention-based routing consistently yields the best trade-off between accuracy and cost. Our findings highlight the structural benefits of modular sparse students in KD, offering improved generalization, interpretability, and efficiency without requiring class supervision.
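To make the described architecture concrete, below is a minimal PyTorch sketch of a sparse MoE student with class-agnostic expert heads and input-conditioned top-$k$ gating, trained with a standard distillation objective. All names, dimensions, and hyperparameters (e.g. `SparseMoEStudent`, `num_experts=8`, `top_k=2`, temperature `T`) are illustrative assumptions, not the authors' implementation; for simplicity the sketch evaluates every expert and masks by the gate, whereas a true sparse implementation would dispatch only to the selected experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEStudent(nn.Module):
    """Hypothetical sparse MoE student: lightweight, class-agnostic expert MLPs
    combined via input-conditioned top-k gating."""

    def __init__(self, in_dim=512, hidden_dim=128, num_classes=10,
                 num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Lightweight, class-agnostic experts: each is a small MLP head.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(in_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, num_classes),
            )
            for _ in range(num_experts)
        ])
        # Input-conditioned gate producing one score per expert.
        self.gate = nn.Linear(in_dim, num_experts)

    def forward(self, x):
        gate_logits = self.gate(x)                                 # (B, E)
        topk_vals, topk_idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)                     # renormalize over the k selected experts
        # For clarity, compute all experts and gather the selected ones;
        # an efficient implementation would route tokens to experts instead.
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)        # (B, E, C)
        selected = torch.gather(
            expert_out, 1,
            topk_idx.unsqueeze(-1).expand(-1, -1, expert_out.size(-1)))      # (B, k, C)
        return (weights.unsqueeze(-1) * selected).sum(dim=1)                 # (B, C)


def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Standard KD objective: KL to temperature-softened teacher logits plus CE on hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

Soft routing would replace the top-$k$ selection with a full softmax over all experts, and an attention-enhanced gate would condition the routing scores on learned query/key projections of the input; both variants only change the gating path, leaving the expert heads and the distillation loss untouched.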
Submission Number: 11