Keywords: Mixture-of-experts, deep learning
TL;DR: Inspired by the human education model, we propose a novel task, knowledge integration, to obtain a dense student model (OneS) that is as knowledgeable as a sparse MoE.
Abstract: The human education system trains one student with multiple experts. Mixture-of-Experts (MoE) is a powerful sparse architecture that likewise comprises multiple experts. However, sparse MoE models are prone to overfitting, hard to deploy, and not hardware-friendly for practitioners. In this work, inspired by the human education model, we propose a novel task, knowledge integration, to obtain a dense student model (OneS) that is as knowledgeable as a sparse MoE. We investigate this task by exploring four different ways to gather knowledge from the MoE to initialize a dense student model, and we then refine the dense student by knowledge distillation. We evaluate our model on both vision and language tasks. Experimental results show that, with a $3.7\times$ inference speedup, the dense student still preserves $88.2\%$ of the benefits of its MoE counterpart.
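Below is a minimal sketch of what knowledge integration could look like in practice, assuming a standard Transformer-style MoE whose experts are two-layer feed-forward blocks. The gathering strategy shown (plain weight averaging of all experts) and the names `ExpertFFN`, `gather_by_averaging`, and `distillation_loss` are illustrative assumptions; the abstract does not specify the four gathering methods used in the paper.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """A single MoE expert: the usual two-layer feed-forward block (assumed architecture)."""
    def __init__(self, d_model=256, d_hidden=1024):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))


def gather_by_averaging(experts):
    """One plausible gathering strategy: average the weights of all experts
    into a single dense FFN that initializes the student (hypothetical)."""
    dense = copy.deepcopy(experts[0])
    with torch.no_grad():
        for name, param in dense.named_parameters():
            stacked = torch.stack(
                [dict(e.named_parameters())[name] for e in experts]
            )
            param.copy_(stacked.mean(dim=0))
    return dense


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard knowledge-distillation objective: soft targets from the MoE
    teacher blended with the hard-label cross-entropy loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard


# Usage: initialize a dense student FFN from 8 MoE experts, then refine with KD.
experts = [ExpertFFN() for _ in range(8)]
student_ffn = gather_by_averaging(experts)
```

The intent of the sketch is only to separate the two stages the abstract names: a weight-level gathering step that seeds the dense student from the MoE, followed by knowledge distillation against the sparse teacher.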