One Student Knows All Experts Know: From Sparse to Dense

01 Mar 2023 (modified: 11 Apr 2023) · Submitted to Tiny Papers @ ICLR 2023
Keywords: Mixture-of-experts, deep learning
TL;DR: Inspired by the human education model, we propose a novel task, knowledge integration, to obtain a dense student model (OneS) that is as knowledgeable as a sparse MoE.
Abstract: The human education system trains one student with multiple experts. Mixture-of-experts (MoE) is a powerful sparse architecture that likewise comprises multiple experts. However, a sparse MoE model is prone to overfitting, hard to deploy, and not hardware-friendly for practitioners. In this work, inspired by the human education model, we propose a novel task, knowledge integration, to obtain a dense student model (OneS) that is as knowledgeable as a sparse MoE. We investigate this task by exploring four different ways to gather knowledge from the MoE to initialize a dense student model, and we then refine the dense student by knowledge distillation. We evaluate our model on both vision and language tasks. Experimental results show that, with a $3.7\times$ inference speedup, the dense student still preserves $88.2\%$ of the benefits of its MoE counterpart.
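A minimal sketch of the knowledge-integration idea described above: the abstract does not spell out its four knowledge-gathering strategies, so uniform averaging of expert weights is shown here only as one illustrative choice, and all function and tensor names below are hypothetical rather than taken from the paper.

```python
# Sketch: initialize a dense student layer from MoE experts, then refine
# it with standard knowledge distillation against the sparse MoE teacher.
# ASSUMPTION: weight averaging is only one plausible gathering strategy;
# the paper's four actual strategies are not described in this abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


def average_experts(expert_ffns: list[nn.Linear]) -> nn.Linear:
    """Build one dense FFN layer by averaging the weights of several experts."""
    dense = nn.Linear(expert_ffns[0].in_features, expert_ffns[0].out_features)
    with torch.no_grad():
        dense.weight.copy_(torch.stack([e.weight for e in expert_ffns]).mean(dim=0))
        dense.bias.copy_(torch.stack([e.bias for e in expert_ffns]).mean(dim=0))
    return dense


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard KD objective: soft targets from the MoE teacher plus hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

Because the student keeps a dense architecture, it avoids the routing and deployment overhead of the sparse MoE at inference time, which is the source of the reported speedup.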
