Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers

Tianlong Chen; Zhenyu Zhang; AJAY KUMAR JAISWAL; Shiwei Liu; Zhangyang Wang

Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers

Tianlong Chen, Zhenyu Zhang, AJAY KUMAR JAISWAL, Shiwei Liu, Zhangyang Wang

Published: 01 Feb 2023, Last Modified: 22 Jun 2025ICLR 2023 notable top 25%Readers: Everyone

Keywords: Sparse Mixture-of-Experts, Random Routing, Transformer Training, Dropout

TL;DR: A new plug-and-paly strategy for training over-parameterized transformer models, leverages SMoEs with random routings to empower scaling transformers to better performance in the full capacity settings without collapse.

Abstract: Despite their remarkable achievement, gigantic transformers encounter significant drawbacks, including exorbitant computational and memory footprints during training, as well as severe collapse evidenced by a high degree of parameter redundancy. Sparsely-activated Mixture-of-Experts (SMoEs) have shown promise to mitigate the issue of training efficiency, yet they are prone to (1) $\textit{redundant experts}$ due to representational collapse; and (2) $\textit{poor expert scalability for inference and downstream fine-tuning}$, primarily due to overfitting of the learned routing policy to the number of activated experts during training. As recent research efforts are predominantly focused on improving routing policies to encourage expert specializations, this work focuses on $\textit{exploring the overlooked scalability bottleneck of SMoEs}$ and leveraging it to effectively $\textbf{scale dense transformers}$. To this end, we propose a new plug-and-play training framework, $\textbf{SMoE-Dropout}$, to enable scaling transformers to better accuracy in their full capacity without collapse. Specifically, SMoE-Dropout consists of a $\textit{randomly initialized and fixed}$ router network to activate experts and gradually increases the activated expert number as training progresses over time. Transformers trained by SMoE-Dropout naturally exhibit a $\textbf{``self-slimmable”}$ property subject to resource availability, offering smooth and consistent performance boosts with an increase in activated experts during inference or fine-tuning. Our extensive experiments across diverse transformer architectures on a variety of tasks demonstrate the superior performance and substantial computation savings of SMoE-Dropout, compared to dense training baselines with equivalent parameter counts. In particular, our trained BERT outperforms its densely trained counterpart with consistent improvements of {$1.03\%$, $0.78\%$, $1.09\%$} on challenging reasoning tasks {$\texttt{ASDiv-A}$, $\texttt{MAWPS}$, $\texttt{SVAMP}$}, respectively. Codes and models are available in https://github.com/VITA-Group/Random-MoE-as-Dropout.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Submission Guidelines: Yes

Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning

Supplementary Material: zip

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 2 code implementations](https://www.catalyzex.com/paper/sparse-moe-as-the-new-dropout-scaling-dense/code)

22 Replies

Loading