Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers

ICML 2022 (modified: 03 Feb 2023)
Abstract: Sparsely activated transformers, such as Mixture of Experts (MoE), have received great interest due to their outrageous scaling capability, which enables dramatic increases in model size without s...