Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers

ICML 2022 (modified: 03 Feb 2023)
Abstract: Sparsely activated transformers, such as Mixture of Experts (MoE), have received great interest due to their outrageous scaling capability, which enables dramatic increases in model size without s...