Published: 2021, Last Modified: 12 May 2023, ICML 2021, Readers: Everyone
Abstract: Transformer architectures are widely used, but training them is non-trivial, requiring custom learning rate schedules, scaling terms, residual connections, careful placement of submodules such as n...