On Layer Normalization in the Transformer Architecture

2020 (modified: 09 Sept 2021), ICML 2020, Readers: Everyone
Abstract: The Transformer is widely used in natural language processing tasks. To train a Transformer, however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial...