ICML 2021 (modified: 29 Oct 2021)
Abstract: Transformer models with multi-head attention require caching intermediate results for efficient inference in generation tasks. However, the cache brings new memory-related costs and prevents leveraging...
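To make the "caching intermediate results" concrete, below is a minimal sketch of the standard key/value caching pattern used for incremental decoding with multi-head attention, which is the baseline setting the abstract refers to; it is not the paper's proposed method, and names such as `decode_step` and `kv_cache` are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's method) of key/value caching
# for autoregressive decoding with multi-head attention.
import numpy as np

def split_heads(x, n_heads):
    # (seq, d_model) -> (n_heads, seq, d_head)
    seq, d_model = x.shape
    return x.reshape(seq, n_heads, d_model // n_heads).transpose(1, 0, 2)

def decode_step(x_t, Wq, Wk, Wv, kv_cache, n_heads):
    """One autoregressive step: project the new token, append its key/value
    to the cache, and attend over all cached positions."""
    q = split_heads(x_t @ Wq, n_heads)          # (h, 1, d_head)
    k = split_heads(x_t @ Wk, n_heads)          # (h, 1, d_head)
    v = split_heads(x_t @ Wv, n_heads)
    # The cache grows by one position per generated token; this growth is
    # the memory-related cost the abstract mentions.
    kv_cache["k"] = np.concatenate([kv_cache["k"], k], axis=1)
    kv_cache["v"] = np.concatenate([kv_cache["v"], v], axis=1)
    scores = q @ kv_cache["k"].transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    out = weights @ kv_cache["v"]               # (h, 1, d_head)
    return out.transpose(1, 0, 2).reshape(1, -1), kv_cache

d_model, n_heads = 8, 2
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
cache = {"k": np.zeros((n_heads, 0, d_model // n_heads)),
         "v": np.zeros((n_heads, 0, d_model // n_heads))}
for _ in range(3):                               # generate three tokens
    x_t = rng.standard_normal((1, d_model))
    _, cache = decode_step(x_t, Wq, Wk, Wv, cache, n_heads)
print(cache["k"].shape)                          # (2, 3, 4): cache grows each step
```

The sketch only shows why per-layer, per-head key/value tensors must be stored across decoding steps; the paper's contribution concerns reducing the costs this cache introduces.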