EL-Attention: Memory Efficient Lossless Attention for Generation

ICML 2021
Abstract: Transformer models with multi-head attention require caching intermediate results for efficient inference in generation tasks. However, the cache brings new memory-related costs and prevents leveraging...
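
The "intermediate results" the abstract refers to are the keys and values of previously generated tokens, which are cached so each decoding step does not recompute them. Below is a minimal sketch of this standard key/value caching during autoregressive decoding (single attention head, illustrative names and sizes); it shows ordinary KV caching and its memory growth, not the paper's EL-Attention method.

# Minimal sketch of key/value caching in autoregressive decoding.
# All names and sizes are illustrative, not from the paper.
import numpy as np

d = 64  # head dimension (illustrative)

def attention_step(q, k_cache, v_cache, k_new, v_new):
    """Attend from the current query to all cached keys/values plus the new ones."""
    k = np.concatenate([k_cache, k_new], axis=0)   # (t, d) - cache grows each step
    v = np.concatenate([v_cache, v_new], axis=0)   # (t, d) - this growth is the memory cost
    scores = q @ k.T / np.sqrt(d)                  # (1, t) scaled dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over all cached positions
    out = weights @ v                              # (1, d) attention output
    return out, k, v                               # return output and updated cache

# Usage: decode a few steps, carrying the cache forward.
k_cache = np.zeros((0, d))
v_cache = np.zeros((0, d))
for step in range(4):
    q = np.random.randn(1, d)
    k_new = np.random.randn(1, d)
    v_new = np.random.randn(1, d)
    out, k_cache, v_cache = attention_step(q, k_cache, v_cache, k_new, v_new)
print(k_cache.shape)  # (4, 64): cache size scales with the generated length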
