On IO-Efficient Attention Mechanisms: Context-Aware Bifurcated Attention and the Generalized Multi-Group Attention
Keywords: efficient inference, multi-query, scaling laws, latency optimization, batch sampling
Abstract: Multi-query attention, a method that compresses all heads in the key and value tensors to a single head, is known to improve inference efficiency by reducing the key and value tensor cache, allowing incremental decoding at high batch sizes and context lengths. However, questions remain about how such compression affects performance compared to traditional multi-head attention. In this paper, we investigate the scaling laws and performance of multi-query versus multi-head attention mechanisms, including a generalized multi-group attention that enables varying degrees of key-value compression. Our study reveals that each attention family exhibits smooth and consistent performance scaling as model size increases, with higher compression corresponding to lower performance and an upward shift in the validation-loss-versus-size scaling curve. This finding implies that a multi-query model of comparable performance must be slightly larger; we therefore present a comprehensive comparison of multi-head and multi-query models in terms of latency tradeoffs, finding that in high-workload scenarios the larger multi-query model can still be much more efficient. Additionally, we propose a novel context-aware bifurcated attention for single-context batch sampling that substantially reduces memory IO, especially at high batch sizes and context lengths. Bifurcated attention is an exact computation technique that divides the attention computation into context and decoding components. Although it uses the same number of FLOPs as the original attention, it avoids redundant memory loading, resulting in much lower latency and making multiple real-time recommendations available at little extra latency cost.
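For illustration, here is a minimal PyTorch sketch of the bifurcated-attention idea at a single decoding step, under the assumption that the context key/value tensors are shared across the batch while the decoded key/value tensors are per-sample; the function name, shapes, and tensor layout are our own illustrative choices, not the paper's implementation:

```python
import torch

def bifurcated_attention(q, k_ctx, v_ctx, k_dec, v_dec, scale):
    """Exact attention split into a shared-context part and a per-sample decoding part.

    q:      (batch, heads, 1, d)        query for the current decoding step
    k_ctx:  (heads, ctx_len, d)         context keys, shared across the batch
    v_ctx:  (heads, ctx_len, d)         context values, shared across the batch
    k_dec:  (batch, heads, dec_len, d)  per-sample decoded keys
    v_dec:  (batch, heads, dec_len, d)  per-sample decoded values
    """
    # Logits over the shared context: the context KV is read once instead of
    # being replicated for every batch element, which is where the IO savings come from.
    logits_ctx = torch.einsum('bhqd,hkd->bhqk', q, k_ctx) * scale
    # Logits over each sample's own decoded tokens.
    logits_dec = torch.einsum('bhqd,bhkd->bhqk', q, k_dec) * scale
    # A joint softmax over the concatenated logits keeps the computation exact.
    weights = torch.softmax(torch.cat([logits_ctx, logits_dec], dim=-1), dim=-1)
    w_ctx, w_dec = weights.split([k_ctx.shape[-2], k_dec.shape[-2]], dim=-1)
    # Combine the two weighted sums of values into the final attention output.
    out = torch.einsum('bhqk,hkd->bhqd', w_ctx, v_ctx) \
        + torch.einsum('bhqk,bhkd->bhqd', w_dec, v_dec)
    return out
```

The sketch performs the same floating-point operations as standard attention over the concatenated context and decoded tokens; the benefit is purely in memory IO, since the shared context cache is loaded once per step rather than once per batch element.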
Submission Number: 2