Keywords: Grouped-Query Attention, Transformer, LoRA
TL;DR: We propose an activation-informed method for transforming pretrained multi-head attention into grouped-query attention.
Abstract: Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA).
To transform an MHA into a GQA, neighbouring query heads in MHA are evenly split into groups, where each group shares a single pair of key and value layers (see the sketch after the abstract).
In this work, we propose AsymGQA, an activation-informed approach that asymmetrically groups MHA heads into a GQA for better model performance.
Our AsymGQA outperforms GQA within the same model size budget. For example, AsymGQA LLaMA-2-7B achieves a 7.5\% accuracy increase on MMLU compared to neighbour grouping. Our approach addresses GQA's trade-off between model performance and hardware efficiency.
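For reference, below is a minimal sketch of the symmetric neighbour-grouping baseline the abstract describes: neighbouring key/value heads are averaged into one shared head per group. The function name `neighbour_group_kv`, the mean-pooling choice, and the tensor shapes are illustrative assumptions, not the paper's AsymGQA procedure.

```python
import torch

def neighbour_group_kv(weight: torch.Tensor, num_heads: int, num_groups: int) -> torch.Tensor:
    """Convert an MHA key or value projection to a GQA one by mean-pooling
    neighbouring heads into `num_groups` shared heads.

    Assumes `weight` has shape (num_heads * head_dim, hidden_dim), the usual
    row layout of a key/value projection matrix.
    """
    out_dim, hidden_dim = weight.shape
    head_dim = out_dim // num_heads
    heads_per_group = num_heads // num_groups
    # Split rows into (group, head-within-group, head_dim, hidden_dim) and
    # average the neighbouring heads that fall into the same group.
    grouped = weight.view(num_groups, heads_per_group, head_dim, hidden_dim)
    return grouped.mean(dim=1).reshape(num_groups * head_dim, hidden_dim)


# Example: 32 MHA key heads shared across 8 GQA groups (LLaMA-2-7B-like sizes).
k_proj = torch.randn(32 * 128, 4096)
k_proj_gqa = neighbour_group_kv(k_proj, num_heads=32, num_groups=8)
print(k_proj_gqa.shape)  # torch.Size([1024, 4096])
```

After this re-grouping, each query head keeps its own projection, and the shared key/value head of a group is broadcast across all query heads assigned to that group.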
Submission Number: 49