Keywords: LLM, Quantization, Pruning
Abstract: The proliferation of Large Language Models (LLMs) with extended context windows is severely hampered by the quadratic complexity of the self-attention mechanism. Existing acceleration methods, such as sparse attention and quantization, often employ uniform compression strategies that are misaligned with the non-uniform distribution of information importance within attention maps. This leads to a suboptimal trade-off between computational efficiency and model accuracy. To address this, we introduce Block-based Mixed-precision Attention (BMAttn), a novel framework that enables fine-grained, importance-aware precision while maintaining a hardware-friendly structure. BMAttn partitions each attention head into high-precision, low-precision, and sparse regions. To ensure computational regularity, these regions are block-aligned. To adapt to varying input lengths, their boundaries are dynamically adjusted using a lightweight affine windowing mechanism. We further propose a saliency-weighted calibration method and a layer-adaptive regularizer to automatically determine the optimal parameters, achieving a superior accuracy-efficiency balance. BMAttn achieves a speedup of up to 3.3× without any accuracy degradation, and a 5× speedup with only a 1% accuracy loss.
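The abstract gives no implementation details, so the following is only a minimal, illustrative sketch of the block-tiered idea: it partitions a single head's score matrix into block-aligned high-precision, low-precision, and sparse regions using a crude pooled-saliency proxy and simulated int8 quantization. The block size, tier fractions, the `fake_quantize_int8` helper, and the pooling-based saliency are all assumptions; BMAttn's affine windowing, saliency-weighted calibration, and layer-adaptive regularizer are not reproduced here.

```python
# Illustrative sketch of block-wise mixed-precision attention (NOT the authors'
# BMAttn kernel). Assumptions: block saliency is approximated by pooled |QK^T|,
# "low precision" is simulated with int8 fake quantization, and "sparse" blocks
# are skipped entirely (left at -inf before the softmax). Non-causal, single head.
import torch
import torch.nn.functional as F

def fake_quantize_int8(x: torch.Tensor) -> torch.Tensor:
    """Simulate int8 quantization with a per-tensor scale (hypothetical helper)."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    return torch.round(x / scale) * scale

def block_mixed_precision_attention(q, k, v, block=64, hi_frac=0.2, sparse_frac=0.5):
    """q, k, v: [seq_len, head_dim] for one head; tiers are assigned per block."""
    n, d = q.shape
    scale = d ** -0.5
    scores = torch.full((n, n), float("-inf"))

    # Cheap saliency proxy: block-mean scores from average-pooled q and k.
    n_blocks = (n + block - 1) // block
    q_pool = F.adaptive_avg_pool1d(q.t().unsqueeze(0), n_blocks).squeeze(0).t()
    k_pool = F.adaptive_avg_pool1d(k.t().unsqueeze(0), n_blocks).squeeze(0).t()
    saliency = (q_pool @ k_pool.t()).abs()          # [n_blocks, n_blocks]

    flat = saliency.flatten()
    hi_thresh = torch.quantile(flat, 1.0 - hi_frac)
    sparse_thresh = torch.quantile(flat, sparse_frac)

    for bi in range(n_blocks):
        for bj in range(n_blocks):
            s = saliency[bi, bj]
            if bi == bj:
                tier = "hi"                          # keep local blocks exact
            elif s < sparse_thresh:
                continue                             # sparse tier: block skipped
            elif s < hi_thresh:
                tier = "lo"
            else:
                tier = "hi"
            r = slice(bi * block, min((bi + 1) * block, n))
            c = slice(bj * block, min((bj + 1) * block, n))
            qb, kb = q[r], k[c]
            if tier == "lo":                         # low-precision tier
                qb, kb = fake_quantize_int8(qb), fake_quantize_int8(kb)
            scores[r, c] = (qb @ kb.t()) * scale     # high-precision tier as-is

    return F.softmax(scores, dim=-1) @ v

# Usage: one head with 512 tokens and head dimension 64.
q, k, v = (torch.randn(512, 64) for _ in range(3))
out = block_mixed_precision_attention(q, k, v)       # [512, 64]
```

Keeping the diagonal blocks in the high-precision tier guarantees every query row retains at least one computed block, so the softmax never sees an all-masked row; how BMAttn itself handles this case is not stated in the abstract.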
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 1941