Keywords: LLM, Quantization, Pruning
Abstract: The proliferation of Large Language Models (LLMs) with extended context windows is severely hampered by the quadratic complexity of the self-attention mechanism. Existing acceleration methods, such as sparse attention and quantization, often employ uniform compression strategies that are misaligned with the non-uniform distribution of information importance within attention maps. This leads to a suboptimal trade-off between computational efficiency and model accuracy. To address this, we introduce Block-based Mixed-precision Attention (BMAttn), a novel framework that enables fine-grained, importance-aware precision while maintaining a hardware-friendly structure. BMAttn partitions each attention head into high-precision, low-precision, and sparse regions. To ensure computational regularity, these regions are block-aligned. To adapt to varying input lengths, their boundaries are dynamically adjusted using a lightweight affine windowing mechanism. We further propose a saliency-weighted calibration method and a layer-adaptive regularizer to automatically determine the optimal parameters, achieving a superior accuracy-efficiency balance. BMAttn achieves a speedup of up to 3.3× without any accuracy degradation, and a 5× speedup with only a 1% accuracy loss.
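The abstract gives no implementation details, so the following is only a minimal, illustrative sketch of the block-tiered idea: it partitions a single head's score matrix into block-aligned high-precision, low-precision, and sparse regions using a crude pooled-saliency proxy and simulated int8 quantization. The block size, tier fractions, the `fake_quantize_int8` helper, and the pooling-based saliency are all assumptions; BMAttn's affine windowing, saliency-weighted calibration, and layer-adaptive regularizer are not reproduced here.

```python
# Illustrative sketch of block-wise mixed-precision attention (NOT the authors'
# BMAttn kernel). Assumptions: block saliency is approximated by pooled |QK^T|,
# "low precision" is simulated with int8 fake quantization, and "sparse" blocks
# are skipped entirely (left at -inf before the softmax). Non-causal, single head.
import torch
import torch.nn.functional as F

def fake_quantize_int8(x: torch.Tensor) -> torch.Tensor:
    """Simulate int8 quantization with a per-tensor scale (hypothetical helper)."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    return torch.round(x / scale) * scale

def block_mixed_precision_attention(q, k, v, block=64, hi_frac=0.2, sparse_frac=0.5):
    """q, k, v: [seq_len, head_dim] for one head; tiers are assigned per block."""
    n, d = q.shape
    scale = d ** -0.5
    scores = torch.full((n, n), float("-inf"))

    # Cheap saliency proxy: block-mean scores from average-pooled q and k.
    n_blocks = (n + block - 1) // block
    q_pool = F.adaptive_avg_pool1d(q.t().unsqueeze(0), n_blocks).squeeze(0).t()
    k_pool = F.adaptive_avg_pool1d(k.t().unsqueeze(0), n_blocks).squeeze(0).t()
    saliency = (q_pool @ k_pool.t()).abs()          # [n_blocks, n_blocks]

    flat = saliency.flatten()
    hi_thresh = torch.quantile(flat, 1.0 - hi_frac)
    sparse_thresh = torch.quantile(flat, sparse_frac)

    for bi in range(n_blocks):
        for bj in range(n_blocks):
            s = saliency[bi, bj]
            if bi == bj:
                tier = "hi"                          # keep local blocks exact
            elif s < sparse_thresh:
                continue                             # sparse tier: block skipped
            elif s < hi_thresh:
                tier = "lo"
            else:
                tier = "hi"
            r = slice(bi * block, min((bi + 1) * block, n))
            c = slice(bj * block, min((bj + 1) * block, n))
            qb, kb = q[r], k[c]
            if tier == "lo":                         # low-precision tier
                qb, kb = fake_quantize_int8(qb), fake_quantize_int8(kb)
            scores[r, c] = (qb @ kb.t()) * scale     # high-precision tier as-is

    return F.softmax(scores, dim=-1) @ v

# Usage: one head with 512 tokens and head dimension 64.
q, k, v = (torch.randn(512, 64) for _ in range(3))
out = block_mixed_precision_attention(q, k, v)       # [512, 64]
```

Keeping the diagonal blocks in the high-precision tier guarantees every query row retains at least one computed block, so the softmax never sees an all-masked row; how BMAttn itself handles this case is not stated in the abstract.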
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 1941