Keywords: accumulation precision; block floating-point quantization; MAC; deep learning
TL;DR: We predict the lower bound of accumulation precision for GEMM in deep learning inference under Block Floating-Point (BFP) arithmetic, and complete a hardware design based on the predicted accumulation precision.
Abstract: Block Floating Point (BFP) quantization offers a hardware-efficient trade-off between numerical range and precision. Previous studies have quantized weights and activations to extremely low precision using BFP arithmetic. However, we find that as the precision of weights and activations is reduced, accumulation becomes the hardware bottleneck of the BFP multiply-accumulate (MAC) unit. Existing attempts to reduce accumulation precision in matrix multiplication generally preserve model performance by training with a pre-selected, fixed accumulation precision. However, selecting a precision that is too low causes notable performance degradation, and these studies lack an effective approach to establish the lower precision limit, potentially incurring considerable training cost. We therefore propose a statistical method to analyze the impact of reduced accumulation precision on deep learning inference. Because BFP matrix multiplication involves both fixed-point and floating-point accumulation, we formulate a set of equations that relate the data range of fixed-point multiply-accumulate operations and the effect of floating-point swamping to the BFP quantization parameters, the accumulation length, the model weights, and the minimum number of accumulation bits, thereby determining the appropriate accumulation precision. Applied to Llama2-7B on MMLU, BERT-Large and BERT-Base on SQuAD-v1.1, and ResNet-50 on CIFAR-10, our precision settings yield performance close to the FP32 baseline, whereas further precision reduction degrades performance, indicating that our settings are close to the precision limit. Guided by our equations, the resulting hardware achieves a 13.7\%-28.7\% improvement in area and power efficiency over high-precision accumulation under the same quantization configuration, and a $10.3\times$ area reduction and $11.0\times$ power reduction compared to a traditional BFP16 implementation.
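For intuition only (this is not the paper's statistical model): a minimal sketch of the conventional worst-case bound on the fixed-point accumulator width in a BFP dot product, which the proposed analysis is intended to tighten. The function name and parameters below are hypothetical.

```python
import math

def worst_case_accumulator_bits(mantissa_bits: int, acc_length: int) -> int:
    """Conventional worst-case width of the signed fixed-point accumulator in a
    BFP dot product: the product of two signed b-bit mantissas fits in 2*b bits,
    and summing acc_length such products can add up to ceil(log2(acc_length))
    carry bits. A statistical analysis (as in the paper) may justify fewer bits."""
    product_bits = 2 * mantissa_bits
    carry_bits = math.ceil(math.log2(acc_length))
    return product_bits + carry_bits

# Example: 4-bit BFP mantissas accumulated over an inner dimension of 4096.
print(worst_case_accumulator_bits(mantissa_bits=4, acc_length=4096))  # 20 bits
```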
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 189