Accumulator-Aware Post-Training Quantization for Large Language Models

Ian Colbert; Giuseppe Franco; Fabian Grob; Jinjie Zhang; Rayan Saab

25 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: Accumulators, Deep Learning, Inference, Quantization

TL;DR: We introduce a low-overhead framework for accumulator-aware post-training quantization that significantly improves the tradeoff between accumulator bit width and model accuracy in quantized large language models.

Abstract: Several recent studies have investigated low-precision accumulation, reporting improvements in throughput, power, and area across various platforms. However, the accompanying proposals have only considered the quantization-aware training (QAT) paradigm, in which models are fine-tuned or trained from scratch with quantization in the loop. As models continue to grow in size, QAT techniques become increasingly expensive, which has motivated the recent surge in post-training quantization (PTQ) research. To the best of our knowledge, this work is the first formal study of accumulator-aware quantization in the PTQ setting. To bridge this gap, we introduce AXE, a practical, low-overhead framework of accumulator-aware extensions designed to endow existing layer-wise PTQ algorithms with overflow avoidance guarantees. We theoretically motivate AXE and demonstrate its flexibility by implementing it on top of two state-of-the-art PTQ algorithms: GPFQ and OPTQ. We further generalize AXE to support multi-stage accumulation for the first time, opening the door for full datapath optimization and scaling to large language models (LLMs). We evaluate AXE across autoregressive language generation models and observe significant improvements in the tradeoff between accumulator bit width and model accuracy over baseline methods.
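
To make the notion of an overflow avoidance guarantee concrete, the sketch below computes a conservative worst-case bound for a single integer dot product: the magnitude of the result can never exceed the largest input magnitude times the ℓ1 norm of the quantized weights. This is not the AXE algorithm, only a standard bound of the kind such guarantees rest on; the function names `min_accumulator_bits` and `fits_accumulator` are hypothetical and do not come from the paper.

```python
def worst_case_magnitude(weights, input_bits, signed_inputs=False):
    """Conservative bound on |sum(w_i * x_i)| over all representable inputs.

    Assumption: inputs are integers with `input_bits` bits, either unsigned
    ([0, 2^N - 1]) or two's-complement signed ([-2^(N-1), 2^(N-1) - 1]).
    """
    if signed_inputs:
        max_abs_input = 2 ** (input_bits - 1)
    else:
        max_abs_input = 2 ** input_bits - 1
    l1_norm = sum(abs(w) for w in weights)
    return max_abs_input * l1_norm


def min_accumulator_bits(weights, input_bits, signed_inputs=False):
    """Smallest signed accumulator width P with worst case <= 2^(P-1) - 1."""
    worst = worst_case_magnitude(weights, input_bits, signed_inputs)
    # worst <= 2^(P-1) - 1  <=>  bit_length(worst) <= P - 1
    return worst.bit_length() + 1


def fits_accumulator(weights, input_bits, acc_bits, signed_inputs=False):
    """True if no input can overflow a signed `acc_bits`-bit accumulator."""
    return min_accumulator_bits(weights, input_bits, signed_inputs) <= acc_bits


# Hypothetical integer weights for one output channel, with 8-bit unsigned inputs.
example_weights = [3, -2, 1, -4]
required_bits = min_accumulator_bits(example_weights, input_bits=8)
```

Read in reverse, the same inequality says that fixing a target accumulator width caps the ℓ1 norm the quantized weights may have, which is the kind of budget an accumulator-aware quantizer has to respect. The bound is deliberately loose for unsigned inputs, where positive and negative weights cannot both contribute to the worst case at once.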

Primary Area: infrastructure, software libraries, hardware, systems, etc.

Submission Number: 4028
