Scaling Overhead Matters: Saliency-Aware Graph-Based Efficient Post-Training Quantization for LLMs

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Post-Training Quantization, Deployment Efficiency, Large Language Models, Graph-Based Clustering
Abstract: Post-training quantization (PTQ) is essential for efficiently deploying large language models (LLMs) in resource-constrained environments. Recent PTQ methods have pushed weights toward near-binary precision, but existing binarization schemes rely on position-based heuristics or fixed saliency assumptions, leading to untracked scaling overhead and limited adaptability across model architectures. We propose SAGE-PTQ (Saliency-Aware Graph-based Efficient PTQ), a novel approach to arbitrary-bit quantization of LLMs. Our formulation comprises five key components: (1) saliency-aware weight filtering, which identifies salient weights from weight-distribution statistics; (2) affinity-based weight grouping, which models inlier weights with a subsampled graph structure to capture attention patterns and determine the optimal number of weight groups; (3) dual-mode quantizer optimization, which iteratively optimizes weight-matrix quantization and minimizes scaling overhead by assigning a single per-channel scale to multi-bit salient weights and a scalar per-group scale to binarized inlier weights; (4) adaptive saliency thresholding, which dynamically adjusts the saliency percentage to minimize quantization error; and (5) an efficient inference runtime, which uses a layer-wise lookup to load binarized weights for accelerated inference. Our approach achieves an average of 1.03 weight bits and 0.004 scaling bits per matrix, significantly outperforming state-of-the-art binarization schemes such as BiLLM and PB-LLM. On LLaMA-2-7B, we obtain a WikiText2 perplexity of 5.87 (vs. 32.48 for BiLLM). Compared to BiLLM, our method uses less than 50% of the device memory and only 6.5% of the FP16 model size, enabling 1.5× faster token decoding on LLaMA-2-70B with a single NVIDIA L40 GPU. These results demonstrate strong potential for efficient inference on edge devices. SAGE-PTQ code will be released soon.
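To make the dual-mode scaling idea concrete, the following is a minimal NumPy sketch, not the paper's implementation: it assumes a simple magnitude-based saliency criterion, a 2-bit symmetric quantizer with one per-channel scale for salient weights, and mean-absolute-value binarization with one scalar scale per group for the inliers. All function names, default percentages, and group sizes here are illustrative assumptions.

```python
import numpy as np

def dual_mode_quantize(W, saliency_pct=0.05, group_size=128, bits=2):
    """Illustrative sketch of a dual-mode quantizer (assumed details):
    salient outliers keep multi-bit precision with a per-channel scale,
    while inliers are binarized with a scalar per-group scale."""
    W = np.asarray(W, dtype=np.float64)
    rows, cols = W.shape
    # Saliency filter: keep the top-k weights by magnitude (assumed criterion).
    k = max(1, int(saliency_pct * W.size))
    thresh = np.partition(np.abs(W).ravel(), -k)[-k]
    salient = np.abs(W) >= thresh

    Q = np.zeros_like(W)
    qmax = 2 ** (bits - 1) - 1  # symmetric signed range for multi-bit weights
    for r in range(rows):
        # Salient weights: multi-bit, one scale per output channel (row).
        sal = salient[r]
        if sal.any():
            s = np.abs(W[r, sal]).max() / qmax
            Q[r, sal] = np.clip(np.round(W[r, sal] / s), -qmax, qmax) * s
        # Inlier weights: 1-bit sign with one scalar scale per group.
        inl = ~sal
        for g in range(0, cols, group_size):
            idx = np.arange(g, min(g + group_size, cols))
            idx = idx[inl[idx]]
            if idx.size:
                alpha = np.abs(W[r, idx]).mean()  # L2-optimal binary scale
                Q[r, idx] = alpha * np.sign(W[r, idx])
    return Q, salient
```

Because only the small salient fraction carries a per-channel scale and each inlier group stores a single scalar, the per-matrix scaling overhead stays tiny, which is the effect the abstract's 0.004 scaling-bits figure refers to.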
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 22923