Keywords: Post training quantization, Deployment Efficiency, Large Language Models, Graph-based Clustering
Abstract: Post-training quantization (PTQ) is essential for efficiently deploying large language
models (LLMs) in resource-constrained environments. Recent PTQ methods
have achieved near-binary precision. However, existing binarization methods rely
on position-based heuristics or fixed saliency assumptions, leading to untracked
scaling overhead and limited adaptability across model architectures. We propose
SAGE-PTQ (Saliency-Aware Graph-based Efficient PTQ), a novel approach for
arbitrary-bit quantization of LLMs. Our formulation comprises five key components:
(1) Saliency-Aware Weight Filtering: identifies salient weights based on
weight distribution statistics; (2) Affinity-Based Weight Grouping: models inlier
weights with a subsampled graph structure to capture attention patterns and
determine the optimal number of weight groups; (3) Dual-Mode Quantizer Optimization:
iteratively optimizes weight matrix quantization, minimizing scaling
overhead by assigning a single per-channel scale to multi-bit salient weights and
a scalar per-group scale to binarized inlier weights; (4) Adaptive Saliency Thresholding:
dynamically adjusts the saliency percentage to optimally minimize quantization
error; (5) Efficient Inference Runtime: implements a layer-wise lookup
to efficiently load binarized weights for accelerated inference. Our approach
achieves an average of 1.03 weight bits and 0.004 scaling bits per matrix, significantly
outperforming SoTA schemes like BiLLM and PB-LLM binarization.
Evaluations on LLaMA-2-7B yield perplexity of 5.87 (vs. 32.48 for BiLLM) on
WikiText2. Compared to BiLLM, our method uses less than 50% of device memory
and only 6.5% of the FP16 model size, enabling 1.5× faster token decoding
on LLaMA-2-70B with a single NVIDIA L40 GPU. These results demonstrate strong potential
for efficient inference on edge devices. SAGE-PTQ code will be released soon.
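The dual-mode quantizer of component (3) can be illustrated with a minimal NumPy sketch: salient weights (selected by magnitude) keep multiple bits with one symmetric scale per output channel, while the remaining inlier weights are binarized with a single scalar scale per group. The function name, the 4-bit default for salient weights, the magnitude-quantile saliency rule, and the group size are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def dual_mode_quantize(W, salient_frac=0.03, bits_salient=4, group_size=128):
    """Hypothetical sketch of a dual-mode quantizer (not the paper's code).

    Salient weights: multi-bit, one symmetric scale per row (channel).
    Inlier weights: sign * alpha, one scalar scale alpha per group,
    with alpha = mean(|w|) over the group (the L2-optimal choice for
    a fixed sign pattern).
    """
    W = np.asarray(W, dtype=np.float64)
    thresh = np.quantile(np.abs(W), 1.0 - salient_frac)
    salient_mask = np.abs(W) >= thresh

    W_hat = np.zeros_like(W)

    # Multi-bit salient weights: per-channel symmetric scale.
    qmax = 2 ** (bits_salient - 1) - 1
    for r in range(W.shape[0]):
        row_sal = salient_mask[r]
        if not row_sal.any():
            continue
        scale = np.abs(W[r, row_sal]).max() / qmax
        q = np.clip(np.round(W[r, row_sal] / scale), -qmax - 1, qmax)
        W_hat[r, row_sal] = q * scale

    # Binarized inlier weights: scalar per-group scale.
    flat_idx = np.flatnonzero(~salient_mask.ravel())
    out = W_hat.ravel()  # writable view into W_hat
    for start in range(0, len(flat_idx), group_size):
        idx = flat_idx[start:start + group_size]
        g = W.ravel()[idx]
        alpha = np.abs(g).mean()
        out[idx] = np.sign(g) * alpha
    return W_hat, salient_mask
```

This keeps the per-matrix scaling overhead small in the way the abstract describes: one scale per channel for the few salient weights, and one scalar per group for the binarized majority.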
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 22923