HyperChr: Quantization of Heterogeneously Distributed Matrices through Distribution-Aware Subspace Partitioning

25 Sept 2024 (modified: 17 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: matrix quantization; LLMs; heterogeneous distribution; product quantization
TL;DR: We introduce HyperChr, a matrix quantization algorithm that adapts to heterogeneous per-column data distributions, improving accuracy and reducing quantization error, especially when compressing large-model weights in real-world scenarios.
Abstract: Matrix quantization is crucial for reducing the memory footprint of matrices in applications ranging from large-scale machine learning models to data compression. We observe that matrices from different application domains exhibit heterogeneous distributions across their columns. Leveraging this characteristic, we introduce \textit{HyperChr}, a novel matrix quantization algorithm tailored to the heterogeneous data distributions found across matrix columns. Unlike traditional quantization methods, \textit{HyperChr} exploits the distribution characteristics of each column to partition high-dimensional subspaces and perform compression within each subspace. Grouping vectors with similar distribution ranges enables more precise quantization and improves compression effectiveness. Moreover, \textit{HyperChr} dynamically adjusts the number of centroids in each subspace based on the specific data distribution, optimizing both storage efficiency and data fidelity. We evaluate \textit{HyperChr} on diverse datasets, demonstrating that it reduces quantization error more than existing methods. At lower compression ratios ($\theta = 2-8$), \textit{HyperChr} reduces MAE by an average of 55.3\% and MSE by 75.3\% compared to PQ; at higher compression ratios ($\theta = 10-16$), the improvements are more moderate, with average reductions of 14.9\% in MAE and 25.9\% in MSE compared to PQ. In addition, our algorithm reduces average dequantization time by 62.9\%, which is important for large language model inference.
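The abstract describes the mechanism only at a high level, and the submission is withdrawn with no public code, so the following is a minimal sketch of the general idea as we read it: group columns by their distribution range, run PQ-style k-means quantization within each group, and scale the per-group codebook size with the group's spread. The function names (`hyperchr_like_quantize`, `dequantize`), the range-based grouping rule, and the centroid-scaling heuristic are our assumptions, not the paper's actual algorithm.

```python
# Hedged sketch of distribution-aware subspace partitioning, NOT the authors'
# released code. Assumed proxies: column value range stands in for "distribution
# range"; codebook size scales with a group's relative spread ("dynamic centroids").
import numpy as np
from sklearn.cluster import KMeans

def hyperchr_like_quantize(X, n_groups=4, base_centroids=16):
    """Quantize matrix X (n_rows x n_cols) one column group at a time.

    Returns per-group (column indices, codebook, codes) from which an
    approximation of X can be reconstructed.
    """
    n_rows, _ = X.shape
    # 1. Characterize each column by its distribution range.
    col_ranges = X.max(axis=0) - X.min(axis=0)
    # 2. Partition columns into groups with similar ranges (sort, then split).
    order = np.argsort(col_ranges)
    groups = np.array_split(order, n_groups)
    quantized = []
    for cols in groups:
        sub = X[:, cols]  # subspace: each row restricted to this column group
        # 3. Scale the number of centroids with the group's relative spread
        #    (illustrative heuristic, not the paper's exact rule).
        spread = col_ranges[cols].mean() / (col_ranges.mean() + 1e-12)
        k = max(2, min(n_rows, int(base_centroids * spread)))
        km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(sub)
        quantized.append((cols, km.cluster_centers_, km.labels_))
    return quantized

def dequantize(quantized, shape):
    """Reconstruct an approximation of X from the per-group codebooks."""
    X_hat = np.empty(shape)
    for cols, codebook, codes in quantized:
        X_hat[:, cols] = codebook[codes]
    return X_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Columns with heterogeneous scales, mimicking the setting in the abstract.
    X = rng.normal(size=(256, 32)) * rng.uniform(0.1, 10.0, size=32)
    X_hat = dequantize(hyperchr_like_quantize(X), X.shape)
    print("MAE:", np.abs(X - X_hat).mean())
```

In this reading, the contrast with plain PQ is that subspaces are formed from columns with similar distribution ranges rather than from contiguous dimensions, so each codebook covers a narrower value range and quantizes more precisely.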
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4670