Quantum Entanglement Trees: Optimizing Quantized Matrix Quantization via Element Replacement and Residual Clustering
Keywords: Matrix quantization, LLM Weight Quantization, KV Cache Quantization, Residual Quantization
TL;DR: We introduce Quantum Entanglement Trees, an algorithm that optimizes matrix quantization by reordering elements to exploit local orderliness, significantly enhancing quantization in the weights of LLMs and KV caches.
Abstract: Matrix quantization entails representing the elements of a matrix in a more space-efficient form to reduce storage usage, with dequantization restoring the original matrix for use. We formulate the Quantization Error Minimization (QEM) problem as minimizing the distance between a matrix before and after quantization, under the condition that the quantized matrix occupies the same memory space. Matrix quantization is crucial in various applications, including Large Language Model (LLM) weight quantization, vector databases, KV cache quantization, graph compression, and image compression. Recent advances in LLMs, such as GPT-4 and BERT, have highlighted the importance of matrix compression due to the large size of model parameters and the KV cache, both of which are stored as matrices.
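In the notation chosen here for illustration (the symbols are not from the submission itself), the QEM objective described above can be written as:

```latex
\min_{\hat{A}} \;\; \lVert A - \hat{A} \rVert_F^2
\quad \text{s.t.} \quad \mathrm{mem}(\hat{A}) \le B,
```

where \(A\) is the original matrix, \(\hat{A}\) is the matrix recovered after dequantization, \(\lVert \cdot \rVert_F\) is the Frobenius norm (so the objective is proportional to the MSE reported in the experiments), and \(B\) is the memory budget that the quantized representation must fit within.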
We propose Quantum Entanglement Trees (QET) to address the QEM problem by exploiting the local orderliness of matrix elements: QET iteratively swaps elements to form a locally ordered matrix, which is then grouped and quantized column by column. To enhance QET, we introduce two optimizations: Residual Quantization Optimization (RQO), which reduces MSE by quantizing the residuals between the original and dequantized matrices, and Codebook Quantization Optimization (CQO), which reduces storage requirements by compressing the codebook itself.
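The residual idea behind RQO can be illustrated with a minimal sketch. This is not the submission's implementation; it uses a plain uniform scalar quantizer (an assumption made here for illustration) to show why quantizing the residual between the original and dequantized matrices lowers the MSE:

```python
import numpy as np

def quantize(x, n_bits=4):
    """Uniform scalar quantization of an array to 2**n_bits levels."""
    lo, hi = float(x.min()), float(x.max())
    levels = 2 ** n_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale).astype(np.int32)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Map integer codes back to approximate real values."""
    return codes * scale + lo

rng = np.random.default_rng(0)
A = rng.normal(size=(64, 64))

# First pass: quantize the matrix directly.
codes1, lo1, s1 = quantize(A)
A_hat = dequantize(codes1, lo1, s1)

# Residual pass (the RQO idea): quantize the leftover error and add it back.
residual = A - A_hat
codes2, lo2, s2 = quantize(residual)
A_hat2 = A_hat + dequantize(codes2, lo2, s2)

mse_single = np.mean((A - A_hat) ** 2)
mse_residual = np.mean((A - A_hat2) ** 2)
print(mse_residual < mse_single)  # residual pass shrinks the error
```

The residual occupies a much narrower range than the original matrix, so a second quantization pass over it resolves far finer detail; the trade-off, as in the QEM formulation, is that the extra codes consume part of the memory budget.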
Experimental results demonstrate that QET reduces MSE to 5.05%, 13.33%, and 11.89% of that of the current best method on the LLM dataset, the K cache, and the V cache, respectively.
Our contributions include the abstraction of the QEM problem, the design of the QET algorithm, and the proposal of two optimizations to improve accuracy and speed.
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1565