k-Odd One Clear (k-OOC), a novel GPU kernel that improves the quantization accuracy and speed of the GPTQ algorithm

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · Everyone · Revisions · BibTeX · CC BY 4.0
Keywords: quantization, LLM, GPTQ, BitNet
TL;DR: k-OOC is a new GPU kernel that can improve quantization accuracy and speed
Abstract: Large Language Models (LLMs) have demonstrated tremendously useful applications in today's fast-evolving, AI-driven technology landscape. As model sizes grow, so does the demand for larger and faster GPUs. One way to alleviate this issue is to compress the trained model through quantization so that devices with less VRAM can run it. Quantization paradigms such as GPTQ, PB-LLM, and BiLLM (Hessian-based with structural searching) are successful quantization mechanisms. In this paper, we propose **OOC**, a technique that picks an "odd" group to improve quantization clarity so that the model retains better overall reasoning capability. In addition, we define the **Bit Family** ($A^{lim},A^{max}$) to classify the compression rates of current and past quantization techniques, providing a more objective way to rank different methodologies in the literature. Thirdly, to avoid compromising quantization speed due to the overhead of the **scanning** process, we developed a specialized fused GPU kernel (k-OOC) that is $9\times$ faster than the original GPTQ implementation (single-flow mode) and $22\times$ faster than a naive OOC implementation (double-flow mode), thanks to techniques we call **Row-Flow-Selection Parallel** and **Input Batching**. We measured the perplexity of k-OOC (2 bits) on 14 major models, including OPT, LLaMA, and BLOOM (125M to 70B parameters), and popular datasets (WikiText2, C4, and PTB). We improved the perplexity of small models by 8.9\% and of large models by 4.1\% compared to the GPTQ baseline (2 bits).
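The abstract does not spell out the OOC selection rule, so the following is only a minimal NumPy sketch of the general idea it describes: per-group low-bit quantization with one "odd" group singled out during a scanning pass and kept at higher precision. The function names (`quantize_group`, `ooc_quantize_row`), the error-based selection criterion, and the bit widths are assumptions for illustration, not the authors' method or the k-OOC kernel.

```python
# Illustrative sketch only -- the paper does not publish the OOC rule here.
# Assumption: the "odd" group is the weight group whose low-bit round-trip
# error is largest; it is re-quantized at higher precision while all other
# groups stay at the low bit width.
import numpy as np

def quantize_group(w, bits):
    """Uniform round-to-nearest quantization of one weight group."""
    levels = 2 ** bits - 1
    scale = (w.max() - w.min()) / max(levels, 1)
    if scale == 0:
        return w.copy()
    q = np.round((w - w.min()) / scale)
    return q * scale + w.min()

def ooc_quantize_row(row, group_size=128, low_bits=2, odd_bits=8):
    """Scan all groups in a weight row, pick the one with the worst low-bit
    reconstruction error (the hypothetical "odd" group), and re-quantize it
    at higher precision."""
    groups = row.reshape(-1, group_size)
    deq = np.stack([quantize_group(g, low_bits) for g in groups])
    errs = ((deq - groups) ** 2).sum(axis=1)   # per-group reconstruction error
    odd = int(errs.argmax())                   # the "odd one" to clear
    deq[odd] = quantize_group(groups[odd], odd_bits)
    return deq.reshape(-1)

row = np.random.randn(1024).astype(np.float32)
print(np.abs(ooc_quantize_row(row) - row).mean())  # mean absolute error
```

In the paper's framing, this scanning pass is the overhead that the fused k-OOC kernel amortizes on the GPU via Row-Flow-Selection Parallel and Input Batching; the sketch above only shows the sequential logic.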
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8918