Abstract: Large language models (LLMs) have revolutionized natural language processing, albeit at the cost of immense memory and computation requirements. Post-training quantization (PTQ) is becoming the de facto method to reduce the memory footprint and improve the inference throughput of LLMs.
In this work, we aim to push the boundary of LLM PTQ by optimizing the weight rounding parameters with block reconstruction, a technique that has been predominant in PTQ for vision models.
We propose TesseraQ, an advanced PTQ technique, to quantize the weights of LLMs to ultra-low bits.
To effectively optimize the rounding in LLMs and stabilize the reconstruction process, we introduce progressive adaptive rounding, which iteratively transitions the soft rounding variables into hard ones during reconstruction. Additionally, we optimize the dequantization scale parameters to fully leverage the block reconstruction technique.
We demonstrate that TesseraQ can be seamlessly integrated with existing transformation-based PTQ algorithms such as AWQ/OmniQuant/QuaRot, significantly enhancing their performance.
For instance, when compared to AWQ, TesseraQ improves the Wikitext2 perplexity from 14.65 to 6.82 in 2-bit weight quantization.
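As a minimal illustrative sketch (assuming a PyTorch setup, a simple weight-reconstruction proxy loss in place of the paper's block-output reconstruction objective, and hypothetical names such as `hard_mask` and the hardening schedule, none of which are taken from the authors' implementation), progressively converting soft rounding variables into hard ones during reconstruction might look like this:

```python
import torch

def dequantize(w, scale, h):
    """Dequantize weights with rounding decisions h in [0, 1] (soft) or {0, 1} (hard)."""
    return (torch.floor(w / scale) + h) * scale

w = torch.randn(4, 8)                                    # toy weight block
scale = torch.full((4, 1), 0.05, requires_grad=True)     # learnable dequantization scale
v = torch.zeros_like(w, requires_grad=True)              # soft rounding logits
hard_mask = torch.zeros_like(w, dtype=torch.bool)        # entries already hardened
hard_value = torch.zeros_like(w)                         # their frozen 0/1 decisions

opt = torch.optim.Adam([v, scale], lr=1e-3)
n_rounds, steps_per_round = 10, 50

for r in range(n_rounds):
    for _ in range(steps_per_round):
        soft = torch.sigmoid(v)
        h = torch.where(hard_mask, hard_value, soft)      # frozen decisions stay hard
        w_q = dequantize(w, scale, h)
        loss = (w_q - w).pow(2).mean()                    # proxy for block reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Progressively harden the most confident remaining soft variables
    # (hypothetical linear schedule over the rounds).
    with torch.no_grad():
        conf = (torch.sigmoid(v) - 0.5).abs()
        conf[hard_mask] = -1.0                            # exclude already-hardened entries
        target = int(w.numel() * (r + 1) / n_rounds)
        k = target - int(hard_mask.sum())
        if k > 0:
            idx = conf.flatten().topk(k).indices
            hard_mask.view(-1)[idx] = True
            hard_value.view(-1)[idx] = (torch.sigmoid(v).view(-1)[idx] > 0.5).float()
```

After the final round every rounding variable is hard, so the quantized weights can be stored directly; in the method described above, the same loop would be driven by the block-level reconstruction error on calibration data rather than this toy weight-matching loss.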
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: quantization, NLP in resource-constrained settings
Contribution Types: Approaches to low-resource settings
Languages Studied: English
Submission Number: 4751