Abstract: Quantizing large language models (LLMs) to 1-bit precision significantly reduces computational costs, but existing quantization techniques suffer from noticeable performance degradation when weight and activation precisions drop below 4 bits (W4A4). In this paper, we propose a post-training quantization framework tailored to LLMs that uses a novel W(1+1)A(1×4) configuration, where weights are quantized to 1 bit with an additional 1 bit for fine-grained grouping, and activations are quantized to 1 bit with a 4-fold increase in the number of channels. For weight quantization, we employ Hessian-aware fine-grained grouping together with an EM-based quantization scheme. For activation quantization, we equivalently decompose INT4-quantized activations into a 4 × INT1 format and simultaneously smooth the scaling factors based on quantization errors. This effectively reduces activation quantization error and enables the use of INT1 multiplication to accelerate overall inference. Our method surpasses state-of-the-art (SOTA) LLM quantization baselines on W2A4 across multiple tasks, pushing the boundaries of existing LLM quantization methods towards fully binarized models.
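The 4 × INT1 decomposition can be read as splitting each INT4 activation into its four bit-planes, so a matrix multiplication against binary weights becomes four INT1 multiplications recombined with power-of-two shifts. The NumPy sketch below is our own minimal illustration of this equivalence (not the paper's kernel); it assumes unsigned INT4 activations and weights in {0, 1} for simplicity, and the function name is ours.

```python
import numpy as np

# Illustrative only: split unsigned INT4 values into four binary bit-planes,
# so x = sum_i 2^i * plane_i with plane_i in {0, 1}.
def int4_to_bitplanes(x_int4: np.ndarray) -> list[np.ndarray]:
    return [(x_int4 >> i) & 1 for i in range(4)]

rng = np.random.default_rng(0)
x = rng.integers(0, 16, size=(2, 8))   # INT4 activations, 8 channels
w = rng.integers(0, 2, size=(8, 4))    # 1-bit weights in {0, 1}

planes = int4_to_bitplanes(x)          # 4 x INT1, i.e. a 4-fold channel expansion
# Each (plane @ w) is an INT1 matmul; shifts recombine the bit-planes.
y_bitserial = sum((p @ w) << i for i, p in enumerate(planes))
assert np.array_equal(y_bitserial, x @ w)   # exact equivalence, no extra error
```

The equivalence holds exactly because the decomposition is lossless; the quantization error discussed in the abstract comes only from the INT4 activation quantization itself, not from the bit-plane split.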
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Quantization for LLMs; Low-Resource Method for LLMs
Contribution Types: Approaches to low-resource settings, Approaches to low compute settings-efficiency
Languages Studied: English
Submission Number: 8465