Achieving binary weight and activation for LLMs using Post-Training Quantization
Abstract: Quantizing large language models (LLMs) to
1-bit precision significantly reduces computational costs, but existing quantization techniques suffer from noticeable performance
degradation when using weight and activation
precisions below 4 bits (W4A4). In this paper,
we propose a post-training quantization framework with a W(1+1)A(1×4) configuration, where
weights are quantized to 1 bit with an additional
1 bit for fine-grained grouping, and activations
are quantized to 1 bit with a 4-fold increase
in the number of channels. For weight quantization, we propose utilizing Hessian-aware
fine-grained grouping along with an EM-based
quantization scheme. For activation quantization, we decompose INT4-quantized activations into an equivalent 4 × INT1 format
and simultaneously smooth the scaling factors
based on quantization errors, which further
reduces the activation quantization error.
Our method surpasses state-of-the-art (SOTA)
LLM quantization baselines on W2A4 across
multiple tasks, pushing the boundaries of existing LLM quantization methods toward fully
binarized models.
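
The equivalence behind the 4 × INT1 activation format can be illustrated with a small sketch. It is not the paper's implementation: it assumes unsigned INT4 activations in the range 0 to 15, ignores scaling factors, zero points, and the smoothing step, and folds the powers of two into replicated weight copies purely for illustration. The function names `decompose_int4_to_bits` and `expand_weights` are hypothetical.

```python
import numpy as np

def decompose_int4_to_bits(x_q):
    """Split unsigned INT4 values (0..15) into 4 binary planes.

    The planes are concatenated along the channel axis, so an input
    with C channels becomes a {0, 1} tensor with 4*C channels.
    """
    x_q = np.asarray(x_q, dtype=np.int64)
    planes = [(x_q >> k) & 1 for k in range(4)]  # bit k of every value
    return np.concatenate(planes, axis=-1)

def expand_weights(w):
    """Stack 2^k * W for k = 0..3 along the input-channel axis.

    Assumption for illustration only: the bit weights 2^k are folded
    into the weight copies so that bits @ W_exp equals x_q @ W.
    """
    return np.concatenate([(1 << k) * w for k in range(4)], axis=0)

rng = np.random.default_rng(0)
x_q = rng.integers(0, 16, size=(3, 8))   # INT4-quantized activations
w = rng.choice([-1, 1], size=(8, 5))     # binary (+1/-1) weights

bits = decompose_int4_to_bits(x_q)       # shape (3, 32), values in {0, 1}
w_exp = expand_weights(w)                # shape (32, 5)

# The 1-bit, 4x-channel product reproduces the INT4 product exactly.
assert np.array_equal(bits @ w_exp, x_q @ w)
```

Because every INT4 value is the sum of its bits weighted by powers of two, the decomposition is lossless; any accuracy difference in a full pipeline would come from how the scaling factors are handled, which this sketch leaves out.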