Keywords: post training quantization, transformer, activation function
TL;DR: Quantize computation when the pre-activation input is negative
Abstract: Large batch sizes in transformer-based language and vision applications mean that inference performance is increasingly bottlenecked by linear-layer computation, and weight-only quantization only exacerbates this compute bottleneck.
While full 4-bit weight and activation post-training quantization with no loss in model quality remains an open challenge, we offer a novel approach that selects, per channel, between quantized W4A4 and W8A8 computation.
We observe that the gradients of common transformer activation functions (ReLU, GELU, SiLU) are small when their inputs are negative, which means that quantization error in negative pre-activation inputs results in only small output error.
Exploiting this insight, we propose Activation Function Informed Quantization (AFIQ), which samples dot-product partial products on a single calibration example to determine which channels to quantize for all future model inference.
We implement a mixed-precision linear-layer kernel in CUDA to evaluate latency, and we find that AFIQ linear layers are 17% faster than the baseline with negligible loss in model quality.
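The central observation, that activation functions with small gradients on negative inputs absorb pre-activation quantization error, can be checked numerically. The sketch below is our own illustration, not the paper's code: it perturbs a GELU input by a fixed amount (standing in for quantization noise) and compares the resulting output error on the negative versus the positive side.

```python
import math

def gelu(x: float) -> float:
    """Exact (erf-based) GELU: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# eps stands in for quantization error on a pre-activation input.
eps = 0.1

# Same input perturbation, applied on the negative and positive side.
neg_err = abs(gelu(-3.0 + eps) - gelu(-3.0))
pos_err = abs(gelu(3.0 + eps) - gelu(3.0))

# Where GELU's gradient is near zero (negative inputs), the output
# error is orders of magnitude smaller than where the gradient is
# near one (positive inputs).
print(f"output error at x=-3: {neg_err:.5f}")
print(f"output error at x=+3: {pos_err:.5f}")
```

In this toy check the output error at x = -3 is roughly two orders of magnitude smaller than at x = +3, matching the intuition that channels with predominantly negative pre-activations tolerate coarser (W4A4) quantization.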
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 4315