Abstract: RWKV is a modern RNN architecture with performance comparable to Transformers, but it still faces challenges when deployed to resource-constrained devices.
Post-Training Quantization (PTQ), an essential technique for reducing model size and inference latency, has been widely used in Transformer models.
However, it suffers significant performance degradation when applied to RWKV.
This paper investigates and identifies two key constraints inherent in the properties of RWKV: (1) non-linear operators hinder the parameter fusion of both smooth- and rotation-based quantization, introducing extra computation overhead; (2) the larger proportion of uniformly distributed weights poses challenges for cluster-based quantization, leading to reduced accuracy.
To this end, we propose RWKVQuant, a PTQ framework tailored for RWKV models, consisting of two novel techniques: (1) a coarse-to-fine proxy capable of adaptively selecting different quantization approaches by assessing the uniformity and identifying outliers in the weights, and (2) a codebook optimization algorithm that enhances the performance of cluster-based quantization methods for element-wise multiplication in RWKV.
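To make the selection idea concrete, the sketch below shows one way such a coarse-to-fine proxy could work: normalized histogram entropy serves as the coarse uniformity proxy and excess kurtosis (a higher-order moment) as the fine outlier proxy. The function name, thresholds, and bin count are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def select_quantizer(weight, entropy_thresh=0.9, kurtosis_thresh=5.0, n_bins=256):
    """Hypothetical coarse-to-fine proxy sketch (thresholds are illustrative).

    Coarse stage: normalized histogram entropy estimates how uniform the
    weight distribution is. Fine stage: excess kurtosis flags heavy-tailed
    outliers that entropy alone misses. Near-uniform, outlier-free weights
    go to scalar quantization; otherwise cluster-based (vector) quantization.
    """
    w = np.asarray(weight, dtype=float).ravel()

    # Coarse proxy: normalized entropy of the weight histogram (1.0 = uniform).
    hist, _ = np.histogram(w, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum() / np.log(n_bins)
    if entropy < entropy_thresh:
        return "vector"  # clearly non-uniform: cluster-based quantization fits well

    # Fine proxy: excess kurtosis detects outliers in otherwise uniform weights.
    centered = w - w.mean()
    kurtosis = (centered**4).mean() / (centered.var() ** 2 + 1e-12) - 3.0
    return "scalar" if kurtosis < kurtosis_thresh else "vector"
```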
Experiments show that RWKVQuant can quantize RWKV-6-14B to about 3 bits with less than 1\% accuracy loss and a 2.14$\times$ speedup.
Lay Summary: RWKV, a modern RNN-Transformer hybrid architecture with performance comparable to Transformers, faces challenges in edge deployment due to its large size and high computational cost. Existing post-training quantization (PTQ) methods, which are effective for Transformers, underperform on RWKV for two key reasons: its non-linear operations (such as Token Shift and Sigmoid) disrupt parameter fusion, and its uniformly distributed weights undermine cluster-based quantization. To address this, we introduce RWKVQuant, the first PTQ framework for RWKV. It features a coarse-to-fine proxy that uses entropy and higher-order moments to adaptively select vector or scalar quantization by assessing weight uniformity and identifying outliers, together with a codebook optimization for element-wise multiplications based on activation-weighted KMeans. Experiments show that RWKVQuant quantizes RWKV-6-14B to about 3 bits with less than 1% accuracy loss, 2.14$\times$ faster inference, and roughly one third of the memory usage, outperforming GPTQ and AWQ on language and vision tasks, enabling practical edge deployment of RWKV and new lightweight solutions for large models.
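As a rough illustration of the codebook optimization idea, the sketch below runs a KMeans-style codebook update in which per-group importance scores derived from activation statistics reweight the centroid computation, pulling codewords toward weights that matter most at inference. The function name, shapes, and hyperparameters are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def weighted_kmeans_codebook(weight_groups, act_importance, n_centroids=256, iters=20):
    """Hypothetical activation-weighted KMeans sketch.

    weight_groups: (N, d) array of weight sub-vectors to be replaced by codewords.
    act_importance: (N,) non-negative importance scores from activation statistics.
    Returns the learned codebook and the per-group assignment indices.
    """
    rng = np.random.default_rng(0)
    groups = np.asarray(weight_groups, dtype=float)
    codebook = groups[rng.choice(len(groups), n_centroids, replace=False)]
    w = np.asarray(act_importance, dtype=float)[:, None]  # per-group weights

    for _ in range(iters):
        # Assign each sub-vector to its nearest codeword (squared Euclidean distance).
        dists = ((groups[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)

        # Importance-weighted centroid update.
        for k in range(n_centroids):
            mask = assign == k
            if mask.any():
                codebook[k] = (w[mask] * groups[mask]).sum(0) / w[mask].sum()

    return codebook, assign
```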
Primary Area: Deep Learning->Large Language Models
Keywords: RWKV, Quantization
Submission Number: 5419