Keywords: Quantization, Mixed Precision, Interpretability
TL;DR: Using interpretability-informed saliency scores based on task-specific information to localize important weights to preserve during model compression, yielding a SOTA method for both general-purpose and task-specific quantization
Abstract: Post-training quantization reduces a model's memory footprint by mapping full-precision weights into low-bit weights without costly retraining, but can degrade its downstream performance, especially in low 2- to 3-bit settings. Existing methods mitigate these drops by keeping some important weights in higher precision; we develop a new mixed-precision approach, Task-Circuit Quantization (TCQ), that directly conditions the quantization process on specific circuits -- which we define as sets of weights associated with downstream task performance. TCQ draws parallels to automated circuit discovery, introducing a novel method to identify a small number of key weights that are particularly important to task performance; these weights are kept in 16-bit precision while the others are quantized, maintaining performance at only a marginal memory cost. Specifically, TCQ contrasts unquantized model weights with a uniformly-quantized model to estimate the expected change in weights due to quantization and uses gradient information to predict the resulting impact on task performance, allowing us to preserve task-specific weights. We compare TCQ-based quantization to existing mixed-precision quantization methods and GPTQ when conditioning both on general-purpose and task-specific data. Across QA, math reasoning, and text-to-SQL tasks, and for both Llama-3 and Qwen2.5 models, we find that TCQ outperforms baselines like SPQR and Slim-LLM using the same calibration data and a lower weight budget, achieving major improvements in the 2- and 3-bit regimes. With only 3.1 bits, we recover 97% of the model's unquantized 16-bit MMLU performance, obtaining a 5.25% absolute improvement over SPQR. Furthermore, we observe consistently large gains over existing methods in the 2-bit regime, with an average gain of 14.74% over the strongest baseline, Slim-LLM. Code: [https://github.com/The-Inscrutable-X/TACQ](https://github.com/The-Inscrutable-X/TACQ)
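To make the weight-selection idea in the abstract concrete, below is a minimal sketch of the contrast-and-gradient saliency it describes, assuming a per-weight score of |W_quantized - W_fp16| * |gradient| and a hypothetical `keep_ratio` parameter; this is an illustration of the general idea, not the authors' implementation (see the linked repository for the actual TCQ code).

```python
import torch

def tcq_saliency_mask(weight_fp16: torch.Tensor,
                      weight_quantized: torch.Tensor,
                      grad: torch.Tensor,
                      keep_ratio: float = 0.003) -> torch.Tensor:
    """Hypothetical sketch: score each weight by its expected change under
    uniform quantization times the task-gradient magnitude, then mark the
    top fraction to be preserved in 16-bit precision."""
    # Expected change in each weight if this tensor were uniformly quantized.
    delta = (weight_quantized - weight_fp16).abs()
    # First-order estimate of the task-loss impact of that change.
    saliency = delta * grad.abs()
    # Keep the highest-saliency weights in full precision (boolean mask).
    k = max(1, int(keep_ratio * saliency.numel()))
    threshold = saliency.flatten().topk(k).values.min()
    return saliency >= threshold
```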
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1441