Price of Efficiency: Interpreting the Effects of Quantization on LLMs

ACL ARR 2025 February Submission 1148 Authors

12 Feb 2025 (modified: 09 May 2025) · CC BY 4.0
Abstract: Quantization offers a practical solution for deploying LLMs in resource-constrained environments. However, its effect on internal representations is understudied, which raises questions about its reliability. In this study, we use a range of interpretation techniques to explore the effects of quantization on model and neuron behavior. We investigate two LLMs, Phi-2 and Llama-2-7b, under 4-bit and 8-bit quantization. Our findings reveal several important insights. First, 4-bit quantized models exhibit slightly better calibration than 8-bit and 16-bit models. Second, our analysis of neuron activations indicates that the number of dead neurons, i.e., those with activation values close to 0 across the dataset, remains consistent regardless of quantization. Regarding the contribution of neurons to model predictions, we observe that full-precision models have fewer salient neurons overall. The effect of quantization on neuron redundancy varies across models: in Llama-2-7b, we observed minimal variation in neuron redundancy across quantization levels, whereas Phi-2 exhibited higher redundancy at full precision than in its quantized counterparts. These findings suggest that quantization is a viable approach for the efficient and reliable deployment of LLMs in resource-constrained environments.
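
Illustrative sketch (not the authors' code): the abstract combines two ingredients, running a model at 4-bit or 8-bit precision and flagging "dead" neurons whose activations stay near zero across a dataset. The sketch below shows one way to do this with Hugging Face transformers and bitsandbytes; the checkpoint name, the "mlp.act_fn" module path (Llama-style MLPs), the near-zero threshold, and the toy prompt list are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch: load an LLM with 4-bit quantization and count near-zero
# ("dead") MLP neurons over a small set of prompts. All specifics below
# (checkpoint, module path, threshold, prompts) are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"                 # assumed checkpoint
bnb_config = BitsAndBytesConfig(load_in_4bit=True)    # or load_in_8bit=True

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Track the maximum absolute activation each MLP neuron produces over the data.
max_abs = {}

def make_hook(name):
    def hook(_module, _inputs, output):
        # output shape: (batch, seq_len, intermediate_size); reduce over tokens.
        cur = output.detach().abs().amax(dim=(0, 1)).float().cpu()
        max_abs[name] = torch.maximum(max_abs.get(name, cur), cur)
    return hook

handles = [
    module.register_forward_hook(make_hook(name))
    for name, module in model.named_modules()
    if name.endswith("mlp.act_fn")   # Llama-style MLP activation; model-specific
]

prompts = ["The capital of France is", "Quantization reduces"]  # stand-in dataset
with torch.no_grad():
    for p in prompts:
        model(**tokenizer(p, return_tensors="pt").to(model.device))

for h in handles:
    h.remove()

threshold = 1e-4   # "close to 0" cutoff is an assumption
dead = sum(int((v < threshold).sum()) for v in max_abs.values())
total = sum(v.numel() for v in max_abs.values())
print(f"dead neurons: {dead}/{total}")
```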
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: calibration/uncertainty, probing
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1148