Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs

ACL ARR 2026 January Submission 4313 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: extreme quantization, model smoothness, input gradient
Abstract: Large language models (LLMs) achieve strong performance but incur high deployment costs, motivating extreme low-bit quantization, which is inherently lossy. Existing quantization algorithms mainly focus on improving the numerical accuracy of the forward computation to eliminate performance degradation. In this paper, we show that models also suffer from systematic smoothness degradation under extreme quantization, which cannot be explained by numerical accuracy alone. We confirm that input gradients serve as an effective proxy for smoothness in transformer-based LLMs and use this metric to reveal limitations in both post-training quantization and quantization-aware training. Based on this analysis, we propose a smoothness-preserving principle that maintains gradient propagation during quantization. Experiments across multiple models and tasks demonstrate that preserving smoothness yields benefits beyond numerical accuracy. Our study highlights smoothness as an important design consideration for future extreme quantization methods.
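The abstract describes input gradients as a proxy for model smoothness. The paper's exact metric is not specified here; the following is only a minimal sketch, assuming a Hugging Face causal LM, where smoothness is probed via the norm of the loss gradient with respect to the input embeddings (the model name and text below are placeholders, not from the submission).

```python
# Illustrative sketch: input-gradient norm as a rough smoothness proxy.
# Assumptions: PyTorch + transformers; "gpt2" stands in for any (quantized) causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in a quantized checkpoint to compare
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Large language models achieve strong performance."
inputs = tokenizer(text, return_tensors="pt")

# Embed the tokens manually so gradients can flow back to the inputs.
embeds = model.get_input_embeddings()(inputs["input_ids"]).detach()
embeds.requires_grad_(True)

# Language-modeling loss on the same tokens; backward gives d(loss)/d(embeds).
outputs = model(inputs_embeds=embeds, labels=inputs["input_ids"])
outputs.loss.backward()

# A larger input-gradient norm indicates a less smooth, more input-sensitive model.
grad_norm = embeds.grad.norm().item()
print(f"input-gradient norm: {grad_norm:.4f}")
```

Comparing this quantity between a full-precision model and its extremely quantized counterpart, averaged over a set of prompts, is one plausible way to observe the smoothness degradation the abstract refers to.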
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: LLM efficiency, model quantization, weight quantization
Contribution Types: Model analysis & interpretability, Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 4313