TAMMP: An Effective Mixed-Precision Quantization Method for Enhancing Model Generalization in Low-Bit Scenarios

ACL ARR 2026 January Submission 7124 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Quantization, Mixed-Precision, Low-Bit LLM, Bit-Width Allocation
Abstract: Large Language Models (LLMs) have achieved remarkable progress across a wide range of domains. However, their substantial memory and computational demands during deployment and inference hinder adoption, especially on edge devices. While quantization is a widely adopted compression technique that has been applied successfully in many scenarios, models quantized to low bit-widths suffer significant performance degradation compared to their full-precision counterparts. Furthermore, the magnitude of this degradation varies considerably across tasks. To investigate the underlying causes, this paper formally introduces the notion of critical parameter heterogeneity and hypothesizes that this heterogeneity is a primary driver of the non-uniform performance degradation observed across diverse downstream tasks. To address this, we propose Task-Adaptive Multi-Granularity Mixed-Precision training (TAMMP), which combines critical parameter probing based on multi-source knowledge, multi-granular bit-width allocation, and a mixed-precision training framework. Comprehensive evaluations demonstrate that TAMMP achieves superior generalization compared to existing low-bit quantization approaches.
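
The abstract names TAMMP's three components only at a high level. As a minimal sketch of the general idea, assuming a simple magnitude-based criticality proxy and a two-level (4-bit / 8-bit) allocation, the Python snippet below shows what criticality-driven bit-width allocation with per-channel fake quantization could look like. All names here (criticality_score, allocate_bits, fake_quantize) and the proxy itself are illustrative assumptions, not the paper's actual probing or allocation rules.

```python
# Illustrative sketch only: the paper's multi-source probing and allocation
# rules are not specified in the abstract. This assumes a per-channel L2-norm
# criticality proxy and a fixed two-level bit-width split.
import numpy as np

def criticality_score(weights: np.ndarray) -> np.ndarray:
    # Per-output-channel L2 norm as a stand-in criticality proxy.
    return np.linalg.norm(weights, axis=1)

def allocate_bits(scores: np.ndarray, high_frac: float = 0.25,
                  high_bits: int = 8, low_bits: int = 4) -> np.ndarray:
    # Assign the higher bit-width to the top `high_frac` most critical channels.
    n_high = max(1, int(high_frac * len(scores)))
    bits = np.full(len(scores), low_bits)
    bits[np.argsort(scores)[-n_high:]] = high_bits
    return bits

def fake_quantize(weights: np.ndarray, bits: np.ndarray) -> np.ndarray:
    # Symmetric per-channel uniform quantize-dequantize ("fake" quantization),
    # as typically used inside quantization-aware training loops.
    out = np.empty_like(weights)
    for i, (row, b) in enumerate(zip(weights, bits)):
        qmax = 2 ** (int(b) - 1) - 1
        max_abs = np.abs(row).max()
        scale = max_abs / qmax if max_abs > 0 else 1.0
        out[i] = np.clip(np.round(row / scale), -qmax - 1, qmax) * scale
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(16, 64))            # toy weight matrix
    bits = allocate_bits(criticality_score(W))
    W_q = fake_quantize(W, bits)
    print("bit-widths:", bits)
    print("mean abs quantization error:", np.abs(W - W_q).mean())
```

In a training setup of the kind the abstract describes, such a quantize-dequantize step would sit inside the forward pass so that gradients adapt the weights to their assigned precisions; the allocation here is static, whereas TAMMP's is described as task-adaptive and multi-granular.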
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches to low-resource settings
Languages Studied: English
Submission Number: 7124