Mixed INT4-INT8 LLM Quantization via Progressive Layerwise Assignment with Dynamic Sensitivity Estimation

Published: 2025 · Last Modified: 07 Nov 2025 · ISCAS 2025 · CC BY-SA 4.0
Abstract: Quantization is essential for optimizing large language models (LLMs), reducing both memory usage and computational load. However, traditional low-bit quantization applied uniformly across all layers can significantly degrade accuracy, especially on modern hardware architectures. We introduce a novel, adaptive quantization strategy based on Layer Sensitivity, which assigns a bit-width to each layer according to its sensitivity to quantization. The method comprises Static Sensitivity Estimation, a single-pass sensitivity measurement that ranks layers by their quantization tolerance, and Dynamic Sensitivity Estimation, which iteratively re-evaluates layer sensitivity after each quantization step to ensure optimal bit-width allocation. Evaluated on models such as GPT-2, OPT, and LLaMA-3.2, our approach minimizes accuracy loss and outperforms uniform quantization methods. This scalable solution effectively balances computational efficiency and accuracy, offering a robust path toward deploying low-precision LLMs.
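
The abstract does not give implementation details, but the dynamic (iterative) sensitivity loop it describes can be sketched as follows under stated assumptions: sensitivity is approximated as the output MSE on a calibration batch when a single layer is fake-quantized, layers are quantized greedily from least to most sensitive with re-ranking after every step, and the most sensitive layers retain INT8 while the rest drop to INT4. All names here (`fake_quantize`, `layer_sensitivity`, `dynamic_mixed_precision`, `int8_budget`) are illustrative and not taken from the paper.

```python
# Minimal sketch of iterative mixed INT4/INT8 assignment via dynamic
# sensitivity re-estimation. Assumptions are noted above; this is not
# the paper's reference implementation.
import torch
import torch.nn as nn


def fake_quantize(weight: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-tensor fake quantization to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().max().clamp(min=1e-8) / qmax
    return (weight / scale).round().clamp(-qmax - 1, qmax) * scale


def layer_sensitivity(model: nn.Module, name: str, calib: torch.Tensor,
                      ref_out: torch.Tensor, bits: int = 4) -> float:
    """Output MSE when only layer `name` is fake-quantized (sensitivity proxy)."""
    layer = dict(model.named_modules())[name]
    original = layer.weight.data.clone()
    layer.weight.data = fake_quantize(original, bits)
    with torch.no_grad():
        err = torch.mean((model(calib) - ref_out) ** 2).item()
    layer.weight.data = original  # restore full precision for this layer
    return err


def dynamic_mixed_precision(model: nn.Module, calib: torch.Tensor,
                            int8_budget: int) -> dict:
    """Quantize layers one at a time, re-ranking sensitivities after each step."""
    with torch.no_grad():
        ref_out = model(calib)  # full-precision reference outputs
    remaining = [n for n, m in model.named_modules() if isinstance(m, nn.Linear)]
    assignment = {}
    while remaining:
        # Dynamic step: re-estimate sensitivity of every not-yet-quantized layer
        # with the already-quantized layers left in place.
        scores = {n: layer_sensitivity(model, n, calib, ref_out) for n in remaining}
        # Quantize the least sensitive remaining layer next; the most sensitive
        # layers are left for last and receive the INT8 budget.
        name = min(scores, key=scores.get)
        bits = 8 if len(remaining) <= int8_budget else 4
        layer = dict(model.named_modules())[name]
        layer.weight.data = fake_quantize(layer.weight.data, bits)
        assignment[name] = bits
        remaining.remove(name)
    return assignment


if __name__ == "__main__":
    torch.manual_seed(0)
    toy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
    calib_batch = torch.randn(8, 16)
    print(dynamic_mixed_precision(toy, calib_batch, int8_budget=1))
```

The static variant would compute `scores` once over all layers and assign bit-widths from that single ranking; the loop above differs only in re-running the sensitivity measurement after each quantization step.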