CMQuant: A Quantization-Aware Parameter-Efficient Fine-Tuning Framework for 4-Bit Consistency Models

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Quantization, Consistency Models
Abstract: Consistency Models (CMs), built on diffusion models, use model state trajectory fitting to reduce the number of iterations required for sample generation. However, they still incur high per-iteration computational costs and have large parameter counts, which hinder deployment on resource-constrained devices. Quantization, an effective model compression technique with notable success in Large Language Models, remains largely unexplored for CMs. We observe that the unique characteristics of CMs pose significant obstacles to effective quantization. First, the trajectory fitting errors inherent to CMs accumulate across iterations and are further amplified when quantization errors are introduced. Second, CM training often relies on Low-Rank Adaptation (LoRA), which injects low-rank matrices into specific layers without altering pretrained weights. However, quantization errors not only disrupt this initialization consistency but also hinder training optimization and impair convergence. In this paper, we propose CMQuant, a novel quantization-aware parameter-efficient fine-tuning framework tailored for CMs. CMQuant introduces three innovations: (1) Trajectory Distillation with Phased Targets (TDPT), which assigns distinct optimization objectives to different stages of the trajectory, enabling accurate starting points for each stage and thereby minimizing accumulated quantization errors across iterations. (2) Hessian-Guided SVD-Initialized LoRA (HGS-LoRA), which leverages Hessian-guided matrix decomposition to initialize LoRA, directing weight updates along quantization-friendly paths and thereby reducing quantization errors. (3) Quantization-Aware Rank Adaptation (QRA), which assigns ranks adaptively based on the degree of variation in activations and weights across different CM layers, minimizing the impact of quantization without increasing the total number of LoRA parameters.
By integrating quantization into CM training, CMQuant achieves the first 4-bit quantization of CMs for both weights and activations. Experiments show that CMQuant outperforms the state of the art by at least 4.74/11.61/3.01 in FID↓/PickScore↑/IC↑ on FLUX. Furthermore, it improves throughput by 1.71×/3.43× on SDXL/FLUX, with only 27%/25% of the memory footprint.
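The abstract does not give the exact form of HGS-LoRA, but the general idea of SVD-initialized LoRA over a quantized weight can be illustrated with a minimal sketch. The snippet below is an assumption-laden simplification: it omits the Hessian guidance and uses a plain truncated SVD of the quantization residual (in the style of LoftQ-like initialization), so that the quantized weight plus the LoRA factors approximates the original weight at initialization; `quantize_4bit` and `svd_init_lora` are illustrative names, not the paper's API.

```python
import numpy as np

def quantize_4bit(w):
    # Simulated symmetric per-tensor 4-bit quantization.
    scale = np.abs(w).max() / 7.0  # map weights into the int4 range [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale  # dequantized (fake-quant) weight

def svd_init_lora(w, rank):
    # Initialize LoRA factors from the quantization residual via truncated SVD,
    # so that q + B @ A approximates the original weight w at initialization.
    q = quantize_4bit(w)
    residual = w - q
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    b = u[:, :rank] * s[:rank]  # (out_dim, rank), scaled by singular values
    a = vt[:rank, :]            # (rank, in_dim)
    return q, a, b

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
q, a, b = svd_init_lora(w, rank=8)
err_quant_only = np.linalg.norm(w - q)
err_with_lora = np.linalg.norm(w - (q + b @ a))
print(err_with_lora < err_quant_only)  # LoRA init absorbs part of the quantization error
```

In this sketch the truncated SVD captures the dominant directions of the quantization error, so the initialized adapter reduces the initial weight mismatch; the paper's Hessian-guided variant would weight that decomposition by loss curvature instead of treating all directions equally.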
Primary Area: generative models
Submission Number: 6967