Q-Sched: Pushing the Boundaries of Few-Step Diffusion Models with Quantization-Aware Scheduling

ICLR 2026 Conference Submission 14057 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Quantization, Diffusion Models, Diffusions, Text-to-Image, Compression, Model Compression, Image Quality Assessment
TL;DR: Scheduler Adaptation for Quantized Few-Step Diffusions
Abstract: Text-to-image diffusion models remain computationally intensive: generating a single image typically requires dozens of passes through large transformer backbones (for example, SDXL uses about 50 evaluations of a 2.6B-parameter model). Few-step variants reduce the step count to 2–8, but still rely on large, full-precision U-Net/DiT backbones, making inference impractical on resource-constrained platforms, both on-device (latency/energy) and in data centers with multi-instance GPU (MIG)-style partitioning (limited memory/throughput per slice). Existing post-training quantization (PTQ) methods are further hampered by dependence on full-precision calibration. We introduce Q-Sched, a scheduler-level PTQ approach that adapts the diffusion sampler rather than the model weights. By adjusting the few-step sampling trajectory with quantization-aware preconditioning coefficients, Q-Sched matches or surpasses full-precision quality while delivering a 4× reduction in model size and preserving a single reusable checkpoint across bit-widths. To learn these coefficients, we propose a reference-free Joint Alignment–Quality (JAQ) loss, which combines text–image compatibility with an image-quality objective for fine-grained control; JAQ requires only a handful of calibration prompts and avoids any full-precision inference during calibration. Empirically, Q-Sched yields substantial gains: a 15.5% FID improvement over the FP16 4-step Latent Consistency Model and a 16.6% improvement over the FP16 8-step Phased Consistency Model, demonstrating that quantization and few-step distillation are complementary for high-fidelity generation. A large-scale user study with more than 80,000 annotations further validates these results on both FLUX.1[schnell] and SDXL-Turbo. Code will be released.
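The abstract describes the method only at a high level. Below is a minimal sketch of the two ideas, assuming per-step scalar scales on the sampler's input and output, a diffusers-style scheduler interface, and placeholder `clip_score_fn` / `iqa_fn` scorers; all names and the `alpha` weighting are illustrative assumptions, not the paper's actual parameterization.

```python
# Illustrative sketch only: QSchedCoefficients, scale_in/scale_out,
# clip_score_fn, iqa_fn, and alpha are assumed names, not the paper's
# actual parameterization. Assumes a diffusers-style scheduler whose
# .step(...) returns an object with a .prev_sample attribute.
import torch


class QSchedCoefficients(torch.nn.Module):
    """Learnable per-step preconditioning for a quantized few-step sampler."""

    def __init__(self, num_steps: int):
        super().__init__()
        # Identity initialization: the sampler starts at the baseline schedule.
        self.scale_in = torch.nn.Parameter(torch.ones(num_steps))
        self.scale_out = torch.nn.Parameter(torch.ones(num_steps))


def sample_few_step(quantized_model, scheduler, coeffs, latents, prompt_emb, timesteps):
    """Few-step sampling with the trajectory adjusted by learned coefficients.

    Only the sampler inputs/outputs are rescaled; the low-bit backbone's
    weights are untouched, so a single checkpoint can serve all bit-widths.
    """
    for i, t in enumerate(timesteps):
        model_in = coeffs.scale_in[i] * latents
        noise_pred = quantized_model(model_in, t, prompt_emb)
        latents = scheduler.step(coeffs.scale_out[i] * noise_pred, t, latents).prev_sample
    return latents


def jaq_loss(images, prompts, clip_score_fn, iqa_fn, alpha=0.5):
    """Reference-free Joint Alignment-Quality (JAQ) loss, sketched.

    Combines a text-image compatibility score with a no-reference
    image-quality score; no full-precision inference is required.
    """
    alignment = clip_score_fn(images, prompts)  # higher = better aligned
    quality = iqa_fn(images)                    # higher = better quality
    return -(alpha * alignment + (1.0 - alpha) * quality).mean()
```

Under these assumptions, calibration would generate images from a handful of prompts via `sample_few_step`, backpropagate `jaq_loss` into the per-step coefficients, and never query a full-precision model, consistent with the abstract's reference-free claim.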
Primary Area: generative models
Submission Number: 14057