Keywords: Post-Training Quantization, Diffusion Model Drift, Diffusion Models
TL;DR: Adapting the sampling scheduler for quantized few-step diffusion models.
Abstract: Text-to-image diffusion models remain computationally intensive: generating a single image typically requires dozens of passes through large transformer backbones (e.g., SDXL uses ~50 evaluations of a 2.6B-parameter model). Few-step variants reduce the step count to 2–8, but still rely on large, full-precision backbones, making inference impractical on resource-constrained platforms — both on-device (latency and energy constraints) and in data centers using multi-instance GPU (MIG) partitioning with limited memory and throughput per slice. Existing post-training quantization (PTQ) methods are further limited by their dependence on full-precision calibration.
We introduce Q-Sched, a scheduler-level PTQ approach that adapts the diffusion sampler while keeping quantized weights fixed. By adjusting the few-step sampling trajectory with quantization-aware preconditioning coefficients, Q-Sched matches or surpasses full-precision quality while delivering a 4× reduction in model size and preserving a reusable checkpoint across bit-widths. To learn these coefficients, we propose a reference-free Joint Alignment–Quality (JAQ) loss that combines text–image compatibility with an image-quality objective for fine-grained control. JAQ requires only a small set of calibration prompts and avoids full-precision inference during calibration.
Empirically, Q-Sched achieves substantial gains: a 15.5% FID improvement over the FP16 4-step Latent Consistency Model and a 16.6% improvement over the FP16 8-step Phased Consistency Model, demonstrating that quantization and few-step distillation are complementary for high-fidelity generation. A large-scale user study with 80,000+ annotations further validates these results on both FLUX.1[schnell] and SDXL-Turbo. Code will be released.
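To make the calibration idea described in the abstract concrete, below is a minimal, self-contained sketch of optimizing per-step preconditioning coefficients for a frozen few-step sampler with a reference-free loss. Everything here is illustrative: the Euler-style update, the frozen MLP standing in for the quantized backbone, the toy alignment and quality terms, the 0.5 weighting, and names such as `few_step_sample` and `jaq_loss` are assumptions for exposition, not Q-Sched's actual components.

```python
import torch
import torch.nn as nn

# Toy sketch (assumptions throughout): learnable per-step preconditioning
# coefficients rescale each denoising update of a frozen sampler and are
# optimized with a reference-free loss. The denoiser, scheduler step, and
# loss below are stand-ins, not the paper's actual components.

torch.manual_seed(0)
NUM_STEPS = 4

# Frozen stand-in for the quantized backbone (its weights are never updated).
denoiser = nn.Sequential(nn.Linear(16, 64), nn.SiLU(), nn.Linear(64, 16))
for p in denoiser.parameters():
    p.requires_grad_(False)

# One learnable coefficient per sampling step, initialized to 1 (unchanged scheduler).
coeffs = nn.Parameter(torch.ones(NUM_STEPS))
optimizer = torch.optim.Adam([coeffs], lr=1e-2)

def few_step_sample(x, coeffs):
    """Euler-style few-step sampler whose update at step t is rescaled by coeffs[t]."""
    for t in range(NUM_STEPS):
        eps_hat = denoiser(x)                     # frozen "quantized" prediction
        x = x - coeffs[t] * eps_hat / NUM_STEPS   # coefficient-adjusted update
    return x

def jaq_loss(samples, prompt_embs):
    """Toy stand-in for a joint alignment-quality objective: an alignment term
    (cosine similarity to a prompt embedding) plus a quality term (penalizing
    overly large activations); the 0.5 weighting is a placeholder."""
    align = torch.nn.functional.cosine_similarity(samples, prompt_embs, dim=-1).mean()
    quality = -samples.pow(2).mean()
    return -(align + 0.5 * quality)

# Calibration loop over a small batch of "prompt" embeddings;
# no full-precision model is queried at any point.
prompt_embs = torch.randn(8, 16)
for step in range(100):
    noise = torch.randn(8, 16)
    samples = few_step_sample(noise, coeffs)
    loss = jaq_loss(samples, prompt_embs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("learned per-step coefficients:", coeffs.data)
```

The only trainable parameters are the per-step coefficients, so calibration touches neither the quantized weights nor any full-precision reference model, mirroring the scheduler-level setup described above.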
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 88