Keywords: Diffusion Transformer, Quantization
TL;DR: We propose ConvRot, a convolution-like group-wise rotation method with a plug-and-play module that enables efficient W4A4 quantization for diffusion models, achieving up to 4× DiT memory savings and 2× speedup while preserving image quality.
Abstract: Diffusion models have demonstrated strong capabilities in generating high-quality images. However, as model size increases, the growing memory footprint and inference latency pose significant challenges for practical deployment. Recent studies on large language models (LLMs) show that rotation-based techniques can smooth outliers and enable 4-bit quantization, but these approaches often incur substantial overhead and struggle with row-wise outliers in diffusion transformers. To address these challenges, we develop a theoretical framework: we define column discrepancy to quantify imbalance in Hadamard matrices, prove that regular Hadamard matrices attain minimal discrepancy, and provide a Kronecker-based construction for orders that are powers of four, effectively controlling both row- and column-wise outliers. Building on this framework, we propose ConvRot, a group-wise rotation-based quantization method that smooths outliers while reducing rotation cost from quadratic to linear complexity, and ConvLinear4bit, a plug-and-play module that fuses rotation, quantization, GEMM, and dequantization for W4A4 inference without retraining. Experiments on FLUX.1-dev achieve a 2.26$\times$ speedup and 4.05$\times$ memory reduction while preserving image quality.
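To make the two core ideas above concrete, the following is a minimal sketch (not the authors' implementation) of a Kronecker-based construction of a regular Hadamard matrix of order $4^k$ and of a group-wise, block-diagonal rotation whose per-token cost is linear in the hidden dimension for a fixed group size. It assumes NumPy; the function names `build_regular_hadamard` and `groupwise_rotate` and the group size of 64 are illustrative assumptions, not taken from the paper.

```python
# Sketch only: Kronecker construction of a regular Hadamard matrix of order 4**k
# and a group-wise rotation applied block-diagonally (illustrative names/values).
import numpy as np

def build_regular_hadamard(k: int) -> np.ndarray:
    """Regular Hadamard matrix of order 4**k via Kronecker powers.

    The order-4 seed has constant row/column sum 2; Kronecker products of
    regular Hadamard matrices remain regular, so the result has constant
    row/column sum 2**k = sqrt(4**k).
    """
    seed = np.ones((4, 4)) - 2.0 * np.eye(4)  # entries +/-1, row/col sums all 2
    H = seed
    for _ in range(k - 1):
        H = np.kron(H, seed)
    return H

def groupwise_rotate(x: np.ndarray, group: int = 64) -> np.ndarray:
    """Rotate activations group by group (block-diagonal orthogonal map).

    Compared with one dense d x d rotation (O(d^2) per token), applying a
    fixed g x g rotation to d/g groups costs O(d * g), i.e. linear in d
    for a fixed group size g.
    """
    k = int(round(np.log(group) / np.log(4)))
    assert 4 ** k == group, "group size must be a power of four"
    H = build_regular_hadamard(k) / np.sqrt(group)  # orthonormal rotation
    tokens, d = x.shape
    assert d % group == 0, "hidden dimension must be divisible by the group size"
    xg = x.reshape(tokens, d // group, group)
    return (xg @ H.T).reshape(tokens, d)

if __name__ == "__main__":
    x = np.random.randn(8, 256).astype(np.float32)
    y = groupwise_rotate(x, group=64)
    # An orthonormal rotation preserves per-token norms (up to numerical error).
    print(np.allclose(np.linalg.norm(x, axis=1), np.linalg.norm(y, axis=1), atol=1e-4))
```

In a W4A4 pipeline of the kind described above, such a rotation would be applied to activations (and folded into weights) immediately before quantization, so that outliers are spread across each group rather than concentrated in a few rows or columns.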
Primary Area: generative models
Submission Number: 9209