Abstract: Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for image generation, demonstrating strong scalability and superior performance compared to U-Net architectures. However, their deployment remains hindered by substantial computational and memory costs. While quantization-aware training (QAT) has shown promise for U-Net architectures, its application to DiTs introduces unique challenges, primarily due to activation sensitivity and distributional complexity. In this work, we identify activation quantization as the principal bottleneck in pushing DiTs to extreme low-bit settings and present a systematic study of QAT for DiTs, named RobuQ. We first establish a strong ternary-weight (W1.58A4) DiT baseline. To achieve robust activation quantization, we then propose RobustQuantizer, supported by theoretical analysis showing that the Hadamard transform converts unknown per-token distributions into known normal distributions. In addition, we introduce AMPN, the first Activation-only Mixed-Precision Network pipeline for DiTs, which distributes mixed-precision activation bit-widths across the network to eliminate information bottlenecks, achieves state-of-the-art performance at W1.58A3, and stably supports average precisions as low as W1.58A2 without collapse. Extensive experiments on unconditional and conditional image generation show that our framework consistently outperforms prior state-of-the-art quantization methods, achieving highly efficient DiT quantization while preserving generative fidelity. Together, these contributions substantially advance the practical deployment of DiTs in challenging ultra-low bit quantization scenarios.
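
The claim about the Hadamard transform can be made concrete with a small numerical experiment. The sketch below is not the paper's RobustQuantizer; the Laplace-distributed toy activations and all helper names are assumptions for illustration only. It rotates heavy-tailed per-token activations with an orthonormal Hadamard matrix and compares per-token excess kurtosis and 4-bit quantization error before and after the rotation, showing that the rotated values are close to normal and quantize with lower error.

```python
# Minimal sketch, not the paper's RobustQuantizer: it only illustrates why a
# Hadamard rotation helps low-bit activation quantization. The Laplace toy
# activations and the helper names below are assumptions for this example.
import numpy as np

def hadamard_matrix(n: int) -> np.ndarray:
    """Orthonormal n x n Hadamard matrix via Sylvester construction (n = power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_uniform(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantizer with a per-token (per-row) max-abs scale."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / levels
    return np.round(x / scale) * scale

def mean_excess_kurtosis(x: np.ndarray) -> float:
    """Average per-token excess kurtosis (0 for a normal distribution)."""
    z = (x - x.mean(axis=-1, keepdims=True)) / x.std(axis=-1, keepdims=True)
    return float(np.mean(np.mean(z ** 4, axis=-1) - 3.0))

rng = np.random.default_rng(0)
d = 256                                            # hidden size, power of 2
tokens = rng.laplace(size=(1024, d)) * rng.uniform(0.1, 3.0, size=(1024, 1))
rotated = tokens @ hadamard_matrix(d)              # per-token Hadamard transform

for name, x in [("raw", tokens), ("hadamard", rotated)]:
    err = np.mean((x - quantize_uniform(x, bits=4)) ** 2)   # A4-style activations
    print(f"{name:9s} per-token excess kurtosis {mean_excess_kurtosis(x):+.2f}"
          f"  4-bit MSE {err:.4f}")
```

Because the orthonormal Hadamard matrix preserves signal energy, the quantization error measured in the rotated space carries over directly, so the lower MSE after rotation reflects a genuinely easier-to-quantize activation distribution.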