DiT-Serve: An Efficient Serving Engine for Diffusion Transformers

ICLR 2026 Conference Submission 17030 Authors

19 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: Diffusion, Systems, ML, Serving
TL;DR: Efficient Inference Engine for Diffusion Models
Abstract: Diffusion Transformers (DiTs) are emerging as a powerful class of generative models for high-fidelity image and video generation, powering highly diverse applications whose requests vary in image resolution, video length, and number of denoising steps. Current serving infrastructures largely optimize each request in isolation, missing key opportunities to multiplex GPU compute across requests. Our analysis uncovers two fundamental inefficiencies: spatial underutilization, where GPUs waste compute and memory by padding heterogeneous requests to a common resolution and duration; and temporal underutilization, where batching jobs with varying denoising steps forces GPU cores to idle as shorter requests wait for the longest-running request to finish. To address this, we introduce DiT-Serve, an efficient serving engine for image and video models. First, we propose step-level batching, in which the scheduler preempts and swaps requests at every denoising step, eliminating temporal bubbles. Second, we propose Brick Attention, a new attention algorithm that bin-packs requests of different context lengths onto a set of GPUs, significantly reducing padding overhead. Our evaluation on three state-of-the-art models shows that DiT-Serve achieves on average 2-3× higher throughput and 3-4× lower latency compared to prior systems.
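To make the step-level batching idea concrete, below is a minimal Python sketch of a per-step scheduler. All names here (`DiTRequest`, `serve_step_level`, `denoise_step`) are illustrative assumptions, not DiT-Serve's actual API: the point is only that the batch is re-formed after every denoising step, so a finished request frees its slot immediately instead of idling until the longest-running request in the batch completes.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical request record; field names are illustrative, not DiT-Serve's API.
@dataclass
class DiTRequest:
    req_id: int
    remaining_steps: int   # denoising steps this request still needs
    latents: Any           # current noisy latents for this request

def serve_step_level(requests: List[DiTRequest],
                     batch_size: int,
                     denoise_step: Callable[[Any], Any]) -> Dict[int, Any]:
    """Re-form the batch after every denoising step: finished requests
    free their slot immediately, so waiting requests are admitted without
    idling until the longest-running request in the batch completes."""
    queue = deque(requests)
    outputs: Dict[int, Any] = {}
    while queue:
        # Admit up to `batch_size` requests for exactly one denoising step.
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        for req in batch:
            req.latents = denoise_step(req.latents)  # one DiT forward pass
            req.remaining_steps -= 1
            if req.remaining_steps > 0:
                queue.append(req)                    # swap back in next step
            else:
                outputs[req.req_id] = req.latents    # retire; slot is freed
    return outputs
```

For example, with `batch_size=2` and three requests needing 20, 20, and 50 steps, the 50-step request backfills a slot as soon as a 20-step request retires, rather than the short requests being padded out to 50 steps in a fixed batch.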
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 17030