Keywords: visual generation, flow matching, pyramidal patchification
TL;DR: A new method to accelerate diffusion models with pyramidal patchification.
Abstract: Diffusion Transformers (DiTs) typically use the same patch size for
$\operatorname{Patchify}$
at every timestep,
enforcing a constant token budget across timesteps.
In this paper, we introduce Pyramidal Patchification Flow (PPFlow),
which reduces the number of tokens for high-noise timesteps
to improve the sampling efficiency.
The idea is simple:
use larger patches at higher-noise timesteps and smaller patches at lower-noise timesteps.
The implementation is easy:
share the DiT's transformer blocks across timesteps,
and learn separate linear projections
for different patch sizes in
$\operatorname{Patchify}$
and
$\operatorname{Unpatchify}$.
Unlike Pyramidal Flow,
which operates on pyramid representations,
our approach operates over
full latent representations,
eliminating trajectory ``jump points''
and thus avoiding re-noising tricks during sampling.
Training from pretrained SiT-XL/2 requires only $+8.9\%$ additional training FLOPs and delivers a
$2.02\times$
denoising speedup while preserving image generation quality;
training from scratch achieves a comparable
sampling speedup,
e.g.,
$2.04\times$ for SiT-B.
Training from the text-to-image model FLUX.1, PPFlow achieves $1.61$--$1.86\times$ speedups at resolutions from 512 to 2048 with comparable quality.
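The core mechanism — fewer, larger patches at high-noise timesteps, with a separate linear projection per patch size feeding shared transformer blocks — can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (shapes, patch sizes, and function names are ours, not from the paper), not the authors' implementation:

```python
import numpy as np

# Hypothetical sketch of pyramidal patchification: larger patches at
# high-noise timesteps yield fewer tokens, while each patch size has
# its own Patchify projection into the shared model width. The shared
# transformer blocks (omitted here) consume tokens of either length.

rng = np.random.default_rng(0)
C, H, W, D = 4, 32, 32, 64      # latent channels, spatial size, model width

# One linear projection per patch size (Patchify); Unpatchify would
# hold a mirrored set of projections mapping D back to p*p*C.
patch_sizes = (2, 4)            # small patches (low noise), large patches (high noise)
proj = {p: rng.standard_normal((p * p * C, D)) * 0.02 for p in patch_sizes}

def patchify(z: np.ndarray, p: int) -> np.ndarray:
    """Split a (C, H, W) latent into non-overlapping p x p patches and
    project each flattened patch to the shared model width D."""
    C, H, W = z.shape
    tokens = (z.reshape(C, H // p, p, W // p, p)
               .transpose(1, 3, 0, 2, 4)          # -> (H/p, W/p, C, p, p)
               .reshape((H // p) * (W // p), p * p * C))
    return tokens @ proj[p]

z = rng.standard_normal((C, H, W))
low_noise = patchify(z, 2)      # (256, 64): full token budget
high_noise = patchify(z, 4)     # (64, 64): 4x fewer tokens at high noise
print(low_noise.shape, high_noise.shape)
```

Because both token sequences share the embedding dimension $D$, the same transformer weights can process either, which is what keeps the extra training cost small relative to learning separate models per patch size.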
Primary Area: generative models
Submission Number: 16502