Keywords: diffusion models, generative models, image generation, video generation
Abstract: Diffusion transformers (DiTs) achieve high generative quality but tie FLOPs to image resolution, hindering principled latency-quality trade-offs, and spread computation uniformly across spatial tokens, wasting capacity on unimportant regions. We introduce the Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface: a learnable, variable-length token sequence on which standard transformer blocks operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents, prioritizing the most important input regions. By training with random dropping of tail latents, the module learns importance-ordered representations: earlier latents capture global structure, while later latents refine details. At inference, the number of latents can be adjusted dynamically to meet time or compute budgets, focusing capacity on ``hard'' regions. The design is deliberately minimal: the rectified flow objective and the DiT stack are left unchanged, and only two cross-attention layers are added.
Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K at 512px, it achieves average improvements of 35.3% in FID and 39.6% in FDD over baselines.
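The latent-interface mechanism described above can be sketched in a few lines. This is a minimal, dependency-free illustration under stated assumptions, not the authors' implementation: projections inside attention are omitted (identity maps), the DiT blocks that would process the latents are elided, and all function names (`attend`, `latent_interface`, `random_tail_keep`) are hypothetical. It shows the two ideas the abstract names: Read/Write cross-attention between spatial tokens and latents, and elasticity via truncating tail latents.

```python
import math
import random

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(Q, K, V):
    """Scaled dot-product cross-attention; Q/K/V projections omitted for brevity."""
    d = len(Q[0])
    scores = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d) for kr in K] for qr in Q]
    W = [softmax(row) for row in scores]
    return matmul(W, V)

def latent_interface(spatial, latents, num_keep):
    """Read: latents gather information from spatial tokens.
    Write: spatial tokens read the (processed) latents back.
    num_keep truncates tail latents, trading compute for quality."""
    z = latents[:num_keep]           # elastic: drop tail latents at inference
    z = attend(z, spatial, spatial)  # Read cross-attention
    # ... standard DiT blocks would process z here (omitted) ...
    return attend(spatial, z, z)     # Write cross-attention

def random_tail_keep(n_latents, min_keep=1):
    """Training-time random tail dropping: keep a random-length prefix,
    which encourages an importance-ordered latent sequence."""
    return random.randint(min_keep, n_latents)
```

Note that the output always has one token per spatial token, regardless of `num_keep`; only the width of the latent bottleneck, and hence the compute spent in the transformer stack, changes.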
Primary Area: generative models
Submission Number: 2651