SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer
TL;DR: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer
Abstract: This paper presents SANA-1.5, a linear Diffusion Transformer for efficient scaling in text-to-image generation. Building upon SANA-1.0, we introduce three key innovations: (1) Efficient Training Scaling: A depth-growth paradigm that enables scaling from 1.6B to 4.8B parameters with significantly reduced computational resources, combined with a memory-efficient 8-bit optimizer. (2) Model Depth Pruning: A block importance analysis technique for efficient model compression to arbitrary sizes with minimal quality loss. (3) Inference-time Scaling: A repeated sampling strategy that trades computation for model capacity, enabling smaller models to match the quality of larger models at inference time. Through these strategies, SANA-1.5 achieves a text-image alignment score of 0.72 on GenEval, which can be further improved to 0.80 through inference scaling, establishing a new SoTA on the GenEval benchmark. These innovations enable efficient model scaling across different compute budgets while maintaining high quality, making high-quality image generation more accessible.
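The repeated sampling strategy in point (3) can be illustrated with a generic best-of-N loop: draw several candidates for the same prompt with different noise seeds, score each one, and keep the best. The sketch below is a minimal illustration under stated assumptions, not the SANA-1.5 implementation. The names `best_of_n`, `toy_generate`, and `toy_score` are hypothetical, and the toy stand-ins replace the real image generator and the alignment scorer, which the abstract does not specify.

```python
import random
from typing import Callable, Tuple, TypeVar

Sample = TypeVar("Sample")


def best_of_n(
    generate: Callable[[str, int], Sample],
    score: Callable[[str, Sample], float],
    prompt: str,
    n: int,
    seed: int = 0,
) -> Tuple[float, Sample]:
    """Draw n candidates with different noise seeds and keep the top-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    best = None
    for _ in range(n):
        # Each candidate gets its own noise seed, so the samples differ.
        candidate = generate(prompt, rng.randrange(2**31))
        s = score(prompt, candidate)
        if best is None or s > best[0]:
            best = (s, candidate)
    return best


# Toy stand-ins (assumptions for illustration only): a "sample" is a float
# in [0, 1) and the scorer simply returns it. A real pipeline would call a
# diffusion model and a text-image alignment scorer here.
def toy_generate(prompt: str, noise_seed: int) -> float:
    return random.Random(noise_seed).random()


def toy_score(prompt: str, sample: float) -> float:
    return sample


prompt = "a red cube on top of a blue sphere"
single = best_of_n(toy_generate, toy_score, prompt, n=1)
boosted = best_of_n(toy_generate, toy_score, prompt, n=16)
print(f"n=1 score: {single[0]:.3f}, n=16 score: {boosted[0]:.3f}")
```

Because both calls share the same seed, the first candidate is identical in each, so the n=16 score can never be lower than the n=1 score. The cost is sixteen generation calls instead of one, which is the compute-for-capacity trade the abstract describes.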
Lay Summary: Creating high-quality images from text descriptions typically requires massive computing power, putting advanced AI image generation out of reach for many researchers and developers.
We developed SANA-1.5, a smarter AI system that achieves top-tier image generation while being dramatically more efficient. Our key breakthroughs include:
1. A "growing" training method that builds larger models using 60% less computing power
2. A compression technique that shrinks models without losing quality
3. A clever sampling trick that lets smaller models temporarily boost their capabilities
These innovations allow SANA-1.5 to match or exceed the performance of systems like Stable Diffusion XL while being more accessible. On the GenEval benchmark, it achieves a record-breaking score for matching images to text descriptions (0.80, or 80%, when using our sampling boost).
By making advanced image generation more efficient, SANA-1.5 helps democratize AI creativity, enabling more researchers to experiment with the technology and developers to integrate it into applications without needing expensive hardware.
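The compression idea mentioned above (rank the model's blocks by importance, then drop the least important ones until a target depth is reached) can be sketched as follows. This is a hedged illustration only: the importance metric used here, one minus the cosine similarity between a block's input and output, is an assumption chosen as a common proxy, since the text above does not state which metric SANA-1.5 uses. The function names and the toy hidden states are likewise hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def block_importance(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """hidden_states[i] is the input to block i and hidden_states[i + 1] its output.

    Assumed proxy: a block that barely changes its input (similarity near 1)
    gets an importance near 0.
    """
    return [
        1.0 - cosine(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]


def blocks_to_keep(importance: Sequence[float], target_depth: int) -> List[int]:
    """Return the indices of the target_depth most important blocks, in model order."""
    if not 0 < target_depth <= len(importance):
        raise ValueError("target_depth must be between 1 and the number of blocks")
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:target_depth])


# Toy activations for a 3-block model (made-up numbers for illustration).
states = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0], [1.0, 1.0]]
importance = block_importance(states)
kept = blocks_to_keep(importance, target_depth=2)
print(importance, kept)
```

In this toy example block 1 only rescales its input, so it scores lowest and is the one pruned. In practice the activations would be averaged over a calibration set, and a pruned model would typically be fine-tuned briefly to recover quality.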
Link To Code: https://github.com/NVlabs/Sana
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Diffusion Model, Inference Scaling, Linear Attention
Submission Number: 1929