Faster Diffusion Through Temporal Attention Decomposition

TMLR Paper 3503 Authors

16 Oct 2024 (modified: 27 Oct 2024) · Under review for TMLR · CC BY 4.0
Abstract: We explore the role of the attention mechanism during inference in text-conditional diffusion models. Empirical observations suggest that cross-attention outputs converge to a fixed point after several inference steps. This convergence time naturally divides the inference process into two phases: an initial phase that plans text-oriented visual semantics, followed by a fidelity-improving phase that translates those semantics into images. Cross-attention is essential in the initial phase but almost irrelevant thereafter; self-attention, in contrast, plays a minor role initially but becomes increasingly important in the second phase. These findings yield a simple, training-free method called TGATE, which generates images efficiently by caching and reusing attention outputs at scheduled time steps. Experiments show TGATE's broad applicability to various existing text-conditional diffusion models, which it speeds up by 10-50%. The code of TGATE is available at [Placeholder].
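
The caching idea described in the abstract can be illustrated with a minimal sketch. The wrapper class, argument names, and the `gate_step` parameter below are illustrative assumptions, not the authors' released code or API: before a chosen gate step, the wrapped cross-attention block runs normally; at the gate step its output is frozen and reused for all remaining inference steps.

```python
# Minimal sketch of a TGATE-style gating wrapper (names are hypothetical).
import torch
import torch.nn as nn


class GatedCrossAttention(nn.Module):
    """Wraps a cross-attention module and reuses its output after `gate_step`."""

    def __init__(self, attn: nn.Module, gate_step: int):
        super().__init__()
        self.attn = attn            # underlying cross-attention block (assumed signature)
        self.gate_step = gate_step  # inference step after which outputs are reused
        self.cache = None

    def forward(self, hidden_states, encoder_hidden_states, step: int):
        if step < self.gate_step or self.cache is None:
            # Early "semantics-planning" phase: run cross-attention normally.
            out = self.attn(hidden_states, encoder_hidden_states)
            if step == self.gate_step - 1:
                self.cache = out.detach()  # freeze the converged output
            return out
        # Late "fidelity-improving" phase: reuse the cached output and skip
        # the text-conditioned attention computation entirely.
        return self.cache
```

In this sketch, wrapping each cross-attention block in a denoising network and passing the current step index at every call would skip the text-conditioned computation after the gate step, which is the source of the reported speed-up.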
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Hanwang_Zhang3
Submission Number: 3503