Abstract: Attention has been a crucial component in the success of image diffusion models; however, its quadratic computational complexity limits the sizes of images we can process within reasonable time and memory constraints. This paper investigates the importance of dense attention in generative image models, whose features often contain redundancy, making them well suited to sparser attention mechanisms.
mechanisms. We propose a novel training-free
method ToDo that relies on token downsampling of
key and value tokens to accelerate Stable Diffusion
inference by up to 2x for common sizes and up to
4.5x or more for high resolutions like 2048×2048.
We demonstrate that our approach outperforms previous methods in balancing efficient throughput
and fidelity.
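
As a rough illustration of the idea, the sketch below downsamples the key and value tokens on their 2D spatial grid before a standard attention computation, while the queries keep every token so the output remains at full resolution. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the choice of average pooling, the omission of the Q/K/V projections, and the tensor layout are all illustrative.

```python
# Illustrative sketch of attention with downsampled keys/values.
# Assumptions: tokens are laid out as (batch, h*w, dim), pooling is a
# simple 2x2 average, and Q/K/V projections are omitted for brevity.
import torch
import torch.nn.functional as F

def downsample_kv(tokens: torch.Tensor, h: int, w: int,
                  factor: int = 2) -> torch.Tensor:
    """Spatially downsample a (batch, h*w, dim) token sequence."""
    b, n, d = tokens.shape
    grid = tokens.transpose(1, 2).reshape(b, d, h, w)  # restore 2D grid
    grid = F.avg_pool2d(grid, kernel_size=factor)      # coarsen the grid
    return grid.flatten(2).transpose(1, 2)             # back to a sequence

def attention_with_downsampled_kv(x: torch.Tensor, h: int, w: int,
                                  factor: int = 2) -> torch.Tensor:
    """Self-attention with full queries but fewer key/value tokens."""
    q = x                                 # queries keep every token
    kv = downsample_kv(x, h, w, factor)   # keys/values use fewer tokens
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ kv.transpose(-2, -1) * scale, dim=-1)
    return attn @ kv                      # output stays full resolution
```

Because attention cost scales with the product of the query and key counts, shrinking the key/value grid by a factor of f in each spatial dimension reduces the attention matrix from n×n to n×(n/f²), which is where the throughput gain comes from.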