ToMA: Token Merge with Attention for Diffusion Models

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC 4.0
TL;DR: We propose an improved token merging algorithm to speed up diffusion-based image generation; it is the only such method that delivers practical speedups on DiT.
Abstract: Diffusion models excel in high-fidelity image generation but face scalability limits due to transformers’ quadratic attention complexity. Plug-and-play token reduction methods like ToMeSD and ToFu reduce FLOPs by merging redundant tokens in generated images but rely on GPU-inefficient operations (e.g., sorting, scattered writes), introducing overheads that negate theoretical speedups when paired with optimized attention implementations (e.g., FlashAttention). To bridge this gap, we propose **To**ken **M**erge with **A**ttention (ToMA), an off-the-shelf method that redesigns token reduction for GPU-aligned efficiency, with three key contributions: 1) a reformulation of token merging as a submodular optimization problem to select diverse tokens; 2) merge/unmerge as an attention-like linear transformation via GPU-friendly matrix operations; and 3) exploiting latent locality and sequential redundancy (pattern reuse) to minimize overhead. ToMA reduces SDXL/Flux generation latency by 24%/23% (DINO $\Delta <$ 0.07), outperforming prior methods. This work bridges the gap between theoretical and practical efficiency for transformers in diffusion.
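To illustrate contribution 1), the sketch below greedily maximizes a facility-location objective, a standard submodular surrogate for selecting a small, diverse set of "representative" tokens. The objective, the function name `select_diverse_tokens`, and its parameters are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def select_diverse_tokens(x: torch.Tensor, k: int) -> torch.Tensor:
    """Greedily maximize the facility-location objective
    f(S) = sum_i max_{j in S} sim(i, j), a submodular measure of how well
    the selected set S 'covers' all N tokens.

    x: (N, d) token features for one latent; returns (k,) selected indices.
    """
    xn = torch.nn.functional.normalize(x, dim=-1)
    sim = xn @ xn.T                                  # (N, N) cosine similarities
    n = sim.shape[0]
    coverage = torch.zeros(n, device=x.device)       # best similarity to S so far, per token
    selected: list[int] = []
    for _ in range(k):
        # Marginal gain of adding candidate j, given the tokens already selected.
        gain = torch.clamp(sim - coverage[:, None], min=0).sum(dim=0)
        if selected:
            gain[torch.tensor(selected, device=x.device)] = float("-inf")
        j = int(gain.argmax())
        selected.append(j)
        coverage = torch.maximum(coverage, sim[:, j])
    return torch.tensor(selected, device=x.device)
```

With k chosen much smaller than N, the subsequent attention layers operate only on the selected representatives, which is where the FLOP savings come from.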
Lay Summary: AI image generators such as Stable Diffusion paint stunning pictures, but they must process thousands of tiny image pieces—called tokens—at every step. Shuffling so many tokens makes creation slow and energy-hungry. Earlier shortcuts tried to merge similar tokens, yet the extra bookkeeping erased most of the speed gains. Our work presents ToMA (Token Merge with Attention), a plug-in that lets the model spot and temporarily group tokens that carry nearly the same information. We choose these groups with a fast, easy-to-compute rule that picks a small, diverse set of “representative” tokens, then use the same GPU-friendly math the model already employs for its internal reasoning. After the heavy thinking is done, ToMA cleanly spreads the results back to every original token, so image quality stays intact. In practice, ToMA cuts the time to create a high-resolution image by roughly one-quarter on today’s hardware while keeping visual scores nearly unchanged. Faster generation means lower energy use, smoother creative workflows, and wider public access to top-tier generative art tools.
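To make the "GPU-friendly math" concrete, the sketch below writes merge and unmerge as dense matrix products (plain GEMMs), in the spirit of ToMA's attention-like linear transformation. The soft-assignment construction, the temperature `tau`, and the helper `build_merge_matrices` are assumptions for exposition rather than the paper's implementation.

```python
import torch

def build_merge_matrices(x: torch.Tensor, idx: torch.Tensor, tau: float = 0.1):
    """x: (N, d) tokens, idx: (k,) indices of representative tokens.
    Returns (merge, unmerge) of shapes (k, N) and (N, k)."""
    xn = torch.nn.functional.normalize(x, dim=-1)
    sim = xn @ xn[idx].T                       # (N, k) token-to-representative similarity
    assign = torch.softmax(sim / tau, dim=-1)  # soft assignment of every token to a representative
    # Each merged token is a weighted average of the tokens assigned to it.
    merge = assign.T / assign.sum(dim=0).clamp_min(1e-6)[:, None]   # (k, N)
    unmerge = assign                           # (N, k): broadcasts results back to all tokens
    return merge, unmerge

# Usage: run the expensive attention block on k << N merged tokens.
x = torch.randn(4096, 320)                     # e.g. a 64x64 latent with 320-dim tokens
idx = torch.randperm(4096)[:1024]              # stand-in for the diverse-token selection step
merge, unmerge = build_merge_matrices(x, idx)
x_small = merge @ x                            # (k, d): merge is one matrix multiply
# y_small = attention_block(x_small)           # heavy compute on the reduced token set
x_restored = unmerge @ x_small                 # (N, d): unmerge is another matrix multiply
```

Because both directions are single matrix multiplies, they map onto the same optimized GPU kernels as attention itself, avoiding the sorting and scattered writes that slow down prior token-merging methods.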
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Diffusion, Token Merge, Attention, Submodular Optimization
Submission Number: 7047