Keywords: Generative models, diffusion, inference acceleration
TL;DR: Diffusion on Demand: accelerating diffusion transformers by reusing cached features and selectively adapting them with learned, gated linear modulation.
Abstract: Diffusion transformers demonstrate significant potential across a variety of generation tasks but suffer from high computational cost at inference. Recently, feature caching methods have been introduced to improve inference efficiency by storing features at certain timesteps and reusing them at subsequent timesteps. However, their effectiveness is limited because they only choose between reusing a cached feature as-is and recomputing it with full model inference. Motivated by the high cosine similarity between features at consecutive timesteps, we propose a cache-based framework that reuses features and selectively adapts them through linear modulation. In our framework, the selection is performed by a modulation gate, and both the gate and the modulation parameters are learned. Extensive experiments show that our method achieves generation quality comparable to the original sampler while requiring significantly less computation. For example, FLOPs and inference latency are reduced by $2.93\times$ and $2.15\times$ for DiT-XL/2 and by $2.83\times$ and $1.50\times$ for PixArt-$\alpha$, respectively. We find that modulation is effective when applied to as few as 2\% of layers, resulting in negligible computational overhead.
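As a rough illustration of the idea described in the abstract, the following PyTorch-style sketch shows one way a cached feature could be reused and selectively adapted via learned linear modulation controlled by a learned gate. The class name, tensor shapes, and the soft-gating scheme are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): reuse a cached feature and optionally
# adapt it with a learned linear modulation, selected by a learned gate.
import torch
import torch.nn as nn


class CachedModulatedLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Learned linear modulation parameters (per-channel scale and shift).
        self.scale = nn.Parameter(torch.ones(dim))
        self.shift = nn.Parameter(torch.zeros(dim))
        # Learned gate logit; sigmoid(gate_logit) decides how much modulation to apply.
        self.gate_logit = nn.Parameter(torch.zeros(1))

    def forward(self, cached_feature: torch.Tensor) -> torch.Tensor:
        # cached_feature: feature stored at an earlier timestep, e.g. shape (B, N, dim).
        gate = torch.sigmoid(self.gate_logit)
        modulated = cached_feature * self.scale + self.shift
        # Soft gating during training; at inference the gate could be thresholded
        # so that most layers simply reuse the cached feature unchanged.
        return gate * modulated + (1.0 - gate) * cached_feature
```

Under this (assumed) formulation, modulation adds only an elementwise scale, shift, and blend per gated layer, which is consistent with the abstract's claim that applying it to a small fraction of layers incurs negligible overhead.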
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 19855