Abstract: This work aims to improve the efficiency of text-to-image diffusion models. While diffusion models use computationally expensive UNet-based denoising operations in every generation step, we identify that not all operations are equally relevant for the final output quality. In particular, we observe that UNet layers operating on high-res feature maps are relatively sensitive to small perturbations. In contrast, low-res feature maps influence the semantic layout of the final image and can often be perturbed with no noticeable change in the output. Based on this observation, we propose Clockwork Diffusion, a method that periodically reuses computation from preceding denoising steps to approximate low-res feature maps at one or more subsequent steps. For multiple baselines, and for both text-to-image generation and image editing, we demonstrate that Clockwork leads to comparable or improved perceptual scores with drastically reduced computational complexity. As an example, for Stable Diffusion v1.5 with 8 DPM++ steps we save 32% of FLOPs with negligible FID and CLIP change. We release code at https://github.com/Qualcomm-AI-research/clockwork-diffusion
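To make the periodic-reuse idea concrete, below is a minimal, self-contained Python sketch of a denoising loop that recomputes the low-res part of the UNet only every `clock` steps and reuses the cached result in between. This is not the released implementation: all function names (`high_res_encoder`, `low_res_core`, `high_res_decoder`, `clockwork_denoise`) are hypothetical placeholders, and the sketch uses naive caching purely to illustrate the schedule described in the abstract; see the linked repository for the actual method.

```python
import numpy as np

# Hypothetical stand-ins for the UNet's expensive low-res core and its
# cheaper high-res input/output paths; a real model would replace these.
def high_res_encoder(x):
    return 0.5 * x                      # placeholder computation

def low_res_core(h):
    return np.tanh(h)                   # expensive part we want to skip

def high_res_decoder(h, core_out):
    return h + core_out                 # placeholder computation

def clockwork_denoise(x, num_steps=8, clock=2):
    """Toy denoising loop: evaluate the low-res core only every `clock`
    steps and reuse the cached output on the steps in between."""
    cached_core = None
    for step in range(num_steps):
        h = high_res_encoder(x)
        if step % clock == 0 or cached_core is None:
            cached_core = low_res_core(h)   # full (expensive) evaluation
        # On the remaining steps the cached low-res output serves as an
        # approximation, exploiting that low-res features change slowly.
        x = high_res_decoder(h, cached_core)
    return x

if __name__ == "__main__":
    latent = np.random.randn(4, 4).astype(np.float32)
    print(clockwork_denoise(latent))
```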