Keywords: Diffusion Model, Model Acceleration, Model Compression, Feature Caching
Abstract: Diffusion transformers have become the most powerful models for visual generation, but they still suffer from massive computation costs. To reduce these costs, feature caching has been proposed: features computed by the diffusion model in previous steps are cached and reused in the following caching steps, which brings significant acceleration but also degrades generation quality. To address this degradation, this paper proposes Z-Cache, a feature caching method that maintains high-quality generation through self-reflection. Concretely, we observe that the error introduced by feature caching tends to drop sharply after each full computation. Based on this observation, Z-Cache first predicts the features in the upcoming caching steps and then performs a full computation. After that, Z-Cache returns to the caching steps and re-predicts them based on both the previous and the current computation steps, which corrects the cached features. Experiments demonstrate that with Z-Cache, diffusion transformers achieve generation quality comparable to the original model with much faster inference, for instance, a 5.53$\times$ acceleration on FLUX-dev for text-to-image generation. Our code is included in the supplementary materials and will be released on GitHub.
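The predict-then-correct loop described in the abstract can be illustrated with a minimal sketch. All names here (`model_step`, `zcache_sketch`, `cache_interval`) are hypothetical, and the linear-interpolation correction is only one simple way to combine the previous and current full computations; the released code defines the actual scheme.

```python
import torch

def zcache_sketch(model_step, x, timesteps, cache_interval=4):
    """Sketch of the self-reflection idea: reuse the last full computation
    during caching steps, then revisit those steps once the next full
    computation is available and re-predict their features from both ends."""
    prev_full = None          # feature from the last full computation
    pending = []              # caching-step indices awaiting correction
    feats = [None] * len(timesteps)

    for i, t in enumerate(timesteps):
        if i % cache_interval == 0:
            cur_full = model_step(x, t)            # full computation
            # Self-reflection: correct the features predicted since the
            # previous full step using both endpoints (linear blend here,
            # as an illustrative assumption).
            for j in pending:
                alpha = (j - (i - cache_interval)) / cache_interval
                feats[j] = (1.0 - alpha) * prev_full + alpha * cur_full
            pending.clear()
            feats[i] = cur_full
            prev_full = cur_full
        else:
            feats[i] = prev_full                   # cheap prediction: reuse
            pending.append(i)
    return feats
```

In an actual sampler the corrected features would feed back into the denoising updates for those steps; the sketch only shows how the caching steps are first predicted and later re-predicted once the surrounding full computations exist.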
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4826