HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration

ICLR 2025 Conference Submission 1567 Authors

18 Sept 2024 (modified: 26 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: diffusion, acceleration, feature cache
Abstract: Diffusion Transformers (DiTs) have gained prominence for their outstanding scalability and extraordinary performance in generative tasks. However, their considerable inference costs impede practical deployment. The feature cache mechanism, which stores and retrieves redundant computations across timesteps, holds promise for reducing per-step inference time in diffusion models. Most existing caching methods for DiT are manually designed. Although the learning-based approach attempts to optimize caching strategies adaptively, it suffers from discrepancies between training and inference, which hamper both performance and the acceleration ratio. Upon detailed analysis, we pinpoint that these discrepancies primarily stem from two aspects: (1) _Prior Timestep Disregard_, where training ignores the effect of cache usage at earlier timesteps, and (2) _Objective Mismatch_, where the training target (aligning the predicted noise at each timestep) deviates from the goal of inference (generating a high-quality image). To alleviate these discrepancies, we propose **HarmoniCa**, a novel method that **harmoni**zes training and inference within a learning-based **ca**ching framework built upon _Step-Wise Denoising Training_ (SDT) and an _Image Error Proxy-Guided Objective_ (IEPO). Compared to the traditional training paradigm, SDT maintains the continuity of the denoising process, enabling the model to leverage information from prior timesteps during training, just as it does during inference. Furthermore, we design IEPO, which integrates an efficient proxy mechanism to approximate the final image error caused by reusing cached features. IEPO thus balances final image quality against cache utilization, resolving the issue that training only considers the impact of cache usage on the predicted output at each individual timestep. Extensive experiments on class-conditional and text-to-image (T2I) tasks across 8 models and 4 samplers, with resolutions ranging from $256\times256$ to $2048\times2048$, demonstrate the exceptional performance and speedup capabilities of HarmoniCa. For example, HarmoniCa is the first feature cache method applied to the 20-step PixArt-$\alpha$ $1024\times1024$ model to achieve over a 1.5$\times$ latency speedup with improved FID compared to the non-accelerated model. Remarkably, HarmoniCa requires no image data during training and reduces training time by about 25\% compared to the existing learning-based approach.
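To make the feature cache mechanism described above concrete, the following is a minimal, hypothetical sketch of caching in a DiT-style denoising loop: at each timestep, a per-layer reuse decision determines whether a transformer block is recomputed or its output is retrieved from the cache populated at an earlier step. The toy `Block`, the fixed `reuse_schedule`, and all shapes are illustrative assumptions, not the authors' implementation (HarmoniCa learns such a schedule via SDT and IEPO rather than fixing it by hand).

```python
# Illustrative sketch only: a hand-set reuse schedule standing in for a
# learned caching strategy. Names and shapes are assumptions.
import torch
import torch.nn as nn

class Block(nn.Module):
    """A toy transformer-style block (stand-in for a real DiT block)."""
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())

    def forward(self, x):
        return x + self.ff(x)

class CachedDiT(nn.Module):
    def __init__(self, dim=64, depth=4, steps=20):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))
        # One reuse decision per (timestep, layer); True = reuse cached feature.
        # A learning-based method would optimize this schedule; here it is fixed.
        self.reuse_schedule = torch.zeros(steps, depth, dtype=torch.bool)
        self.cache = [None] * depth

    def forward(self, x, t):
        for i, blk in enumerate(self.blocks):
            if self.reuse_schedule[t, i] and self.cache[i] is not None:
                x = self.cache[i]   # skip computation: retrieve the cached output
            else:
                x = blk(x)          # recompute this block and refresh its cache
                self.cache[i] = x
        return x

model = CachedDiT()
model.reuse_schedule[1::2, :] = True  # e.g., reuse all block outputs on odd steps
x = torch.randn(1, 16, 64)
for t in range(20):   # the denoising loop spans all timesteps, so cached
    x = model(x, t)   # features from earlier steps affect later ones
```

Because cached features from earlier timesteps propagate into later ones, training a reuse schedule one timestep at a time ignores this dependency; this is the _Prior Timestep Disregard_ the abstract identifies, and it is why SDT runs training through the full denoising sequence.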
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1567