HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration
TL;DR: A learning-based feature caching method to accelerate diffusion transformer inference.
Abstract: Diffusion Transformers (DiTs) excel in generative tasks but face practical deployment challenges due to high inference costs. Feature caching, which stores and retrieves redundant computations, offers the potential for acceleration. Existing learning-based caching, though adaptive, overlooks the impact of the prior timestep. It also suffers from misaligned objectives between training and inference: the former aligns predicted noise, while the latter targets high-quality images. These two discrepancies compromise both performance and efficiency.
To this end, we *harmonize* training and inference with a novel learning-based *caching* framework dubbed **HarmoniCa**. It first incorporates *Step-Wise Denoising Training* (SDT) to ensure the continuity of the denoising process, where prior steps can be leveraged. In addition, an *Image Error Proxy-Guided Objective* (IEPO) is applied to balance image quality against cache utilization through an efficient proxy to approximate the image error. Extensive experiments across $8$ models, $4$ samplers, and resolutions from $256\times256$ to $2K$ demonstrate superior performance and speedup of our framework. For instance, it achieves over $40\%$ latency reduction (*i.e.*, $2.07\times$ theoretical speedup) and improved performance on PixArt-$\alpha$. Remarkably, our *image-free* approach reduces training time by $25\%$ compared with the previous method. Our code is available at https://github.com/ModelTC/HarmoniCa.
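To make the cache-or-compute idea concrete, below is a minimal PyTorch sketch of learned feature caching around a single transformer block. All names (`CachedDiTBlock`, `router_logits`, etc.) are illustrative assumptions, not the actual HarmoniCa implementation; it only shows the inference-time reuse path, whereas HarmoniCa additionally trains the caching decisions with SDT (unrolling the full denoising trajectory so later steps see caches produced earlier) and weights the loss with the IEPO image-error proxy.

```python
# Hypothetical sketch of per-timestep feature caching for a DiT block (not the official code).
import torch
import torch.nn as nn


class CachedDiTBlock(nn.Module):
    """Wraps a transformer block with a learned cache-or-compute decision per denoising step."""

    def __init__(self, block: nn.Module, num_timesteps: int):
        super().__init__()
        self.block = block
        # Learnable logits: for each denoising step, reuse the cached output (>0)
        # or recompute the block (<=0). In practice these would be trained end-to-end.
        self.router_logits = nn.Parameter(torch.zeros(num_timesteps))
        self.cache = None  # most recent block output

    def forward(self, x: torch.Tensor, t: int) -> torch.Tensor:
        if self.cache is not None and self.router_logits[t] > 0:
            return self.cache            # skip computation, reuse stored features
        out = self.block(x)              # full computation for this step
        self.cache = out.detach()        # store for potential reuse at later steps
        return out
```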
Lay Summary: Generative AI models called Diffusion Transformers can create stunning images, but they are painfully slow because they repeat similar calculations dozens of times during each generation. Our team asked: *Can we reuse those repeated computations instead of recalculating them, without hurting image quality?* We built **HarmoniCa**, a “feature-caching” system that learns when to store and when to recall intermediate results inside the model. To train this cache intelligently, we introduced two new techniques: *Step-Wise Denoising Training*, which lets the model practise using its cache across the full generation process, and an *Image-Error Proxy*, which teaches the model to protect final image quality while still maximising speed. In tests on eight state-of-the-art diffusion models, HarmoniCa cut inference time by up to $40\%$ while slightly improving image quality scores. Overall, HarmoniCa paves the way for real-time, high-resolution generative media on everyday hardware.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/ModelTC/HarmoniCa
Primary Area: Applications->Computer Vision
Keywords: diffusion transformer, acceleration, feature caching
Submission Number: 8718