CCE: A 28nm Content Creation Engine with Asymmetric Computing, Semantic-Driven Instruction Generation and Collision-Free Outlier Mapper for Video Generation

Published: 2025 · Last Modified: 29 Jan 2026 · CICC 2025 · CC BY-SA 4.0
Abstract: Content creation applications have become a cornerstone of next-generation personal devices. A prime example is video generation, which involves generation, language encoding, editing, and enhancement. These tasks rely heavily on diverse models such as denoising diffusion, Transformers, and super-resolution (SR) networks (Fig. 1). However, generating a 15-minute, 60 fps, 720p video on a GPU (Nvidia A100) currently takes approximately 52 hours, which is impractical for lengthy videos or iterative refinement. Accelerating this process presents three key challenges. 1) Asymmetric computing is needed: integer formats alone are insufficient for diffusion and Transformer models, yet the significant overhead of FP MACs hinders processing efficiency. Whereas prior work typically targets symmetric computing formats, INT*FP computing is required here because it strikes a better balance between model performance and MAC complexity. Additionally, exploiting similarities across adjacent frames and denoising steps requires variable-precision support for INT operations. 2) Redundancy in multi-modal inputs must be exploited efficiently: substantial storage and power can be saved by filtering out low-information portions of the input. 3) Multiple tasks with diverse operators must be accelerated: diffusion, Transformer, and SR models consist of varying proportions of different layers, making heterogeneous optimization inefficient given the constrained chip area. A unified architecture can instead be designed by distilling the features common to these tasks: by subtracting the P-frame/key denoising step, all three tasks can be reformulated as dense low bit-width data plus highly sparse outlier computations.
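The dense-plus-outlier reformulation described above can be sketched in software. The snippet below is a minimal illustration, not the paper's hardware mapper: it assumes a simple percentile threshold (a hypothetical choice) to peel off the few large-magnitude values as sparse FP outliers, while the remaining values are symmetrically quantized to a dense low-bit INT grid — the split that lets dense INT*FP MACs handle the bulk of the work.

```python
import numpy as np

def split_dense_outlier(x, bits=4, outlier_pct=1.0):
    """Split a tensor into a dense low-bit INT part plus sparse FP outliers.

    Hypothetical illustration of the dense + sparse-outlier reformulation:
    values whose magnitude exceeds a percentile threshold stay in FP as
    sparse outliers; the rest are quantized to a dense INT grid.
    """
    thresh = np.percentile(np.abs(x), 100 - outlier_pct)
    outlier_mask = np.abs(x) > thresh
    # Sparse FP outliers, stored as (flat indices, values).
    outlier_idx = np.flatnonzero(outlier_mask)
    outlier_val = x.flat[outlier_idx].astype(np.float32)
    # Dense low-bit part: symmetric quantization of the remaining inliers.
    inliers = np.where(outlier_mask, 0.0, x)
    qmax = 2 ** (bits - 1) - 1
    peak = np.max(np.abs(inliers))
    scale = peak / qmax if peak > 0 else 1.0
    q = np.clip(np.round(inliers / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale, outlier_idx, outlier_val

def reconstruct(q, scale, outlier_idx, outlier_val, shape):
    """Rebuild the tensor: dequantize the dense part, then patch in outliers."""
    x = (q.astype(np.float32) * scale).reshape(-1)
    x[outlier_idx] = outlier_val
    return x.reshape(shape)
```

With ~1% outliers kept in FP, the dense part's error is bounded by half a quantization step while the outliers are reconstructed exactly, which is why the split preserves model quality at low bit-widths.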