FlightVGM: Efficient Video Generation Model Inference with Online Sparsification and Hybrid Precision on FPGAs

Jun Liu, Shulin Zeng, Li Ding, Widyadewi Soedarmadji, Hao Zhou, Zehao Wang, Jinhao Li, Jintao Li, Yadong Dai, Kairui Wen, Shan He, Yaqi Sun, Yu Wang, Guohao Dai

Published: 27 Feb 2025, Last Modified: 06 Nov 2025 · Crossref · CC BY-SA 4.0

Abstract: Video Generation Models (VGMs), as representatives of multi-modal large models, have revolutionized the productivity of video content creation. VGMs are compute-bound because they adopt the Diffusion Transformer (DiT) structure. Sparsification is a common method for accelerating compute-intensive models, but sparse VGMs cannot fully exploit the effective throughput (i.e., TOPS) of GPUs. FPGAs are good candidates for accelerating sparse deep learning models. However, existing FPGA accelerators still achieve low throughput (<2 TOPS) on VGMs because of the significant gap in peak computing performance (PCP) relative to GPUs (>21×). To achieve higher throughput than GPUs, FPGA-based acceleration of sparse VGMs must overcome three challenges: large redundancy in activations, low performance of DSPs under hybrid precision, and under-utilization when static compilation is used for online compression. To tackle these challenges, we propose FlightVGM, the first FPGA accelerator for efficient VGM inference with activation sparsification and hybrid precision. FlightVGM is motivated by the observation that VGMs exhibit different compression preferences across dimensions and layers. To exploit the similarity of video frames in the temporal and spatial dimensions, we propose a spatial-temporal online activation sparsification architecture, reducing the computational cost by 3.17×. To provide a good trade-off between the accuracy and efficiency of VGMs, we employ fixed-point precision for linear layers and retain floating-point precision for attention layers. We then propose a floating-fixed hybrid precision DSP58 expansion architecture on the AMD V80 FPGA, boosting the PCP by 3.26×. Finally, to make FlightVGM applicable to various workloads, we propose a dynamic-static combined adaptive scheduling method for low-overhead online sparsification, improving the computation utilization by 2.75×. Implemented on the AMD V80 FPGA, FlightVGM surpasses the NVIDIA 3090 GPU by 1.30× in performance and 4.49× in energy efficiency on various sparse VGM workloads.
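
To make the idea of online activation sparsification concrete, the following is a minimal software sketch of the temporal (inter-frame) half only. It is not the FlightVGM architecture: the function names `cosine` and `sparse_apply`, the cosine-similarity criterion, and the `threshold` value are all assumptions chosen for illustration. A token whose activation is close enough to the last computed token at the same position reuses that token's output instead of being recomputed.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def sparse_apply(frames, layer, threshold=0.99):
    """Apply `layer` token by token, skipping temporally redundant tokens.

    frames: list of frames; each frame is a list of token vectors.
    layer:  any function mapping one token vector to an output vector.
    threshold: hypothetical similarity cut-off (an assumption, not a
               value taken from the paper).

    Returns (outputs, computed, total), where `computed` counts the
    tokens for which `layer` was actually evaluated.
    """
    outputs = []
    computed = 0
    total = 0
    ref_in = {}   # position -> last token that was actually computed
    ref_out = {}  # position -> output of that token
    for frame in frames:
        out_frame = []
        for pos, tok in enumerate(frame):
            total += 1
            if pos in ref_in and cosine(tok, ref_in[pos]) >= threshold:
                # Similar to the reference token: reuse its output.
                out_frame.append(ref_out[pos])
            else:
                # Dissimilar (or first frame): compute and update reference.
                y = layer(tok)
                ref_in[pos] = tok
                ref_out[pos] = y
                computed += 1
                out_frame.append(y)
        outputs.append(out_frame)
    return outputs, computed, total


if __name__ == "__main__":
    toy_frames = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 0.001], [1.0, 1.0]],
        [[1.0, 0.0], [1.0, 1.0]],
    ]
    double = lambda v: [2.0 * x for x in v]
    _, computed, total = sparse_apply(toy_frames, double)
    print(f"computed {computed} of {total} tokens")
```

On this toy input, only 3 of the 6 token evaluations are performed. In a sketch like this the decision is made per token at run time, so the amount of remaining work varies from input to input, which is one way to see why a purely static schedule can leave compute units idle.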

External IDs: doi:10.1145/3706628.3708864