Abstract: This paper introduces SDA, the first effort to adapt the expensive Stable Diffusion (SD) model for edge FPGA deployment. First, we apply quantization-aware training to quantize its weights to 4-bit and activations to 8-bit ($W4A8$) with negligible accuracy loss. Based on that, we propose a high-performance hybrid systolic array (hybridSA) architecture that natively executes convolution and attention operators across varying quantization bit-widths (e.g., $W4A8$ and all-8-bit $QK^{T}V$ in attention). To improve computational efficiency, hybridSA integrates diverse DSP packing techniques into hybrid weight-stationary and output-stationary dataflows optimized for convolution and attention. It also supports flexible dataflow transitions to meet the distinct demands that subsequent nonlinear operators place on its output sequence. Moreover, we observe that nonlinear operators become the new performance bottleneck once convolution and attention are accelerated, so we offload them onto the FPGA as well. To reduce the latency of each nonlinear operator, we pipeline its execution at a fine granularity. To minimize the resource utilization of nonlinear operators, we carefully balance their execution with hybridSA in a coarse-grained pipeline. Experimental results demonstrate that our low-bit ($W4A8$) SDA accelerator on the embedded AMD-Xilinx ZCU102 FPGA achieves a $97.3\times$ speedup (about 2.1 minutes per SD inference) over the original SD-v1.5 model on the ARM Cortex-A53 CPU (about 3.5 hours per SD inference). Our SDA project is open-sourced at https://github.com/Michaela1224/SDA_code.
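To make the $W4A8$ scheme concrete, below is a minimal sketch (not the paper's implementation) of symmetric per-tensor fake quantization: weights are rounded to 4-bit and activations to 8-bit signed grids, then dequantized so a matmul runs on the values the hardware would see. The helper name `fake_quantize` and the example tensors are hypothetical; in actual QAT, the rounding would be wrapped with a straight-through estimator during training.

```python
import numpy as np

def fake_quantize(x: np.ndarray, num_bits: int) -> np.ndarray:
    """Symmetric per-tensor fake quantization (hypothetical sketch):
    round to a num_bits signed-integer grid, then dequantize to float."""
    qmax = 2 ** (num_bits - 1) - 1            # 7 for 4-bit, 127 for 8-bit
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

# W4A8: 4-bit weights, 8-bit activations (random example tensors)
weights = np.random.randn(64, 64).astype(np.float32)
activations = np.random.randn(1, 64).astype(np.float32)

w_q = fake_quantize(weights, num_bits=4)
a_q = fake_quantize(activations, num_bits=8)

# The matmul then operates on quantized operands, mirroring what the
# systolic array computes with packed low-bit multiplies on DSPs.
y = a_q @ w_q.T
```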