A Low-Latency and Scalable Vector Engine with Operation Fusion for Transformers

Mincheol Cha, Keehyuk Lee, Xuan Truong Nguyen, Hyuk-Jae Lee

Published: 01 Jan 2024, Last Modified: 13 Nov 2024, AICAS 2024, CC BY-SA 4.0

Abstract: Transformer models are now widely deployed for AI services in data centers. A notable deployment challenge, however, is their heavy use of vector operations such as layer normalization (LayerNorm) and Softmax, which generally perform sub-optimally on general-purpose CPUs and GPUs because of their low arithmetic intensity and long data dependencies. To address this problem, this study presents a low-latency, scalable FPGA-based engine for accelerating vector operations. Specifically, we build a dedicated circuit that efficiently executes both element-wise operations and compound fused operations. More importantly, our engine calculates the input mean and variance in parallel, which significantly reduces the instruction count for computing LayerNorm and Softmax. Experimental results show that our design reduces latency by 50% for Softmax and 40% for LayerNorm compared with the state-of-the-art (SOTA) design, while consuming only an additional 20% DSPs, 27% BRAMs, 18% FFs, and 39% LUTs.
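To illustrate why computing the mean and variance together can shorten LayerNorm, the following is a minimal software sketch. It is not the paper's circuit. It assumes the identity Var(x) = E[x^2] - (E[x])^2, which lets a single pass accumulate both statistics instead of a second pass that depends on the finished mean. The function names `layernorm_two_pass` and `layernorm_fused_stats` are hypothetical, and the learned scale and shift parameters are omitted for brevity.

```python
import math


def layernorm_two_pass(x, eps=1e-5):
    """Baseline: the variance pass cannot start until the mean is known."""
    n = len(x)
    mean = sum(x) / n                          # pass 1: mean
    var = sum((v - mean) ** 2 for v in x) / n  # pass 2: depends on mean
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def layernorm_fused_stats(x, eps=1e-5):
    """Sketch: accumulate sum and sum of squares in the same pass.

    Assumption (not from the paper): Var(x) = E[x^2] - (E[x])^2.
    The two accumulators are independent, so hardware could update
    them side by side.
    """
    n = len(x)
    total = 0.0
    total_sq = 0.0
    for v in x:                                # single pass over the input
        total += v
        total_sq += v * v
    mean = total / n
    var = total_sq / n - mean * mean
    return [(v - mean) / math.sqrt(var + eps) for v in x]


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    print(layernorm_two_pass(data))
    print(layernorm_fused_stats(data))
```

Both functions should agree to within floating-point rounding on well-scaled inputs. The single-pass identity can lose precision when the mean is large relative to the spread of the values, so a real design would need to choose its number format with that in mind.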