Improving Transformer Inference Through Optimized Nonlinear Operations With Quantization-Approximation-Based Strategy

Published: 01 Jan 2025, Last Modified: 16 May 2025. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2025. License: CC BY-SA 4.0
Abstract: Transformers have recently demonstrated impressive performance across a variety of tasks in natural language processing (NLP) and computer vision (CV). However, this performance comes at the cost of substantial memory and computation overhead. Existing research focuses primarily on accelerating matrix multiplication (MatMul) through techniques such as quantization and pruning, which notably increases the proportion of inference runtime spent on nonlinear operations. Meanwhile, previous approaches designed for nonlinear operations suffer from inefficient implementations because they cannot achieve computation and memory efficiency simultaneously. In addition, these methods often require retraining or fine-tuning, incurring substantial cost and inconvenience. To overcome these problems, we propose an efficient implementation of nonlinear operations based on a quantization-approximation strategy. Through an in-depth analysis of the dataflow and data distribution of nonlinear operations, we design distinct quantization and approximation strategies tailored to each operation. Specifically, log2 quantization and power-of-two factor quantization are employed in Softmax and LayerNorm, complemented by a logarithmic-function approximation and low-precision statistic calculation. Furthermore, the proposed efficient GeLU implementation integrates a nonuniform lookup procedure with low-bit-width quantization. Experimental results demonstrate negligible accuracy drops without retraining or fine-tuning. Implemented in hardware, the design achieves $3.14\times$–$6.34\times$ energy-efficiency and $3.01\times$–$10.1\times$ area-efficiency improvements over state-of-the-art application-specific integrated circuit (ASIC) designs. In system-level evaluation, substantial speedups and energy-consumption reductions of 15% to 35% are achieved for end-to-end inference on both GPU and ASIC accelerator platforms.
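To make the log2-quantization idea for Softmax concrete, here is a minimal NumPy sketch of the general technique, not the paper's actual design: softmax outputs lie in (0, 1], so each probability can be rounded to the nearest power of two, turning the subsequent multiplication with V into bit shifts. The function names (`log2_quantize`, `softmax_log2`) and the 4-bit exponent width are illustrative assumptions.

```python
import numpy as np

def log2_quantize(p, bits=4):
    # Round each probability to the nearest power of two, 2^-k, with an
    # integer exponent k in [0, 2^bits - 1]; a power-of-two operand lets
    # the following multiplication with V be implemented as a bit shift.
    eps = 2.0 ** -(2 ** bits - 1)               # smallest representable value
    k = np.round(-np.log2(np.clip(p, eps, 1.0)))
    return 2.0 ** -np.clip(k, 0, 2 ** bits - 1)

def softmax_log2(scores):
    # Numerically stable softmax followed by log2 quantization of outputs,
    # which lie in (0, 1] and are therefore well suited to exponent coding.
    z = scores - scores.max(axis=-1, keepdims=True)
    p = np.exp(z)
    return log2_quantize(p / p.sum(axis=-1, keepdims=True))
```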
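For LayerNorm, the abstract names power-of-two factor quantization together with low-precision statistic calculation. The sketch below is one plausible reading under stated assumptions, not the paper's exact factorization: the inverse standard deviation is snapped to the nearest power of two so that normalization reduces to a shift on an integer datapath.

```python
import numpy as np

def layernorm_pot(x, gamma, beta, eps=1e-5):
    # Per-token statistics; hardware could compute these at reduced precision.
    mean = x.mean(axis=-1, keepdims=True)
    var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
    # Snap the normalization factor to the nearest power of two so the
    # division/multiplication becomes a bit shift in integer hardware.
    inv_std = 1.0 / np.sqrt(var + eps)
    inv_std_pot = 2.0 ** np.round(np.log2(inv_std))
    return gamma * (x - mean) * inv_std_pot + beta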
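Finally, the GeLU strategy combines a nonuniform lookup procedure with low-bit-width quantization. The sketch below assumes quadratically spaced breakpoints (denser near zero, where GeLU has the most curvature) and nearest-entry lookup; the table size `n_entries=64`, range `x_max=8.0`, and spacing are hypothetical choices, since the paper does not specify them here.

```python
import numpy as np

def build_gelu_lut(n_entries=64, x_max=8.0):
    # Quadratically spaced breakpoints: denser near zero, where GeLU curves
    # most, sparser toward the near-linear tails.
    t = np.linspace(-1.0, 1.0, n_entries)
    xs = np.sign(t) * t ** 2 * x_max
    # Standard tanh approximation of GeLU evaluated at the breakpoints.
    ys = 0.5 * xs * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                   * (xs + 0.044715 * xs ** 3)))
    return xs, ys

def gelu_lut(x, xs, ys):
    # Nearest-breakpoint lookup; real hardware would index the table with a
    # few high-order bits of the low-bit-quantized input instead of a search.
    idx = np.abs(np.asarray(x)[..., None] - xs).argmin(axis=-1)
    return ys[idx]
```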