Abstract—The self-attention mechanism sets transformer-based large language models (LLMs) apart from convolutional and recurrent neural networks. Despite the performance improvement, achieving real-time LLM inference on silicon is challenging due to the extensive use of Softmax in self-attention. Beyond its non-linearity, the low arithmetic intensity of Softmax greatly reduces processing parallelism, which becomes the bottleneck especially when dealing with longer contexts. To address this challenge, we propose Constant Softmax (ConSmax), a software-hardware co-design that serves as an efficient Softmax alternative. ConSmax employs differentiable normalization parameters to remove the maximum searching and denominator summation in Softmax, allowing massive parallelization while still performing the critical tasks of Softmax. In addition, a scalable ConSmax hardware design utilizing a bitwidth-split look-up table (LUT) produces lossless non-linear operation and supports mixed-precision computing, further facilitating efficient LLM inference. Experimental results show that ConSmax achieves a minuscule power consumption of 0.2 mW and an area of 0.0008 mm² at a 1250-MHz working frequency in 16-nm CMOS technology. Compared to state-of-the-art Softmax hardware, ConSmax delivers 3.35× power and 2.75× area savings with comparable accuracy on a GPT-2 model and the WikiText-103 dataset.
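To make the parallelization argument concrete, the sketch below shows one way a ConSmax-style layer could be expressed in PyTorch, based only on the abstract's description: learnable, differentiable normalization parameters stand in for the per-row maximum search and the denominator summation of Softmax. The class and parameter names (ConSmaxSketch, beta, gamma) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a ConSmax-style normalization, assuming learnable scalars
# replace the max subtraction and the denominator sum of Softmax. Names and
# parameterization are assumptions for illustration, not the paper's design.
import torch
import torch.nn as nn

class ConSmaxSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))   # stands in for max(x)
        self.gamma = nn.Parameter(torch.ones(1))   # stands in for sum(exp(x - max(x)))

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # Element-wise only: no row-wide reduction is required, unlike Softmax,
        # so every attention score can be normalized independently in parallel.
        return torch.exp(scores - self.beta) / self.gamma

# Usage: normalize attention scores of shape (batch, heads, seq, seq).
scores = torch.randn(2, 4, 8, 8)
weights = ConSmaxSketch()(scores)   # same shape; rows no longer sum to exactly 1
```

Because the row-wise reductions are removed, the outputs are not guaranteed to sum to one; in this reading, the learnable parameters are trained with the rest of the model to compensate, which is what makes the operation both differentiable and parallel-friendly.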