Abstract—The self-attention mechanism sets transformer-based large language models (LLMs) apart from convolutional and recurrent neural networks. Despite the performance improvement, achieving real-time LLM inference on silicon is challenging due to the extensive use of Softmax in self-attention. Beyond its non-linearity, the low arithmetic intensity of Softmax greatly reduces processing parallelism, which becomes the bottleneck especially when dealing with longer contexts. To address this challenge, we propose Constant Softmax (ConSmax), a software-hardware co-design that serves as an efficient Softmax alternative. ConSmax employs differentiable normalization parameters to remove the maximum searching and denominator summation in Softmax, allowing massive parallelization while still performing the critical tasks of Softmax. In addition, a scalable ConSmax hardware design utilizing a bitwidth-split look-up table (LUT) produces lossless non-linear operation and supports mixed-precision computing, further facilitating efficient LLM inference. Experimental results show that ConSmax achieves a minuscule power consumption of 0.2 mW and an area of 0.0008 mm² at a 1250-MHz working frequency in 16-nm CMOS technology. Compared to state-of-the-art Softmax hardware, ConSmax delivers 3.35× power and 2.75× area savings with comparable accuracy on a GPT-2 model and the WikiText-103 dataset.
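To make the parallelization argument concrete, the sketch below shows one way a ConSmax-style layer could be expressed in PyTorch, based only on the abstract's description: learnable, differentiable normalization parameters stand in for the per-row maximum search and the denominator summation of Softmax. The class and parameter names (ConSmaxSketch, beta, gamma) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a ConSmax-style normalization, assuming learnable scalars
# replace the max subtraction and the denominator sum of Softmax. Names and
# parameterization are assumptions for illustration, not the paper's design.
import torch
import torch.nn as nn

class ConSmaxSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))   # stands in for max(x)
        self.gamma = nn.Parameter(torch.ones(1))   # stands in for sum(exp(x - max(x)))

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # Element-wise only: no row-wide reduction is required, unlike Softmax,
        # so every attention score can be normalized independently in parallel.
        return torch.exp(scores - self.beta) / self.gamma

# Usage: normalize attention scores of shape (batch, heads, seq, seq).
scores = torch.randn(2, 4, 8, 8)
weights = ConSmaxSketch()(scores)   # same shape; rows no longer sum to exactly 1
```

Because the row-wise reductions are removed, the outputs are not guaranteed to sum to one; in this reading, the learnable parameters are trained with the rest of the model to compensate, which is what makes the operation both differentiable and parallel-friendly.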