Efficient SoftMax Approximation for Deep Neural Networks with Attention Mechanism

21 May 2021 (modified: 08 Sept 2024) · NeurIPS 2021 Submitted · Readers: Everyone
Keywords: DNN, Attention Mechanism, Softmax, Approximation, Quantization, Inference, Edge Devices
TL;DR: Methods for efficient softmax approximation in DNNs with attention mechanisms, enabling inference on edge devices without a hardware divider
Abstract: There has been rapid progress in custom hardware (HW) for accelerating the inference speed of deep neural networks (DNNs). Previously, the Softmax layer was not a main concern for DNN-accelerating HW, because its share of the computation is relatively small in multi-layer perceptrons or convolutional neural networks. However, as attention mechanisms are now widely used in modern DNNs, a cost-efficient implementation of the Softmax layer is becoming very important. In this paper, we propose two methods to approximate the softmax computation, both based on LookUp Tables (LUTs). The required LUT size is quite small (about 700 bytes) because the ranges of the numerators and denominators of softmax are stable when normalization is applied to the input (i.e., the logits). We have validated the proposed technique on different AI tasks (object detection, machine translation, speech recognition, semantic equivalence) and DNN models (DETR, Transformer, BERT) over a variety of benchmarks (COCO17, WMT14, WMT17, GLUE). We show that an 8-bit approximation yields an acceptable accuracy loss ($<1.0\%$).
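The sketch below illustrates the general idea described in the abstract: after subtracting the maximum logit, the exp() numerators fall in a bounded range and can be read from a small table, and the division by the denominator can likewise be replaced by a reciprocal table. This is a minimal, float-level illustration only; the table sizes, clipping range, and indexing scheme here are assumptions for demonstration, not the paper's exact design.

```python
import numpy as np

# Illustrative parameters (not the paper's configuration).
EXP_LUT_BITS = 8      # index width of the exp() table
RECIP_LUT_BITS = 8    # index width of the reciprocal table
CLIP_MIN = -16.0      # normalized logits below this contribute ~0

# exp LUT: after max-subtraction, logits lie in [CLIP_MIN, 0],
# so exp() values lie in (0, 1] and can be tabulated once.
exp_lut = np.exp(np.linspace(CLIP_MIN, 0.0, 2 ** EXP_LUT_BITS))

def lut_softmax(logits):
    """Approximate softmax using small lookup tables instead of exp() and a divider."""
    x = np.asarray(logits, dtype=np.float32)
    x = np.clip(x - x.max(), CLIP_MIN, 0.0)  # normalization keeps value ranges stable
    # Map normalized logits to exp-LUT indices.
    idx = np.round((x - CLIP_MIN) / (0.0 - CLIP_MIN) * (2 ** EXP_LUT_BITS - 1)).astype(int)
    num = exp_lut[idx]                        # numerators from the exp LUT
    denom = num.sum()                         # lies in [1, len(x)] by construction
    # Reciprocal LUT over the possible denominator range replaces the division.
    d_grid = np.linspace(1.0, len(x), 2 ** RECIP_LUT_BITS)
    recip_lut = 1.0 / d_grid
    d_idx = np.argmin(np.abs(d_grid - denom))
    return num * recip_lut[d_idx]

print(lut_softmax([2.0, 1.0, 0.1]))  # approximately sums to 1
```

In a fixed-point HW realization the tables would hold quantized (e.g. 8-bit) entries and be indexed directly from the quantized logits, but the structure above, normalize, look up numerators, look up a reciprocal, multiply, is the divider-free pattern the abstract describes.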
Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.
Supplementary Material: zip
Community Implementations: [5 code implementations](https://www.catalyzex.com/paper/efficient-softmax-approximation-for-deep/code)