SwiftMax: Reducing Training Time for Learnable Softmax Alternative in Customized Acceleration

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Softmax, Hardware/Software Co-design, Transformers, ACAP
Abstract: Softmax's row-wise max and sum impose an $O(n)$ normalizer substep inside self-attention, creating latency and bandwidth bottlenecks on modern accelerators. We introduce \textbf{SwiftMax}, a drop-in, learnable alternative that replaces these reductions with per-layer scalars $\beta,\gamma$, removing the length-$n$ dependency in the normalizer while leaving $QK^\top$ and value mixing unchanged. SwiftMax is enabled by a \emph{layer-wise replace-and-tune} schedule that updates only $\beta,\gamma$ on top of a frozen pretrained model; initialization is guided by the output statistics of the Softmax normalizer (distributions of $z_{\max}$ and $\sum_j e^{z_j-z_{\max}}$). On BERT-base across GLUE, SwiftMax matches the Softmax baseline within 1--3 accuracy points on SST-2/MNLI/QQP; compared with approaches that retrain all parameters to learn these scalars (e.g., ConSmax-style training), SwiftMax cuts end-to-end training time by orders of magnitude (up to $2{,}250\times$ in our setting). On AMD ACAP, eliminating the row dependency enables up to $23\times$ speedup for the self-attention normalizer and substantial module-level gains, alleviating pipeline stalls and memory traffic. Taken together, SwiftMax offers a practical path to hardware-friendly attention with minimal accuracy loss and without full retraining, bridging the gap between pretrained models and custom acceleration.
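To make the length-$n$ dependency concrete, here is a minimal sketch contrasting the standard Softmax normalizer with a SwiftMax-style form. The exact functional form of SwiftMax is not given in the abstract; the `swiftmax` function below is an assumption modeled on the ConSmax-style design the abstract references ($e^{\beta z - \gamma}$ with learnable scalars), and `log_normalizer` illustrates the statistic ($z_{\max} + \log\sum_j e^{z_j - z_{\max}}$) that the abstract says guides the initialization of $\gamma$.

```python
import math

def softmax(row):
    # Standard softmax over one attention row: a length-n max and a
    # length-n sum must complete before any output element is produced.
    m = max(row)
    exps = [math.exp(z - m) for z in row]
    s = sum(exps)
    return [e / s for e in exps]

def swiftmax(row, beta, gamma):
    # Hypothetical SwiftMax-style form (assumed from the ConSmax-style
    # description): per-layer scalars beta, gamma stand in for the row
    # max and log-sum, so each element depends only on its own score.
    return [math.exp(beta * z - gamma) for z in row]

def log_normalizer(row):
    # Exact log-normalizer of one row, z_max + log(sum e^{z - z_max});
    # the distribution of this statistic over data could seed gamma.
    m = max(row)
    return m + math.log(sum(math.exp(z - m) for z in row))
```

With `beta = 1` and `gamma` set to a row's exact log-normalizer, `swiftmax` reproduces `softmax` for that row; in practice $\beta,\gamma$ are fixed per layer and tuned, trading exact normalization for elementwise, length-independent computation.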
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 10971