Towards Multiplier-Free Transformers with Stochastic Attention

ICLR 2026 Conference Submission22490 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: efficient attention, Monte Carlo attention, transformer inference, energy-efficiency, sampling, multiplier-free, memory bandwidth, KV-cache, edge devices, large language models, hardware-aware ML
TL;DR: SANTA is an unbiased post-softmax estimator that replaces the value-stage matmul with a sampled gather-add plus bit-shift, with compute O(n_queries·S·d_k) (linear in the number of queries at prefill, linear in the sample budget S at decode) and sparse, index-based reads, a step toward multiplier-free LLM inference.
Abstract: In standard attention, a substantial fraction of compute comes from multiplying softmax weights by high-precision value vectors — even in ternary models such as BitNet, which remove multipliers elsewhere. We present Stochastic Additive No-mulT Attention (SANTA), a drop-in inference-time replacement that eliminates these value-stage multiplications. For each query, SANTA samples from the post-softmax distribution, gathers and sums selected values, and applies a single bit-shift normalization, with no expensive multipliers on the value path. SANTA’s compute scales as $O(n_{queries} \cdot S \cdot d_k)$: linear in the number of queries during prefill and linear in the sample budget $S$ during decode, while exhibiting sparse, index-based memory access. SANTA is an unbiased Monte Carlo estimator of dense attention and is orthogonal to upstream efficiency techniques (ternary quantization, low-rank kernels, sparsity, pruning). Combined with existing 1-bit/ternary quantizers, SANTA moves Transformers toward fully multiplier-free, energy-efficient inference.
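The abstract describes the estimator's mechanics (sample from the post-softmax distribution, gather and sum the selected value vectors, normalize by the sample budget). The sketch below illustrates that idea in NumPy under stated assumptions; it is not the authors' implementation, and the names (`santa_attention`, the sample budget `S`) are illustrative only.

```python
# Illustrative sketch of a post-softmax sampling estimator in the spirit of SANTA.
# Assumption: normalization by S (a power of two) stands in for the bit-shift step;
# no multiplies touch the value vectors on the sampled path.
import numpy as np

def dense_attention(Q, K, V):
    """Reference dense attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ V

def santa_attention(Q, K, V, S=64, seed=None):
    """Unbiased Monte Carlo estimate of dense attention (illustrative).

    For each query: draw S key indices from the post-softmax distribution,
    gather the corresponding value rows, sum them, and divide by S
    (a bit-shift when S is a power of two).
    """
    rng = np.random.default_rng(seed)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    out = np.empty((Q.shape[0], V.shape[1]))
    for i, p in enumerate(probs):                  # one query at a time
        idx = rng.choice(len(p), size=S, p=p)      # sample from post-softmax dist.
        out[i] = V[idx].sum(axis=0) / S            # gather-add, then normalize
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = rng.standard_normal((3, 128, 16))    # toy shapes: 128 tokens, d_k = d_v = 16
    exact = dense_attention(Q, K, V)
    approx = santa_attention(Q, K, V, S=256, seed=1)
    print("mean abs error:", np.abs(exact - approx).mean())
```

Because each sampled index is drawn with probability equal to its softmax weight, the sample mean of the gathered values matches dense attention in expectation, which is the unbiasedness property claimed in the abstract; the error of this toy estimate shrinks as the sample budget S grows.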
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22490