Softpick: No Attention Sink, No Massive Activations with Rectified Softmax

ACL ARR 2026 January Submission5070 Authors

05 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: model architectures, sparse models, quantization

Abstract: We introduce softpick, a rectified, not sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M and 1.8B parameter models demonstrate that softpick consistently achieves a 0% sink rate. Softpick transformers produce hidden states with significantly lower kurtosis and create sparse attention maps. Quantized models using softpick outperform their softmax counterparts on standard benchmarks, with a particularly pronounced advantage at lower bit precisions. Our analysis and discussion show how softpick has the potential to open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability.
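
The abstract describes softpick only as "rectified" and "not sum-to-one"; it does not give the formula. As a rough illustration of those two properties, the sketch below assumes the form relu(exp(x) - 1) / (sum(|exp(x) - 1|) + eps). That definition, the `eps` stabilizer, the shift by `m`, and the function names are assumptions for illustration, not details taken from this page.

```python
import math


def softmax(scores):
    """Standard softmax: every output is strictly positive and they sum to 1."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softpick(scores, eps=1e-6):
    """Assumed softpick form: relu(exp(x) - 1) / (sum(|exp(x) - 1|) + eps).

    Numerator and denominator are both scaled by exp(-m), with m >= 0,
    so that no exponential can overflow. This is a sketch, not the
    authors' reference implementation.
    """
    m = max(max(scores), 0.0)
    shifted = [math.exp(x - m) - math.exp(-m) for x in scores]
    denom = sum(abs(s) for s in shifted) + eps
    # Rectification: scores at or below zero get exactly zero weight.
    return [max(s, 0.0) / denom for s in shifted]


if __name__ == "__main__":
    scores = [2.0, 0.5, -1.0, -3.0]
    print("softmax :", softmax(scores))
    print("softpick:", softpick(scores))
```

Under this assumed form, negative scores receive exactly zero weight, which yields sparse attention rows, and a row may sum to less than one, so no token has to absorb leftover probability mass.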

Paper Type: Long

Research Area: Language Models

Research Area Keywords: model architectures, sparse models, quantization

Contribution Types: NLP engineering experiment

Languages Studied: English

Submission Number: 5070
