Keywords: Attention with negative weights, Language models
Abstract: We propose a novel attention mechanism, named Cog Attention, that enables attention weights to be negative for enhanced expressiveness. The enhanced expressiveness stems from two key factors: (1) Cog Attention increases parameter flexibility. For example, unlike traditional softmax attention heads, which must use a static output-value (OV) matrix to delete or copy the inputs they attend to, Cog Attention naturally learns to use the sign of dynamic query-key (QK) inner products to represent these operations. This enables Cog Attention to perform multiple operations simultaneously within a single head, while its OV matrix can focus more on refinement. (2) Cog Attention improves the model’s robustness against representational collapse by preventing earlier tokens from being "over-squashed" into later positions. We develop Transformer-like models that use Cog Attention as their attention modules, including decoder-only models with up to 3 billion parameters for language modeling. Experiments show that models using Cog Attention outperform those employing traditional softmax attention modules. Our approach suggests a promising research direction for rethinking and breaking the entrenched constraints of traditional softmax attention, such as the requirement for non-negative weights.
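The abstract does not spell out the exact formulation, so the snippet below is only a minimal, non-authoritative sketch of attention with signed (possibly negative) weights. It assumes one common construction: the softmax is taken over the magnitudes of the query-key scores and the sign of each raw score is reapplied, so a single head can both copy (positive weight) and delete (negative weight) within the same attention map. The function name `signed_attention` and this particular normalization are illustrative assumptions, not the paper's definition of Cog Attention.

```python
import torch
import torch.nn.functional as F

def signed_attention(q, k, v):
    """Single-head attention whose weights may be negative.

    Illustrative sketch only (not the authors' implementation):
    normalize the magnitudes of the QK scores with a softmax and
    reapply their signs, so each weight carries both a strength
    and a +/- direction (copy vs. delete).
    """
    d = q.size(-1)
    # Raw query-key inner products, scaled as in standard attention.
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    # Softmax over magnitudes gives non-negative mixing strengths;
    # the sign of the raw score restores the +/- direction.
    weights = torch.sign(scores) * F.softmax(scores.abs(), dim=-1)
    # Signed mixture of the value vectors.
    return weights @ v

# Toy usage: batch of 1, sequence length 4, head dimension 8.
q = torch.randn(1, 4, 8)
k = torch.randn(1, 4, 8)
v = torch.randn(1, 4, 8)
out = signed_attention(q, k, v)
print(out.shape)  # torch.Size([1, 4, 8])
```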
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11897