Exploring Non-linearity in Attention

17 Sept 2025 (modified: 28 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Attention Mechanism, Non-linearity
TL;DR: In this work, we study position-wise non-linearity and contextual non-linearity in the attention mechanism.
Abstract: The representational ability of Transformer architectures arises from two sources of non-linearity: position-wise non-linearity via feed-forward layers and contextual non-linearity through self-attention. In this work, we revisit this distinction and pose two key questions: Can self-attention itself realize position-wise non-linearity? And is contextual non-linearity truly necessary? First, we prove that by appending a fixed bias vector to the input, stacked self-attention layers can approximate deep feed-forward networks, showing that attention alone is sufficient to implement position-wise non-linearity. Second, we prove that contextual non-linearity, i.e., input-dependent attention patterns, is not indispensable: fixed or even randomly chosen patterns, when combined with a feed-forward layer, can still produce context-sensitive representations of the same token in different contexts. As an application, we prove that a two-layer attention-only Transformer can accurately predict masked tokens in masked language modeling. Both theoretical analysis and empirical studies on pre-trained models and synthetic data support our theory.
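
The sketch below is an illustrative toy example of the first claim, not the paper's construction: a fixed bias vector is appended to the input as an extra token, and a mask restricts each position to attend only to itself and that bias token, so a single self-attention layer computes a position-wise, non-linear function of each token. The width `d`, the random weights, and the masking scheme are all assumptions made for illustration.

```python
# Minimal sketch (assumed setup, not the authors' proof construction):
# appending a fixed "bias" token and letting each position attend only to
# itself and that token makes attention act as a position-wise non-linearity.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8                                   # model width (hypothetical)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
bias = torch.randn(d)                   # fixed bias vector appended to every input

def attn_with_bias_token(x):
    """x: (seq_len, d). Each position attends only to itself and the bias token."""
    n = x.shape[0]
    h = torch.cat([x, bias.unsqueeze(0)], dim=0)      # append the bias token
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / d ** 0.5
    # Mask so position i attends only to {i, bias}; its output is then a
    # purely position-wise (but non-linear, via softmax) function of x[i].
    mask = torch.full((n + 1, n + 1), float("-inf"))
    mask[torch.arange(n), torch.arange(n)] = 0.0
    mask[:, n] = 0.0
    out = F.softmax(scores + mask, dim=-1) @ v
    return out[:n]

x = torch.randn(5, d)
y = attn_with_bias_token(x)

# Position-wise: permuting the input rows permutes the output rows identically.
perm = torch.randperm(5)
assert torch.allclose(attn_with_bias_token(x[perm]), y[perm], atol=1e-5)

# Non-linear: scaling the input does not scale the output, because the softmax
# weights between a token and the fixed bias depend on the token itself.
print(torch.allclose(attn_with_bias_token(2 * x), 2 * y))  # typically False
```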
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 8517