Differential Gated Self-Attention

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: self-attention, noise-cancelling, differential gated attention, lateral inhibition
TL;DR: We propose Multihead Differential Gated Self-Attention (M-DGSA), which learns per-head, input-dependent gating to dynamically suppress attention noise.
Abstract: Transformers excel across a wide variety of tasks but remain susceptible to corrupted inputs, since standard self-attention treats all query-key interactions uniformly. Inspired by lateral inhibition in biological neural circuits, and building on the recent Differential Transformer's subtraction of two parallel softmax maps for noise cancellation, we propose Multihead Differential Gated Self-Attention (M-DGSA), which learns per-head, input-dependent gating to dynamically suppress attention noise. Each head splits into excitatory and inhibitory branches whose dual softmax maps are fused by a sigmoid gate predicted from the token embedding, yielding context-aware contrast enhancement. M-DGSA integrates seamlessly into existing Transformer stacks with minimal computational overhead. We evaluate on both vision and language benchmarks, demonstrating consistent robustness gains over vanilla Transformer, Vision Transformer, and Differential Transformer baselines. Our contributions are (i) a novel input-dependent gating mechanism for self-attention grounded in lateral inhibition, (ii) a principled synthesis of biological contrast enhancement and self-attention theory, and (iii) comprehensive experiments demonstrating noise resilience and cross-domain applicability.
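To make the mechanism concrete, below is a minimal PyTorch sketch of a single gated head as the abstract describes it: two parallel softmax attention maps (excitatory and inhibitory) fused by a sigmoid gate predicted from the token embedding. The class and variable names (`DGSAHead`, `q_pos`, `q_neg`, etc.) and the exact fusion rule `a_pos - g * a_neg` are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DGSAHead(nn.Module):
    """One head of differential gated self-attention (illustrative sketch)."""

    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        # Separate query/key projections for the excitatory and inhibitory branches.
        self.q_pos = nn.Linear(d_model, d_head)
        self.k_pos = nn.Linear(d_model, d_head)
        self.q_neg = nn.Linear(d_model, d_head)
        self.k_neg = nn.Linear(d_model, d_head)
        self.v = nn.Linear(d_model, d_head)
        # Gate predicted from the token embedding (assumed: one scalar per query token).
        self.gate = nn.Linear(d_model, 1)
        self.scale = d_head ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        a_pos = F.softmax(
            self.q_pos(x) @ self.k_pos(x).transpose(-2, -1) * self.scale, dim=-1
        )
        a_neg = F.softmax(
            self.q_neg(x) @ self.k_neg(x).transpose(-2, -1) * self.scale, dim=-1
        )
        g = torch.sigmoid(self.gate(x))   # (batch, seq, 1), broadcast over keys
        attn = a_pos - g * a_neg          # excitatory map minus gated inhibitory map
        return attn @ self.v(x)
```

Concatenating several such heads and applying an output projection would recover the multihead form; the key difference from the Differential Transformer is that its fixed subtraction weight λ is replaced here by the learned, token-conditioned gate g.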
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 7337