Keywords: LLMs, Differential Attention, Model Fine-Tuning, Architecture Migration
Abstract: The self-attention mechanism in Transformer models is widely adopted but remains vulnerable to attention noise. Differential Transformer and its variant DEX attempt to address this issue; however, the former requires training from scratch, while the latter cannot directly mitigate noise during the attention computation itself. In this paper, we propose DAA, a novel method that both reduces attention noise and can be flexibly inserted at the fine-tuning stage. Specifically, DAA introduces learnable modules into the self-attention mechanism at the point where attention scores are computed, realizing a differential mechanism. We find that DAA offsets attention noise while introducing few parameters (less than 1\% of the total model parameters) and acts directly on the updates of the K and Q matrices, achieving effects similar to those of a Differential Transformer trained from scratch. We further compare our approach with two methods that place the differentiation at different positions: one modifies the input sequence to separately compute K, Q, or V, while the other regulates the output matrix (DEX). Experimental results show that DAA more effectively improves model performance with a small amount of fine-tuning data.
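The abstract describes inserting learnable differential-attention modules at the attention-score computation during fine-tuning. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's implementation: the class name, the low-rank adapter structure, and the learnable subtraction weight are assumptions made for clarity.

```python
# Hypothetical sketch of a differential-attention adapter added at fine-tuning time.
# Names (DifferentialAttentionAdapter, lambda_init, rank) and the exact placement
# are illustrative assumptions; the paper's DAA may differ in detail.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentialAttentionAdapter(nn.Module):
    """Adds a lightweight second Q/K projection whose attention map is
    subtracted from the frozen base attention map (differential mechanism)."""
    def __init__(self, d_model: int, d_head: int, rank: int = 8, lambda_init: float = 0.5):
        super().__init__()
        # Low-rank projections keep the added parameter count small
        # (intended to stay well under 1% of the base model's parameters).
        self.q_adapter = nn.Sequential(nn.Linear(d_model, rank, bias=False),
                                       nn.Linear(rank, d_head, bias=False))
        self.k_adapter = nn.Sequential(nn.Linear(d_model, rank, bias=False),
                                       nn.Linear(rank, d_head, bias=False))
        self.lam = nn.Parameter(torch.tensor(lambda_init))  # learnable subtraction weight

    def forward(self, x, q_base, k_base, v):
        # x:               (batch, seq, d_model) hidden states feeding the attention layer
        # q_base, k_base:  (batch, seq, d_head) projections from the frozen base model
        # v:               (batch, seq, d_head) value projections from the frozen base model
        scale = q_base.size(-1) ** -0.5
        attn_base = F.softmax(q_base @ k_base.transpose(-2, -1) * scale, dim=-1)
        # Adapter path: a second attention map intended to capture attention noise.
        q2 = self.q_adapter(x)
        k2 = self.k_adapter(x)
        attn_noise = F.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)
        # Differential attention: subtract the learned "noise" map from the base map.
        attn = attn_base - self.lam * attn_noise
        return attn @ v
```

In this reading, only the adapter projections and the scalar weight are trained during fine-tuning, while the base Q/K/V projections stay frozen; the subtraction acts on the attention scores themselves, which is what distinguishes this placement from output-side regulation such as DEX.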
Primary Area: generative models
Submission Number: 16857