Keywords: LLMs, Differential Attention, Model Fine-Tuning, Architecture Migration
Abstract: The self-attention mechanism in Transformer models is widely adopted but remains vulnerable to attention noise. Differential Transformer and its variant DEX attempt to address this issue; however, the former requires training from scratch, while the latter cannot directly mitigate noise during the attention computation itself. In this paper, we propose DAA, a novel method that both reduces attention noise and can be flexibly inserted at the fine-tuning stage. Specifically, DAA introduces learnable modules into the self-attention mechanism at the point where attention scores are computed, realizing a differential mechanism. We find that DAA offsets attention noise while introducing few parameters (less than 1\% of the total model parameters) and acts directly on the updates of the K and Q matrices, achieving effects similar to those of a Differential Transformer trained from scratch. We further compare our approach with two methods that place the differentiation at different positions: one modifies the input sequence to separately compute K, Q, or V, while the other regulates the output matrix (DEX). Experimental results show that DAA more effectively improves model performance with a small amount of fine-tuning data.
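The abstract describes inserting learnable differential-attention modules at the attention-score computation during fine-tuning. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's implementation: the class name, the low-rank adapter structure, and the learnable subtraction weight are assumptions made for clarity.

```python
# Hypothetical sketch of a differential-attention adapter added at fine-tuning time.
# Names (DifferentialAttentionAdapter, lambda_init, rank) and the exact placement
# are illustrative assumptions; the paper's DAA may differ in detail.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentialAttentionAdapter(nn.Module):
    """Adds a lightweight second Q/K projection whose attention map is
    subtracted from the frozen base attention map (differential mechanism)."""
    def __init__(self, d_model: int, d_head: int, rank: int = 8, lambda_init: float = 0.5):
        super().__init__()
        # Low-rank projections keep the added parameter count small
        # (intended to stay well under 1% of the base model's parameters).
        self.q_adapter = nn.Sequential(nn.Linear(d_model, rank, bias=False),
                                       nn.Linear(rank, d_head, bias=False))
        self.k_adapter = nn.Sequential(nn.Linear(d_model, rank, bias=False),
                                       nn.Linear(rank, d_head, bias=False))
        self.lam = nn.Parameter(torch.tensor(lambda_init))  # learnable subtraction weight

    def forward(self, x, q_base, k_base, v):
        # x:               (batch, seq, d_model) hidden states feeding the attention layer
        # q_base, k_base:  (batch, seq, d_head) projections from the frozen base model
        # v:               (batch, seq, d_head) value projections from the frozen base model
        scale = q_base.size(-1) ** -0.5
        attn_base = F.softmax(q_base @ k_base.transpose(-2, -1) * scale, dim=-1)
        # Adapter path: a second attention map intended to capture attention noise.
        q2 = self.q_adapter(x)
        k2 = self.k_adapter(x)
        attn_noise = F.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)
        # Differential attention: subtract the learned "noise" map from the base map.
        attn = attn_base - self.lam * attn_noise
        return attn @ v
```

In this reading, only the adapter projections and the scalar weight are trained during fine-tuning, while the base Q/K/V projections stay frozen; the subtraction acts on the attention scores themselves, which is what distinguishes this placement from output-side regulation such as DEX.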
Primary Area: generative models
Submission Number: 16857