Keywords: spectral conditioning, attention
TL;DR: We propose spectral conditioning of attention layers to improve Jacobian conditioning, leading to more stable and efficient optimization with negligible computational overhead and consistent gains across diverse transformer architectures.
Abstract: We present a theoretical analysis of the Jacobian of an attention block within a transformer, showing that it is governed by the query, key, and value projections that define the attention mechanism. Leveraging this insight, we introduce a method that systematically alters the spectral properties of each attention layer to reduce the Jacobian's condition number, thereby improving the overall conditioning of the attention layers within a transformer network. We empirically show that this improved Jacobian conditioning translates to enhanced performance in practice. Our approach is simple, broadly applicable, and can be easily integrated as a drop-in replacement for a wide range of existing attention mechanisms. We validate its effectiveness across diverse transformer architectures and tasks, demonstrating consistent performance gains.
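For intuition only, the sketch below shows one possible way to control the spectral properties of an attention layer's query, key, and value projections: applying spectral normalization so that each projection's largest singular value is bounded. This is an illustrative assumption, not the submission's actual algorithm; the class name `SpectrallyConditionedAttention` and the choice of spectral normalization are hypothetical stand-ins for the method described in the abstract.

```python
# Minimal sketch (assumption, not the paper's method): condition the spectra of
# the Q/K/V projections in a multi-head attention layer via spectral
# normalization, which rescales each weight matrix by its top singular value.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm


class SpectrallyConditionedAttention(nn.Module):
    """Multi-head self-attention with spectrally normalized projections (hypothetical)."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # Constrain the top singular value of each projection to ~1.
        self.q_proj = spectral_norm(nn.Linear(embed_dim, embed_dim))
        self.k_proj = spectral_norm(nn.Linear(embed_dim, embed_dim))
        self.v_proj = spectral_norm(nn.Linear(embed_dim, embed_dim))
        self.out_proj = spectral_norm(nn.Linear(embed_dim, embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        # Project and split into heads: (batch, heads, tokens, head_dim).
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        # Standard scaled dot-product attention on the conditioned projections.
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.out_proj(out)


if __name__ == "__main__":
    layer = SpectrallyConditionedAttention(embed_dim=64, num_heads=4)
    x = torch.randn(2, 16, 64)
    print(layer(x).shape)  # torch.Size([2, 16, 64])
```

Because the module keeps the standard attention interface, it can be swapped in wherever an existing attention block is used, which matches the abstract's claim that the method acts as a drop-in replacement.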
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 21071