LASER: Attention with Exponential Transformation

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We identify an issue with gradients backpropagating through the standard attention mechanism in large language models and propose LASER to alleviate it.
Abstract: Transformers have had a tremendous impact on several sequence-related tasks, largely due to their ability to retrieve from any part of the sequence via softmax-based dot-product attention. This mechanism plays a crucial role in the Transformer's performance. We analyze the gradients backpropagated through the softmax operation in the attention mechanism and observe that these gradients can often be small. This poor gradient signal can lead to inefficient learning of the parameters preceding the attention operations. To this end, we introduce a new attention mechanism called LASER, which we analytically show admits a larger gradient signal. We show that LASER attention can be implemented by making small modifications to existing attention implementations. We conduct experiments on autoregressive large language models (LLMs) with up to 7.7 billion parameters, observing an average improvement of up to 1.44% over standard attention on downstream evaluations and a 1.65% improvement on finetuning. Additionally, LASER demonstrates generalization improvements across a variety of tasks (vision, text, and speech): Vision Transformer (ViT) on ImageNet, Conformer on LibriSpeech speech-to-text, and BERT with 2.2 billion parameters.
Lay Summary: We identified a key bottleneck in the attention mechanism used by transformers, which weakens the backpropagation signal and makes training inefficient. Our solution, LASER, applies a simple exponential transformation to the representations before the attention step, which strengthens the gradient signal. This method requires only minimal code changes and results in consistent performance improvements across text, image, and speech models.
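The abstract and lay summary do not spell out the exact formulation, but a minimal sketch of the described idea, assuming LASER attends over an exponentially transformed value matrix and takes a logarithm of the result (roughly log(softmax(QK^T/sqrt(d)) exp(V)), computed stably in log-space), might look as follows. The function names and shapes are illustrative and not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(q, k, v):
    """Baseline scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def laser_attention_sketch(q, k, v):
    """Illustrative LASER-style attention (assumed form, not the paper's code):
    attend over exp(V) and return the log of the result, shifting V by its
    per-dimension maximum for numerical stability."""
    d = q.shape[-1]
    w = softmax(q @ k.T / np.sqrt(d))         # (T_q, T_k) attention weights
    v_max = v.max(axis=0, keepdims=True)      # (1, d) shift for stability
    # log(sum_j w_ij * exp(v_jm)) = v_max_m + log(sum_j w_ij * exp(v_jm - v_max_m))
    return v_max + np.log(w @ np.exp(v - v_max))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(5, 8)) for _ in range(3))
    print(standard_attention(q, k, v).shape)      # (5, 8)
    print(laser_attention_sketch(q, k, v).shape)  # (5, 8)
```

Since the attention weights sum to one over keys and v - v_max <= 0, the argument of the final log stays in (0, 1], avoiding overflow; this is consistent with the claim that the method requires only small changes on top of an existing attention implementation.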
Primary Area: Deep Learning->Attention Mechanisms
Keywords: Large language modeling, deep learning, transformer, Conformer, ViT
Submission Number: 12201