TL;DR: We identify a clear sharpness disparity across transformer blocks and introduce a novel Blockwise Learning Rate (LR) strategy that speeds up language model pre-training by up to 2x.
Abstract: Transformers have become the cornerstone of modern AI. Unlike traditional architectures, transformers exhibit a distinctive characteristic: diverse types of building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feed-forward networks, work collaboratively. Understanding the disparities and interactions among these blocks is therefore important.
In this paper, we uncover a clear **sharpness disparity** across these blocks, which intriguingly emerges early in training and persists throughout the training process.
Building on this insight, we propose a novel **Blockwise Learning Rate (LR)** strategy to accelerate large language model (LLM) pre-training. Specifically, by integrating Blockwise LR into AdamW, we consistently achieve lower terminal loss and a nearly $2\times$ speedup compared to vanilla AdamW. This improvement is demonstrated across GPT-2 and LLaMA models, with model sizes ranging from 0.12B to 1.1B, on datasets including OpenWebText and MiniPile.
Finally, we incorporate Blockwise LR into Adam-mini (Zhang et al., 2024), a recently proposed memory-efficient variant of Adam, achieving a combined $2\times$ speedup and $2\times$ memory savings. These results underscore the potential of leveraging the sharpness disparity principle to improve LLM training.
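As a rough illustration of how a blockwise LR could be wired into AdamW, here is a minimal PyTorch sketch using per-group learning rates. The block partition mirrors the block types named in the abstract (embedding, normalization, attention, feed-forward); the model, the name-matching rules, and the LR multipliers are illustrative placeholders, not the configuration or ratios used in the paper.

```python
# Minimal sketch: per-block-type learning rates via AdamW parameter groups.
import torch
import torch.nn as nn

# Toy transformer-style model; stands in for GPT-2 / LLaMA.
model = nn.ModuleDict({
    "embed": nn.Embedding(50257, 768),
    "block": nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    "ln_f": nn.LayerNorm(768),
})

base_lr = 3e-4  # typical AdamW base LR at GPT-2 scale

# Hypothetical per-block-type LR multipliers (replace with tuned values).
lr_scale = {"embed": 1.0, "norm": 1.0, "attn": 1.0, "mlp": 1.0}

def block_type(name: str) -> str:
    """Map a parameter name to its block type by simple name matching."""
    if "embed" in name:
        return "embed"
    if "norm" in name or name.startswith("ln") or ".ln" in name:
        return "norm"
    if "attn" in name:
        return "attn"
    return "mlp"

# Collect parameters into one group per block type.
groups = {k: [] for k in lr_scale}
for name, param in model.named_parameters():
    groups[block_type(name)].append(param)

# One AdamW parameter group per block type, each with its own LR.
optimizer = torch.optim.AdamW(
    [{"params": ps, "lr": base_lr * lr_scale[k]} for k, ps in groups.items() if ps],
    betas=(0.9, 0.95),
    weight_decay=0.1,
)
```

In practice, the per-block multipliers would be chosen according to the observed sharpness disparity; the grouping mechanism above is standard PyTorch and carries over unchanged to other optimizers such as Adam-mini.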
Lay Summary: Modern AI systems such as ChatGPT are powered by transformer models, which are made up of different types of blocks working together. In this study, we found that these blocks exhibit a distinct optimization property called *sharpness*, which, roughly speaking, measures how quickly they can learn. Based on this observation, we propose **Blockwise Learning Rate (LR)**, which adjusts how quickly each block learns rather than treating them all the same. Our method trains popular models such as GPT-2 and LLaMA much faster and more efficiently, nearly halving the training time and shedding light on cheaper, more accessible AI development.
Link To Code: https://github.com/Wongboo/BlockwiseLearningRate
Primary Area: Deep Learning->Algorithms
Keywords: Sharpness, Optimization, Transformer, LLM pre-training
Submission Number: 8788