EAGLE: Efficient Analytical Gradient Linear Evaluation for Enhanced Recomputation in Large Language Models
Keywords: Large Language Models; Gradient Checkpointing; Recomputation; Analytical Gradients
TL;DR: EAGLE is a novel recomputation strategy that analytically computes gradients for linear layers, reducing redundant computation and accelerating large-scale language model training.
Abstract: Training large language models requires substantial memory to store intermediate activations, often exceeding the capacity of modern accelerators. Gradient checkpointing addresses this challenge by trading computation for memory, but introduces significant overhead from the redundant forward passes performed during recomputation. In this work, we present EAGLE (Efficient Analytical Gradient Linear Evaluation), a recomputation strategy that leverages closed-form gradients for linear transformations and integrates FlashAttention's backward algorithm for attention blocks. Unlike traditional checkpointing, which uniformly replays the forward pass under autograd, EAGLE computes gradients analytically for linear layers and invokes FlashAttention's backward pass directly, while retaining standard recomputation for nonlinear operations. On production-scale models including DeepSeek-V2, DeepSeek-V3, and LLaMA3-70B (70B--694B parameters), EAGLE improves Model FLOPs Utilization by 18--33\% over Full Recompute and achieves up to $9.75\times$ module-level recomputation speedups; our analysis and experiments show that these gains come without changing training convergence.
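For intuition, here is a minimal PyTorch sketch of the closed-form linear backward the abstract describes; the function name and signature are illustrative, not taken from the paper. The point it demonstrates: the backward of a linear layer depends only on the saved input, the weight, and the incoming gradient, so the forward output never needs to be recomputed.

```python
import torch

def linear_backward(grad_out: torch.Tensor,
                    x: torch.Tensor,
                    weight: torch.Tensor,
                    has_bias: bool = True):
    """Closed-form gradients for y = x @ weight.T + bias.

    Illustrative sketch: uniform checkpointing would replay the forward
    matmul under autograd to rebuild the graph, but these gradients are
    analytic in (grad_out, x, weight) alone -- the output y is unused.
    """
    # dL/dx: propagate the incoming gradient through the weight.
    grad_x = grad_out @ weight
    # Flatten leading batch/sequence dims to accumulate weight gradients.
    go2d = grad_out.flatten(0, -2)   # (N, out_features)
    x2d = x.flatten(0, -2)           # (N, in_features)
    grad_w = go2d.T @ x2d            # dL/dW, shape (out_features, in_features)
    grad_b = go2d.sum(0) if has_bias else None  # dL/db
    return grad_x, grad_w, grad_b
```

These three matmul/reduction expressions match what autograd would produce for `torch.nn.Linear`; computing them directly simply skips rebuilding the forward graph during recomputation.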
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24660