EAGLE: Efficient Analytical Gradient Linear Evaluation for Enhanced Recomputation in Large Language Models
Keywords: Large Language Models; Gradient Checkpointing; Recomputation; Analytical Gradients
TL;DR: EAGLE is a novel recomputation strategy that analytically computes gradients for linear layers, reducing redundant computation and accelerating large-scale language model training.
Abstract: Training large language models requires substantial memory to store intermediate activations, often exceeding the capacity of modern accelerators. Gradient checkpointing addresses this challenge by trading computation for memory, but introduces significant overhead from the redundant forward passes performed during recomputation. In this work, we present EAGLE (Efficient Analytical Gradient Linear Evaluation), a recomputation strategy that leverages closed-form gradients for linear transformations and integrates FlashAttention's backward algorithm for attention blocks. Unlike traditional checkpointing, which uniformly replays the forward pass under autograd, EAGLE computes gradients analytically for linear layers and invokes FlashAttention's backward pass directly, while retaining standard recomputation for nonlinear operations. On production-scale models including DeepSeek-V2, DeepSeek-V3, and LLaMA3-70B (70B--694B parameters), EAGLE improves Model FLOPs Utilization by 18--33\% over Full Recompute and achieves up to $9.75\times$ module-level recomputation speedups; our analysis and experiments show that these gains come without changing training convergence.
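For intuition, here is a minimal PyTorch sketch of the closed-form linear backward the abstract describes; the function name and signature are illustrative, not taken from the paper. The point it demonstrates: the backward of a linear layer depends only on the saved input, the weight, and the incoming gradient, so the forward output never needs to be recomputed.

```python
import torch

def linear_backward(grad_out: torch.Tensor,
                    x: torch.Tensor,
                    weight: torch.Tensor,
                    has_bias: bool = True):
    """Closed-form gradients for y = x @ weight.T + bias.

    Illustrative sketch: uniform checkpointing would replay the forward
    matmul under autograd to rebuild the graph, but these gradients are
    analytic in (grad_out, x, weight) alone -- the output y is unused.
    """
    # dL/dx: propagate the incoming gradient through the weight.
    grad_x = grad_out @ weight
    # Flatten leading batch/sequence dims to accumulate weight gradients.
    go2d = grad_out.flatten(0, -2)   # (N, out_features)
    x2d = x.flatten(0, -2)           # (N, in_features)
    grad_w = go2d.T @ x2d            # dL/dW, shape (out_features, in_features)
    grad_b = go2d.sum(0) if has_bias else None  # dL/db
    return grad_x, grad_w, grad_b
```

These three matmul/reduction expressions match what autograd would produce for `torch.nn.Linear`; computing them directly simply skips rebuilding the forward graph during recomputation.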
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24660