Accelerating Block Coordinate Descent for LLM Finetuning via Landscape Expansion

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Memory efficiency; optimization
Abstract: Finetuning large language models (LLMs) is a resource-intensive task for researchers in academia, with memory constraints posing a key bottleneck. A classic optimization method, block coordinate descent (BCD), significantly reduces memory cost by segmenting the trainable parameters into multiple blocks and optimizing one active block at a time while freezing the others. However, we identify that blindly applying BCD to train LLMs can be inefficient for two reasons. First, optimizing only the active block still requires backpropagating through multiple deeper yet inactive blocks, resulting in wasted computation. Second, the frozen blocks, when far from optimal, can narrow the optimization landscape and misguide the training of the active block. To address these issues simultaneously, we propose integrating BCD with *landscape expansion*, which unfreezes the inactive blocks and updates them in a cost-efficient manner during the same backpropagation used to update the active block. Experiments on 8B and 70B models demonstrate that our proposed method surpasses memory-efficient baselines and matches Adam's downstream performance while requiring only 24 GB of memory for the 8B model and 300 GB for the 70B model.
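
The sketch below illustrates the general idea described in the abstract: one backward pass yields gradients for every block, the active block receives a full (stateful) update, and the inactive blocks receive a cheap, stateless update instead of being frozen. This is a minimal illustration under assumptions, not the paper's implementation; in particular, the per-layer blocking and the choice of a sign-based stateless rule for the inactive blocks are placeholders for whatever cost-efficient update the method actually uses.

```python
# Minimal PyTorch sketch: BCD step with a full update on the active block
# and a cheap, stateless update on the otherwise-frozen blocks.
# The blocking scheme and the cheap update rule are illustrative assumptions.
import torch


def bcd_step(blocks, active_idx, loss_fn, batch, adam_state,
             lr_active=1e-4, lr_cheap=1e-5):
    """One step: `blocks` is a list of parameter lists (e.g., one per layer);
    `adam_state` holds optimizer state only for the active block's parameters."""
    # Backprop already traverses the deeper, inactive blocks, so their
    # gradients come at little extra cost from the same backward pass.
    for blk in blocks:
        for p in blk:
            p.requires_grad_(True)

    loss = loss_fn(batch)
    loss.backward()

    with torch.no_grad():
        for i, blk in enumerate(blocks):
            for p in blk:
                if p.grad is None:
                    continue
                if i == active_idx:
                    # Full Adam-style update (bias correction omitted for brevity);
                    # optimizer state is kept only for this block.
                    s = adam_state.setdefault(p, {"m": torch.zeros_like(p),
                                                  "v": torch.zeros_like(p)})
                    s["m"].mul_(0.9).add_(p.grad, alpha=0.1)
                    s["v"].mul_(0.999).addcmul_(p.grad, p.grad, value=0.001)
                    p.add_(s["m"] / (s["v"].sqrt() + 1e-8), alpha=-lr_active)
                else:
                    # Cheap, stateless update on inactive blocks (an assumed
                    # stand-in for the paper's landscape-expansion update).
                    p.add_(p.grad.sign(), alpha=-lr_cheap)
                p.grad = None
    return loss.item()
```

In a training loop one would cycle `active_idx` over the blocks every few steps and reset `adam_state` when switching, so that optimizer state is held for only one block at a time, which is where BCD's memory saving comes from.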
Supplementary Material: zip
Primary Area: Optimization (e.g., convex and non-convex, stochastic, robust)
Submission Number: 5797