Memory Efficient Fine-Tuning of LLMs via Forward-Only Hessian-Free Coordinate Descent

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: zeroth-order optimization, memory-efficient fine-tuning
TL;DR: This paper presents FOCUS, a memory-efficient zeroth-order BCD-Newton optimizer that fine-tunes LLMs by selectively updating layers in a forward-only manner with second-order information, reducing memory while maintaining performance.
Abstract: Fine-tuning large language models (LLMs) for downstream tasks has traditionally relied on memory-intensive optimizers built on classical backpropagation, which must store substantial model state for gradient computation. This cost has motivated memory-efficient zeroth-order optimizers that operate in a forward-only manner. However, the slower convergence of zeroth-order optimizers remains a challenge; recent work accelerates training by incorporating Hessian information, but storing even a diagonal Hessian requires memory equivalent to the model weights themselves, leading to significant memory usage. To mitigate this problem, we propose a zeroth-order block coordinate descent (BCD)-Newton optimizer whose coordinate updates adapt to second-order information: model layers are treated as separate blocks, and only a greedily selected subset is updated per training iteration, reducing memory requirements while accelerating convergence. Specifically, at each iteration an active set of layers is selected according to the block Gauss-Southwell-Diagonal rule; their weights are updated while all other layers remain fixed, and compressed diagonal Hessian information is stored and updated exclusively for the active layers. When fine-tuning foundation models from small to large sizes (OPT-1.3B, OPT-30B, and LLaMA-2-7B), our method achieves up to 40% memory reduction compared to existing Hessian-informed zeroth-order methods while matching the accuracy and memory usage of zeroth-order baselines across various tasks, offering a memory-efficient alternative for LLM fine-tuning, especially on memory-constrained devices.
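The select-then-update loop the abstract describes can be sketched in a forward-only toy form. Everything below (the quadratic stand-in loss, the layer sizes, the hyperparameters `k`, `lr`, `beta`, and the scalar curvature estimate standing in for the compressed diagonal Hessian) is an illustrative assumption, not the paper's actual implementation:

```python
import numpy as np

# Toy sketch of a zeroth-order BCD-Newton step: probe each "layer" with
# forward passes only, score layers by the block Gauss-Southwell-Diagonal
# rule, and take Newton-preconditioned steps on the top-k layers.
rng = np.random.default_rng(0)

CURVS = [1.0, 5.0, 0.1]  # per-layer curvatures of the toy quadratic loss
params = [rng.standard_normal(4) for _ in CURVS]

def loss(ps):
    # Stand-in for the LLM fine-tuning loss (one forward pass).
    return sum(float(np.sum(c * p ** 2)) for p, c in zip(ps, CURVS))

def zo_probe(ps, i, eps=1e-3):
    """Forward-only (SPSA-style) estimates of layer i's gradient and of the
    curvature along a random Rademacher direction, via three forward passes."""
    z = rng.choice([-1.0, 1.0], size=ps[i].shape)
    base = loss(ps)
    ps[i] += eps * z
    plus = loss(ps)
    ps[i] -= 2 * eps * z
    minus = loss(ps)
    ps[i] += eps * z  # restore weights
    g = (plus - minus) / (2 * eps) * z                    # gradient estimate
    h = max((plus - 2 * base + minus) / eps ** 2, 1e-6)   # curvature estimate
    return g, h

hess = [None] * len(params)  # curvature state kept only once a layer is active
k, lr, beta = 1, 0.5, 0.9    # active-set size, step size, EMA factor
init = loss(params)

for step in range(200):
    # Score every layer, then greedily keep the top-k by the block
    # Gauss-Southwell-Diagonal score ||g||^2 / h (largest expected decrease).
    probes = [zo_probe(params, i) for i in range(len(params))]
    scores = [float(np.sum(g ** 2) / h) for g, h in probes]
    for i in np.argsort(scores)[-k:]:  # active set; other layers stay fixed
        g, h = probes[i]
        hess[i] = h if hess[i] is None else beta * hess[i] + (1 - beta) * h
        params[i] -= lr * g / hess[i]  # Newton-preconditioned ZO step

print(loss(params))
```

Even with one active layer per iteration, the greedy rule keeps driving down whichever layer currently promises the largest loss decrease, so the toy loss falls well below its initial value using forward passes alone.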
Primary Area: optimization
Submission Number: 12383