Efficient Fine-Tuning of Large Language Models with Zeroth-Order Model Parallelism

04 May 2026 (modified: 06 May 2026) · Under review for TMLR · CC BY 4.0
Abstract: Model parallelism (MP) is a widely adopted paradigm for scaling large language model (LLM) training across multiple nodes. Yet existing methods rely mainly on first-order optimization, which suffers from two key bottlenecks: high communication overhead due to frequent transmission of activations and gradients, and substantial memory consumption caused by caching these intermediate states. Zeroth-order (ZO) optimization offers a compelling alternative by eliminating explicit gradient computation and storage, naturally reducing communication and memory costs. Despite these advantages, ZO methods remain largely unexplored in the context of MP for LLM fine-tuning. In this work, we first investigate activation sparsity patterns induced by common activation functions (e.g., ReLU, GELU, SwiGLU) during LLM fine-tuning. Motivated by these observations, we propose SparQ, a ZO-based MP framework that exploits quantization-induced activation sparsity to reduce memory footprint and communication overhead. SparQ consists of three key components: (1) leveraging the gradient-free nature of ZO optimization to eliminate gradient computation and caching; (2) applying activation quantization to induce sparsity that enables efficient sparse encoding; and (3) strategically placing split layers at sparsity-rich regions and transmitting activations in sparse form, significantly reducing communication cost with minimal impact on model performance. We theoretically establish that SparQ achieves a sublinear convergence rate for non-convex objectives. Extensive experiments show that SparQ reduces GPU memory usage by more than 3× and communication cost by more than 50% compared to state-of-the-art MP baselines, while maintaining comparable LLM fine-tuning performance across multiple tasks.
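To make the mechanism described in the abstract concrete, the sketch below combines a MeZO-style two-point zeroth-order update (forward passes only, with the perturbation direction regenerated from a seed rather than stored) with activation quantization and sparse encoding at a model-parallel split point. This is a minimal sketch under assumed design choices, not the authors' SparQ implementation; all function names, bit-widths, and hyperparameters (quantize_activations, sparse_encode, zo_step, bits, eps, lr) are illustrative.

```python
import math
import torch

def quantize_activations(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Uniform symmetric quantization; small activations round to exactly zero, inducing sparsity."""
    scale = x.abs().max() / (2 ** (bits - 1) - 1) + 1e-12
    return torch.round(x / scale) * scale

def sparse_encode(x: torch.Tensor):
    """Pack a mostly-zero activation tensor as (indices, values, shape) before sending it."""
    flat = x.flatten()
    idx = flat.nonzero(as_tuple=False).squeeze(-1)
    return idx, flat[idx], tuple(x.shape)

def sparse_decode(idx, vals, shape):
    """Rebuild the dense activation tensor on the receiving node."""
    flat = torch.zeros(math.prod(shape), dtype=vals.dtype, device=vals.device)
    flat[idx] = vals
    return flat.reshape(shape)

@torch.no_grad()
def _perturb(params, scale, seed: int):
    """Regenerate the same random direction from a seed, so it never has to be stored."""
    gen = torch.Generator(device=params[0].device).manual_seed(seed)
    for p in params:
        p.add_(scale * torch.randn(p.shape, generator=gen, device=p.device).to(p.dtype))

@torch.no_grad()
def zo_step(params, loss_fn, lr: float = 1e-4, eps: float = 1e-3, seed: int = 0):
    """Two-point zeroth-order update: two forward passes, no activation or gradient caching."""
    _perturb(params, +eps, seed)
    loss_plus = loss_fn()
    _perturb(params, -2 * eps, seed)
    loss_minus = loss_fn()
    _perturb(params, +eps, seed)                 # restore the original parameters
    grad_scale = (loss_plus - loss_minus) / (2 * eps)
    _perturb(params, -lr * grad_scale, seed)     # SGD-style step along the sampled direction
    return (loss_plus + loss_minus) / 2

# At the model-parallel split point (hypothetical usage):
#   q = quantize_activations(hidden_states)      # many entries become exactly zero
#   payload = sparse_encode(q)                   # transmit only nonzero indices + values
#   hidden_states = sparse_decode(*payload)      # next node reconstructs and continues
```

Regenerating the perturbation from the seed keeps optimizer memory close to inference level, and because quantization pushes small activations to exactly zero, only nonzero index-value pairs need to cross the node boundary, which is the source of the memory and communication savings the abstract claims.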
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Aaron_Klein1
Submission Number: 8751