Keywords: zeroth-order optimization, LLM, communication efficiency, memory efficiency, model parallelism, activation compression
Abstract: Model parallelism (MP) has emerged as a promising paradigm for distributed large language model (LLM) training across multiple computing nodes. Yet, almost all existing work on MP focuses on first-order methods, which face two persistent challenges: high communication costs from transmitting activations and gradients, and substantial memory overhead from caching them. Zeroth-order (ZO) methods, which avoid gradient computation and storage, can naturally alleviate both memory and communication bottlenecks, but they remain largely unexplored in MP for LLM fine-tuning. In this work, we propose ***SparQ***, a ZO MP framework with **Sp**lit layer **a**llocation info**r**med by **Q**uantization-induced activation sparsity, designed to reduce memory and communication costs. *SparQ* builds on three key components: (1) leveraging the gradient-free nature of ZO optimization to eliminate gradient storage and transmission, significantly reducing the memory and communication demands incurred by gradients; (2) applying quantization to induce activation sparsity that can be encoded with sparse representations; (3) strategically placing split layers at activation-sparse regions and using sparse representations to lower the communication cost of activations with almost no loss in model quality. Theoretically, *SparQ* achieves a sublinear convergence rate in non-convex settings, matching that of centralized ZO methods. Empirically, *SparQ* reduces GPU memory usage by over 3× and communication cost by $50\%$+ compared to state-of-the-art baselines, while maintaining comparable model performance.
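The abstract's three components can be illustrated with a minimal sketch. The snippet below is not the authors' implementation; it shows the general ideas under simple assumptions: a two-point SPSA-style ZO gradient estimate (forward passes only, no gradients stored or transmitted), uniform quantization that rounds small activations to zero (the toy activation vector with a few large outliers mimics the heavy-tailed activations common in LLMs), and a sparse (index, value) encoding of the quantized activations that would cross the split-layer boundary. All function names and the 4-bit setting are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def zo_grad_estimate(loss_fn, theta, eps=1e-3):
    """Two-point zeroth-order (SPSA-style) gradient estimate.

    Uses only forward evaluations of loss_fn, so no gradient is
    ever computed, cached, or communicated between nodes.
    """
    u = rng.standard_normal(theta.shape)          # random perturbation direction
    delta = loss_fn(theta + eps * u) - loss_fn(theta - eps * u)
    return (delta / (2 * eps)) * u                # scalar projected back onto u

def quantize(acts, n_bits=4):
    """Uniform symmetric quantization. Entries smaller than half a
    quantization step collapse to exactly zero, inducing sparsity."""
    scale = np.abs(acts).max() / (2 ** (n_bits - 1) - 1)
    return np.round(acts / scale) * scale

def sparse_encode(q_acts):
    """Encode only the nonzero entries as (index, value) pairs,
    which is what would be sent across the split layer."""
    idx = np.flatnonzero(q_acts)
    return idx, q_acts[idx]

# Toy activations: mostly small values plus a few large outliers.
acts = rng.standard_normal(1024)
acts[:8] *= 20.0

q = quantize(acts, n_bits=4)
idx, vals = sparse_encode(q)
ratio = len(idx) / acts.size   # fraction of entries actually transmitted
```

Because the outliers set the quantization scale, most small activations round to zero, so `ratio` is far below 1 and the sparse encoding transmits only a small fraction of the original tensor.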
Primary Area: foundation or frontier models, including LLMs
Submission Number: 568