MARINA-P: Superior Performance in Non-smooth Federated Optimization with Adaptive Stepsizes

Igor Sokolov, Peter Richtárik

Published: 21 Dec 2024, Last Modified: 02 Feb 2025, arXiv, CC BY 4.0

Abstract: Non-smooth communication-efficient federated optimization is crucial for many practical machine learning applications, yet it remains largely unexplored theoretically. Recent advancements in communication-efficient methods have primarily focused on smooth convex and non-convex regimes, leaving a significant gap in our understanding of the more challenging non-smooth convex setting. Additionally, existing federated optimization literature often overlooks the importance of efficient server-to-worker communication (downlink), focusing primarily on worker-to-server communication (uplink). In this paper, we consider a setup where uplink communication costs are negligible and focus on optimizing downlink communication by improving the efficiency of recent state-of-the-art downlink schemes such as EF21-P [Gruntkowska et al., 2023] and MARINA-P [Gruntkowska et al., 2024] in the non-smooth convex setting. We address these gaps through several key contributions. First, we extend the non-smooth convex theory of EF21-P [Anonymous, 2024], originally developed for single-node scenarios, to the distributed setting. Second, we extend existing results for MARINA-P to the non-smooth convex setting. For both algorithms, we prove an optimal $\mathcal{O}(1/\sqrt{T})$ convergence rate under standard assumptions and establish communication complexity bounds that match those of classical subgradient methods. Furthermore, we provide theoretical guarantees for both EF21-P and MARINA-P under constant, decreasing, and adaptive (Polyak-type) stepsizes. Our experiments demonstrate that MARINA-P, when used with correlated compressors, outperforms other methods not only in smooth non-convex settings (as originally shown by Gruntkowska et al. [2024]) but also in non-smooth convex regimes. 
To the best of our knowledge, this work presents the first theoretical results for distributed non-smooth optimization incorporating server-to-worker compression, along with comprehensive analysis for various stepsize schemes.
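
For readers unfamiliar with the three stepsize families named in the abstract, the following is a minimal sketch of the classical single-node subgradient method, which the abstract uses as its baseline for comparison. It is not EF21-P or MARINA-P and involves no compression or distribution. The objective $f(x) = \|Ax - b\|_1$, the function name `subgradient_method`, the scaling constant `c`, and the assumption that the optimal value `f_star` is known (required by the Polyak stepsize) are all illustrative choices not taken from the paper.

```python
import numpy as np


def subgradient_method(A, b, x0, T, stepsize, f_star=0.0, c=1.0):
    """Classical subgradient method on f(x) = ||Ax - b||_1 (non-smooth, convex).

    stepsize: "constant"   -> gamma = c / sqrt(T)
              "decreasing" -> gamma_t = c / sqrt(t + 1)
              "polyak"     -> gamma_t = (f(x_t) - f_star) / ||g_t||^2
    Returns the last iterate and the best function value seen.
    """
    def f(x):
        return np.abs(A @ x - b).sum()

    x = np.asarray(x0, dtype=float).copy()
    best = f(x)
    for t in range(T):
        # A^T sign(Ax - b) is a valid subgradient of the l1 residual at x.
        g = A.T @ np.sign(A @ x - b)
        g_norm_sq = float(g @ g)
        if g_norm_sq == 0.0:
            break  # zero subgradient: x is already a minimizer
        if stepsize == "constant":
            gamma = c / np.sqrt(T)
        elif stepsize == "decreasing":
            gamma = c / np.sqrt(t + 1)
        elif stepsize == "polyak":
            gamma = (f(x) - f_star) / g_norm_sq
        else:
            raise ValueError(f"unknown stepsize rule: {stepsize}")
        x = x - gamma * g
        best = min(best, f(x))
    return x, best


# Hypothetical usage: a consistent system b = A x_true, so f_star = 0 is known.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
x_true = rng.standard_normal(5)
b = A @ x_true
for rule in ("constant", "decreasing", "polyak"):
    _, best = subgradient_method(A, b, np.zeros(5), T=200, stepsize=rule)
    print(rule, best)
```

The Polyak rule needs no tuning constant but does need $f^\star$ (or an estimate of it), which is why the toy problem is built so that $f^\star = 0$ by construction.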