Zero-Order One-Point Gradient Estimate in Consensus-Based Distributed Stochastic Optimization

Published: 19 Nov 2024, Last Modified: 19 Nov 2024 · Accepted by TMLR · License: CC BY 4.0
Abstract: In this work, we consider a distributed multi-agent stochastic optimization problem, where each agent holds a local objective function that is smooth and strongly convex and that is subject to a stochastic process. The goal is for all agents to collaborate in finding a common solution that optimizes the sum of these local functions. Under the practical assumption that agents can only obtain noisy numerical function queries at exactly one point at a time, we consider an extension of a standard consensus-based distributed stochastic gradient (DSG) method to the bandit setting, where we do not have access to the gradient, and we introduce a zero-order (ZO) one-point estimate (1P-DSG). We analyze the convergence of this technique using stochastic approximation tools, and we prove that it \textit{converges almost surely to the optimum} despite the bias of our gradient estimate. We then study the convergence rate of our method. With constant step sizes, our method competes with its first-order (FO) counterparts by achieving a linear rate $O(\varrho^k)$ as a function of the number of iterations $k$. To the best of our knowledge, this is the first work to prove this rate in the noisy estimation setting or with one-point estimators. With vanishing step sizes, we establish a rate of $O(\frac{1}{\sqrt{k}})$ after a sufficient number of iterations $k > K_0$. This rate matches the lower bound of centralized techniques utilizing one-point estimators. We then provide a regret bound of $O(\sqrt{k})$ with vanishing step sizes. We further illustrate the usefulness of the proposed technique using numerical experiments.
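To make the idea concrete, below is a minimal sketch of a consensus step combined with a one-point zero-order gradient estimate. This is an illustration under our own assumptions (a spherically smoothed estimator $g = \frac{d}{\delta} f(x + \delta u)\, u$ with $u$ uniform on the unit sphere, and a doubly stochastic mixing matrix $W$); the paper's exact 1P-DSG estimator, projection step, and schedules may differ.

```python
import numpy as np

def one_point_zo_grad(f_noisy, x, delta, rng):
    """One-point zero-order gradient estimate:
    g = (d / delta) * f(x + delta * u) * u, with u uniform on the unit sphere.
    Uses a single noisy function query per iteration, so the estimate is biased."""
    d = x.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)              # random direction on the unit sphere
    return (d / delta) * f_noisy(x + delta * u) * u

def consensus_zo_step(F, X, W, alpha, delta, rng):
    """One consensus-based distributed step: each agent i averages its
    neighbors' iterates with weights W[i, j], then descends along its own
    one-point ZO estimate of the local objective F[i]."""
    X_new = W @ X                        # consensus (mixing) step
    for i in range(X.shape[0]):
        g = one_point_zo_grad(F[i], X[i], delta, rng)
        X_new[i] -= alpha * g            # local zero-order descent step
    return X_new
```

The smoothing radius $\delta$ trades bias against variance: the one-point estimate has bias of order $\delta$ while its variance grows as $1/\delta^2$, which is why one-point methods require careful joint schedules for the step size $\alpha$ and $\delta$.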
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=yAoavrPtBq
Changes Since Last Submission: In this revised version, where the title was changed from “Zero-Order One-Point Estimate with Distributed Stochastic Gradient Techniques” to “Zero-Order One-Point Gradient Estimate in Consensus-Based Distributed Stochastic Optimization,” we addressed the concerns raised in the decision letter by the editor and reviewers. In particular, we removed gradient tracking from the algorithm, we removed the restrictive assumption $\nabla \mathcal{F}(x^*)=0$, where $x^*$ is the optimal solution, and we updated the presentation and the proofs accordingly. The details are as follows.

We consider a general consensus-based distributed optimization framework instead of the gradient-tracking algorithm. We accordingly revise the presentation and the proofs of Lemma 3.3 and Theorems 3.2 and 3.4-3.7. In the proofs of Lemmas 3.3 and D.1 (Appendices D.2-D.3), the consensus error is no longer affected by the agents' gradient error through the auxiliary variable of the gradient-tracking algorithm, but rather through direct manipulation of the inequality. This change carries over to the proofs of Theorems 3.2 and 3.4-3.7, where the upper bound on the (expected) divergence is written as a function of the consensus error and, thus, of the gradient error. The remaining parts of the divergence are written in terms of the average of the gradients, which previously appeared through the auxiliary variable, as these two terms were equal.

We further remove the restrictive assumption $\nabla \mathcal{F}(x^*)=0$, where $x^*$ is the optimal solution, i.e., $\mathcal{F}(x^*) =\min_{x\in\mathcal{K}} \mathcal{F}(x)$. The updated assumption appears as Assumption 1.2. The proofs of convergence and of the convergence rates in Theorems 3.2, 3.4, 3.5, and 3.7 are updated accordingly: the term $z_K$ (in (33)) is handled using only the strong convexity property in inequality (34), and the same property is then used to analyze the rates in inequalities (53) and (67).

The major necessary change concerns the proof of Theorem 3.6 (Appendix F), where the regret bound is now analyzed in terms of a modified divergence, denoted $D_k'$, which measures the average error between the agents' variables and the optimum instead of the error between the average variable and the optimum. We analyze this new quantity using the properties of the matrix $W$ and the convexity of the squared norm and of the objective function to relate it to the regret. We then derive the convergence rate of $D_k'$ and use the properties of the step sizes to obtain a bound on the regret.

We hope this revised version addresses all the concerns raised by the editor and reviewers, and we look forward to their feedback. We are grateful for all the comments made by the editor and the reviewers, which have substantially improved the quality of this work. We thank you again for all the efforts invested in the previous and current submissions of our manuscript. Kind regards, The authors.
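For concreteness, one plausible reading of the two divergence notions contrasted above (this notation is ours; the manuscript's exact definitions may differ) is: the error of the average variable, $D_k = \|\bar{x}_k - x^*\|^2$ with $\bar{x}_k = \frac{1}{n}\sum_{i=1}^{n} x_{i,k}$, versus the average error of the agents' variables, $D_k' = \frac{1}{n}\sum_{i=1}^{n} \|x_{i,k} - x^*\|^2$. By convexity of the squared norm, $D_k \le D_k'$, so bounding $D_k'$ is the stronger requirement and is what permits the per-agent regret analysis.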
Assigned Action Editor: ~Yunwen_Lei1
Submission Number: 3183