\documentclass{article}

% Language setting
% Replace `english' with e.g. `spanish' to change the document language
\usepackage[english]{babel}
\usepackage{amsmath,amssymb}

% Set page size and margins
% Replace `letterpaper' with `a4paper' for UK/EU standard size
\usepackage[letterpaper,top=2cm,bottom=2cm,left=3cm,right=3cm,marginparwidth=1.75cm]{geometry}

% Useful packages
\usepackage{bm}
\usepackage{graphicx}
\usepackage[colorlinks=true, allcolors=blue]{hyperref}



\begin{document}


The paper \emph{On the Convergence of FedAvg on Non-IID Data} uses the following notation:

\begin{equation*}
\begin{split}
    \bar{\bm{g}}_t&=\sum_{k=1}^N p_k \nabla F_k(\bm{w}_t^k)\\
    \bm{g}_t&=\sum_{k=1}^N p_k \nabla F_k(\bm{w}_t^k,\xi_t^k)
\end{split}
\end{equation*}


Derive everything clearly 


Notation from the ``User-friendly guarantees'' paper:
\begin{align*}
    \bar\theta_t&=\bar\theta_0-\int_0^t \nabla f(\bar\theta_s)\,\mathrm{d}s + \sqrt{2}\, W_t,\\
    \bm{\nu}_{k+1}&=\bm{\nu}_k-h \bm{Y}_k +\sqrt{2 h}\bm{\xi}_{k+1},
\end{align*}
where $\bm{Y}_k=\nabla f(\bm{\nu}_k)+\Gamma_k$.



We subtract the continuous, exact Langevin diffusion from the federated SGLD iterates.
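
To make this subtraction concrete, here is a minimal sketch for a single step of size $h$. It assumes the standard sign convention $\mathrm{d}\bar\theta_t=-\nabla f(\bar\theta_t)\,\mathrm{d}t+\sqrt{2}\,\mathrm{d}W_t$ and a synchronous coupling $\sqrt{2h}\,\bm{\xi}_{k+1}=\sqrt{2}\,(W_{(k+1)h}-W_{kh})$, so that the Brownian terms cancel. Both are assumptions made for illustration, not statements taken from the proof.
\begin{align*}
    \bm{\nu}_{k+1}-\bar\theta_{(k+1)h}
    &=\bm{\nu}_k-\bar\theta_{kh}-h\bm{Y}_k+\int_{kh}^{(k+1)h}\nabla f(\bar\theta_s)\,\mathrm{d}s\\
    &=\underbrace{\bm{\nu}_k-\bar\theta_{kh}-h\big(\nabla f(\bm{\nu}_k)-\nabla f(\bar\theta_{kh})\big)}_{\text{contraction}}
    -\underbrace{h\,\Gamma_k}_{\text{gradient error}}
    +\underbrace{\int_{kh}^{(k+1)h}\big(\nabla f(\bar\theta_s)-\nabla f(\bar\theta_{kh})\big)\,\mathrm{d}s}_{\text{discretization error}}.
\end{align*}
If $f$ is $m$-strongly convex and $L$-smooth and $h\le 2/(m+L)$, the first term has norm at most $(1-mh)\,\|\bm{\nu}_k-\bar\theta_{kh}\|$, which is where a geometric factor of this type typically comes from.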

We appreciate the suggestions, but there are many proof frameworks, and we believe both lead to the same conclusion.

Our continuous Langevin diffusion can be viewed as a federated algorithm whose synchronization interval satisfies $\Delta t\rightarrow 0$.


\section{Reply to reviewer 2}

We sincerely thank the reviewer for raising valid concerns regarding the continuous diffusion and for providing valuable suggestions to address them.

\paragraph{Why $\bar\theta_t$ could converge to $\pi$ and the notation ${\nabla} f({\theta}_t)=\sum p_c \nabla f^c(\theta_t^c)$ is misleading.}

It is worth noting that $\bar\theta_t$ follows a standard Langevin diffusion, which naturally converges to $\pi$. A federated averaging (FA) Langevin diffusion synchronizes parameters every $\Delta t=K\eta$; as discussed in Appendix A.1.1, the standard Langevin diffusion can be viewed as an FA Langevin diffusion whose synchronization interval satisfies $\Delta t\rightarrow 0$, which implies $\theta=\theta^c$ in that limit. %Hence ${\nabla} f({\theta})=\sum p_c \nabla f^c(\theta)=\sum p_c \nabla f^c(\theta^c)$ is a valid notation.

However, we acknowledge that our notation $\bm{\nabla} f(\bm{\theta}_k)=\sum p_c \nabla f^c(\theta_k^c)$ in Eq.~(5) may be misleading. To resolve this, the revised version adopts the new notation $\bm{Z}_k:=\sum p_c \nabla f^c(\theta_k^c)$, following Eq.~(13) of [1] and Appendix A.1 of [2].
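
As a brief illustration of why the new notation is cleaner, consider the following sketch. It assumes $f=\sum_c p_c f^c$ with weights $p_c\ge 0$, $\theta_k=\sum_c p_c\theta_k^c$, and that each $f^c$ is $L$-smooth; these are illustrative assumptions, and the symbol $L$ is introduced here. At a synchronization step ($k \bmod K=0$) all local parameters coincide, $\theta_k^c=\theta_k$, and therefore
\begin{equation*}
    \bm{Z}_k=\sum_c p_c \nabla f^c(\theta_k)=\nabla f(\theta_k).
\end{equation*}
Between synchronization steps the two quantities differ, and the gap is controlled by the local divergence:
\begin{equation*}
    \big\|\bm{Z}_k-\nabla f(\theta_k)\big\|
    =\Big\|\sum_c p_c\big(\nabla f^c(\theta_k^c)-\nabla f^c(\theta_k)\big)\Big\|
    \le L\sum_c p_c\,\|\theta_k^c-\theta_k\|.
\end{equation*}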


[1] User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv:1710.00095v3.

[2] On the Convergence of FedAvg on Non-IID Data. arXiv:1907.02189v4.

\paragraph{The result of Theorem~5.7 should hold, however, the provided proof seems invalid.}

We appreciate your suggestions on how to fix the issue. We believe that several proof techniques are possible and that our proof is also correct. In Lemma B.5 we subtract Eq.~(19), the federated Langevin dynamics that synchronizes every $\Delta t=K\eta$, from the standard Langevin diffusion in Eq.~(22), which can be interpreted as a federated algorithm with synchronization interval $\Delta t\rightarrow 0$. Since $\theta_k$ is not accessible when $k \bmod K \neq 0$, we then establish a crucial contraction property in Lemma B.1 to handle the divergence that arises when $\theta_k\neq \theta_k^c$.

We are grateful for your careful attention to the correctness of our work, and we remain confident in our proof. We kindly invite you to revisit the three-page proof of Lemmas B.5 and B.1 and to point out any step that appears to be incorrect.



\section{Reply to reviewer 3}

\paragraph{It is also not clear what is the main challenge the author encountered when studying this new problem.}

In recent years, there has been extensive research into the convergence analysis of Langevin dynamics using both exact and stochastic gradients [1,2,3,4]. However, unlike classical centralized learning methods, which are not encumbered by communication concerns, federated learning integrates training data from diverse users while minimizing interactions with the central server to address privacy and communication issues. This presents unique challenges in implementing distributed numerical schemes that rely on infrequent communication. Although our Eq.~(5) closely resembles the standard SGLD algorithm, the parameter $\theta_k$ becomes inaccessible and $\theta_k^c\neq \theta_k$ when $k \bmod K\neq 0$. Consequently, it becomes crucial to quantify the divergence between $\theta_k$ and $\theta^c_k$, which motivates our exploration of the trade-off between communication, accuracy, and privacy. Similar non-trivial distributed extensions have been investigated in [5].


[1] User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stoch. Proc. Appl., 2019. arXiv:1710.00095v3.

[2] Analysis of Langevin Monte Carlo via Convex Optimization. JMLR, 2019.

[3] Theoretical guarantees for approximate sampling from a smooth and log-concave density. J. R. Stat. Soc. B, 2017.

[4] Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis. COLT'17.

[5] Decentralized Stochastic Gradient Langevin Dynamics and Hamiltonian Monte Carlo. JMLR'22.






\paragraph{Whether the results are tight or not and it will be nice to have SOTA SGLD convergence rate results (even not in FL setting)}




If we consider the standard centralized setting with $K=1$ and set $\tau=1$, $\gamma=0$, Theorem 5.7 yields
\begin{align*}
    W_2(\mu_{k}, \pi) &\lesssim  \left(1-\frac{\eta m}{4}\right)^k \cdot W_2(\mu_{0}, \pi)+\kappa^{1.5}\sqrt{{\eta} d},
\end{align*}
which achieves an $\epsilon$ error in $W_2$ within $\widetilde{O}(d/\epsilon^2)$ iterations (hiding logarithmic factors) and is consistent with the best known results reported in Table 1 of [2]. Comparable rates have also been established in [1, 2, 3, 5, 6, 7, 8]. As in [1], our dependence on the condition number $\kappa$ is not optimized (see [2]), and improving it is left for future work.
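
For concreteness, the iteration count can be read off from the bound above as follows. This is a sketch: $C$ denotes a common absolute constant hidden in $\lesssim$ (a symbol introduced here), and $m$ and $\kappa$ are treated as fixed. Requiring each term to be at most $\epsilon/2$ gives
\begin{align*}
    C\kappa^{1.5}\sqrt{\eta d}\le \frac{\epsilon}{2}
    &\iff \eta\le \frac{\epsilon^2}{4C^2\kappa^{3} d},\\
    C\left(1-\frac{\eta m}{4}\right)^k W_2(\mu_0,\pi)\le C e^{-\eta m k/4}\, W_2(\mu_0,\pi)\le \frac{\epsilon}{2}
    &\impliedby k\ge \frac{4}{\eta m}\log\frac{2C\,W_2(\mu_0,\pi)}{\epsilon}.
\end{align*}
Taking $\eta=\epsilon^2/(4C^2\kappa^3 d)$ (and assuming $\eta m\le 4$) yields $k\ge \frac{16C^2\kappa^3 d}{m\,\epsilon^2}\log\frac{2C\,W_2(\mu_0,\pi)}{\epsilon}$, that is, an iteration count of order $d/\epsilon^2$ up to logarithmic factors.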

% Although it is appealing to recover the optimal/ SOTA results of SGLD when we fix $K=1$, it is also worth mentioning that maintaining the optimality in distributed settings is not trivial, e.g. the decentralized SGHMC algorithm also only achieves $\epsilon$-error within $\Omega(\frac{1}{\epsilon^2})$ iterations (see remark 42 of [5]).


[6] Federated Averaging Langevin Dynamics: Toward a unified theory and new algorithms. AISTATS'23.

[7] Federated Learning with a Sampling Algorithm under Isoperimetry. arXiv:2206.00920v2. 2022.

[8] QLSD: Quantised Langevin stochastic dynamics for Bayesian federated learning. 2022.





\paragraph{More practical settings in which SGLD is desirable with FL?}

Our paper primarily focuses on the theoretical analysis of the FA-LD algorithm. While interesting experimental results based on a similar algorithm have been reported in [1], our main contribution is the first non-asymptotic convergence analysis of FA-LD for simulating strongly log-concave distributions on non-i.i.d.\ data. We also investigate the trade-off between data privacy and accuracy through our differential privacy guarantees.

In addition to our theoretical findings, our empirical study on synthetic data confirms the validity of our theory, particularly concerning the optimal choice of local steps and other relevant properties. Furthermore, we conduct real-world experiments on the MNIST and Fashion-MNIST datasets to demonstrate the scalability of FA-LD. We aim to extend FA-LD to more practical scenarios in future work.

[1] Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms. In ICLR, 2021.

\paragraph{The motivation for studying SGLD in FL setting is not well supported.}

In the era of big data, data privacy has become a growing concern for users. Federated learning addresses this issue by keeping training data on local clients: no user data is shared, and only gradient information is used to train a consensus model. Additionally, there is an increasing demand for theoretically guaranteed uncertainty estimation and global optimization, which leads us to adopt the SGLD sampling algorithm.

However, our theoretical analysis reveals that SGLD is not communication-efficient in federated learning. Instead, we find that FA-LD based on local updates, with a synchronization frequency that depends on the condition number, is an effective approach for alleviating communication overhead while providing theoretical guarantees for statistical inference.


\paragraph{A statement in 5.3.3 is confusing and (11) is the same with (7)}

Thank you for bringing these formatting issues to our attention; we have addressed them in the revised version. To highlight the changes, corrections related to typos and the discussions of challenges and motivation are marked in purple. Text addressing the tightness question raised by both Reviewers Viny and 64eW is marked in red.

\end{document}