\documentclass{article}

% Language setting
% Replace `english' with e.g. `spanish' to change the document language
\usepackage[english]{babel}
\usepackage{amsmath,amssymb}

% Set page size and margins
% Replace `letterpaper' with `a4paper' for UK/EU standard size
\usepackage[letterpaper,top=2cm,bottom=2cm,left=3cm,right=3cm,marginparwidth=1.75cm]{geometry}

% Useful packages
\usepackage{graphicx}
\usepackage[colorlinks=true, allcolors=blue]{hyperref}



\begin{document}


\begin{itemize}
    \item error term not scale invariant
    \item comment on the optimality of local step K
    \item compare with Durmus
    \item compare with Sun.
    % \item xx
\end{itemize}

\textbf{Reply to Reviewer 1}


We thank the reviewer for the valuable recognition of our contributions.


Regarding the error terms in Lemma 5.6, we acknowledge that several interpretations are possible, depending on which property is prioritized. A major reason is that the parameters of interest vary in magnitude, so certain terms must be dropped in order to streamline the analysis and concentrate on local properties. Given $H_0\geq \frac{\tau N}{m}\geq \frac{\tau}{m}$ as defined in A.4, applying Lemma B.5 (or 5.6) recursively $k$ times leads to
\begin{equation}
W^2_2(\mu_{k}, \pi)\lesssim (1-\eta m/4)^k W^2_2(\mu_{0}, \pi)+\mathcal{U},
\end{equation}
where $\mathcal{U}$ is given by
\begin{equation}
\begin{split}\label{ori_eqn}
    \mathcal{U}&=\eta d \kappa \tau \bigg(\frac{1}{\tau}\textcolor{red}{\eta (K-1)^2} L^2 H_0 + (K-1) N + \frac{\eta}{\tau} L^2  \kappa H_0+ \frac{L}{m} + \frac{\sigma^2}{L\tau} \bigg).
\end{split}
\end{equation}
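
As a sanity check on the recursion, the following sketch unrolls a generic one-step contraction. The per-step error $\mathcal{E}$ is a placeholder introduced here purely for illustration (it is not notation from the paper) and is assumed to be independent of $k$; we also assume $0<\eta m/4<1$:
\begin{equation*}
\begin{split}
W^2_2(\mu_{k+1}, \pi)\leq (1-\eta m/4)\, W^2_2(\mu_{k}, \pi)+\mathcal{E}
\ \Longrightarrow\ 
W^2_2(\mu_{k}, \pi)&\leq (1-\eta m/4)^k W^2_2(\mu_{0}, \pi)+\mathcal{E}\sum_{j=0}^{k-1}(1-\eta m/4)^j\\
&\leq (1-\eta m/4)^k W^2_2(\mu_{0}, \pi)+\frac{4\mathcal{E}}{\eta m}.
\end{split}
\end{equation*}
Under these assumptions $\mathcal{U}$ is of order $\mathcal{E}/(\eta m)$, so $\eta m\,\mathcal{U}$ is the natural per-step quantity to examine when comparing scales.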

The bound above contains both $(K-1)^2$ and $K-1$ terms. Similar findings were obtained in Theorem 1 of [1], which uses an alternative stochastic averaging scheme. Their upper bound also consists of multiple terms:
\begin{equation}
\begin{split}
W^2_2(\mu_k, \pi)&\lesssim (1-\eta m/8)^k W^2_2(\mu_0, \pi) +\eta +\frac{\eta}{\mathbb{P}_c} + \frac{\eta^2 (1-\mathbb{P}_c)}{\mathbb{P}_c^2}\big\{1+\mathbb{P}_c \big\} \\
&=(1-\eta m/8)^k W^2_2(\mu_0, \pi) +\eta \bigg(1 -\eta+\frac{1}{\mathbb{P}_c} + \textcolor{red}{\frac{\eta}{\mathbb{P}^2_c}}\bigg),
\end{split}
\end{equation}
where $\mathbb{P}_c\in(0, 1]$ is the probability that a client communicates its parameters, and $\frac{1}{\mathbb{P}_c}$ plays a role similar to our number of local steps $K$ in deterministic averaging schemes. Terms that are not of immediate interest are omitted to streamline the presentation. In particular, the upper bound \textbf{depends on both $\frac{\eta}{\mathbb{P}_c^2}$ and $\frac{1}{\mathbb{P}_c}$}, which is consistent with our result in Eq.~\eqref{ori_eqn}. %However, we acknowledge that the scale-variant upper bound is mainly an artifact in federated settings to facilitate analysis. 
% It also can be interpreted in the formulation as suggested by the reviewer given a reasonable fact that $\eta K\leq O(1)$. 



Plugging the standard step-size condition $\eta\leq \frac{1}{2L}$ (see Theorem 1) into Eq.~\eqref{ori_eqn}, we have
\begin{equation}
\begin{split}
    \eta m \mathcal{U}&= \eta^2 d L \tau \bigg(\frac{1}{\tau}\eta (K-1)^2 L^2 H_0 + (K-1) N + \frac{\eta}{\tau} L^2  \kappa H_0+ \frac{L}{m} + \frac{\sigma^2}{L\tau} \bigg)\\
    &\lesssim \eta^2 d L^2 H_0 \bigg( (K-1)^2  +   \kappa \bigg).
\end{split}
\end{equation}
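
For completeness, the bound above can be checked term by term. The sketch below uses only $\eta\leq \frac{1}{2L}$, $\kappa=L/m\geq 1$, $N\geq 1$, $H_0\geq \frac{\tau N}{m}\geq \frac{\tau}{m}$, and the fact that $K-1\leq (K-1)^2$ for integer $K\geq 1$. Absorbing the last term additionally assumes $H_0\gtrsim \sigma^2/m^2$, which we state here as an assumption rather than as a consequence of A.4:
\begin{align*}
\eta^2 d L \tau\cdot \frac{1}{\tau}\eta (K-1)^2 L^2 H_0 &= (\eta L)\,\eta^2 d L^2 H_0 (K-1)^2 \leq \frac{1}{2}\,\eta^2 d L^2 H_0 (K-1)^2,\\
\eta^2 d L \tau\cdot (K-1) N &\leq \eta^2 d L m H_0 (K-1) \leq \eta^2 d L^2 H_0 (K-1)^2,\\
\eta^2 d L \tau\cdot \frac{\eta}{\tau} L^2 \kappa H_0 &= (\eta L)\,\eta^2 d L^2 \kappa H_0 \leq \frac{1}{2}\,\eta^2 d L^2 \kappa H_0,\\
\eta^2 d L \tau\cdot \frac{L}{m} &\leq \eta^2 d L m H_0\, \kappa \leq \eta^2 d L^2 \kappa H_0,\\
\eta^2 d L \tau\cdot \frac{\sigma^2}{L\tau} &= \eta^2 d \sigma^2 \lesssim \eta^2 d m^2 H_0 \leq \eta^2 d L^2 \kappa H_0.
\end{align*}
Summing the five lines gives $\eta m\,\mathcal{U}\lesssim \eta^2 d L^2 H_0\big((K-1)^2+\kappa\big)$ under the stated assumptions.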

Under the reasonable assumption $\eta K= O(1)$, Eq.~\eqref{ori_eqn} yields the form proposed by the reviewer:
\begin{equation}
\begin{split}
    \eta m \mathcal{U}&=\eta^2 d L \tau \bigg(\frac{1}{\tau}\eta (K-1)^2 L^2 H_0 + (K-1) N + \frac{\eta}{\tau} L^2  \kappa H_0+ \frac{L}{m} + \frac{\sigma^2}{L\tau} \bigg)\\
    &\lesssim \eta^2 d L\tau \bigg( (K-1)N  +   \kappa \bigg).
\end{split}
\end{equation}

\textbf{Scale invariance in Lemma 5.6}

We introduce $H_{\rho}$ to aid in the presentation, although it may unavoidably affect the upper bound. Nonetheless, we believe that our derivation remains consistent with the existing literature.

For example, Theorem 4 in [4] (with the default temperature 1) states that
\begin{align}
    W_2(\mu_{k}, \pi) &\lesssim  \left(1-\eta m \right)^k \cdot W_2(\mu_{0}, \pi)+\kappa \sqrt{{\eta} d}  + \frac{\sigma^2 \sqrt{\eta d}}{L+\sigma \sqrt{m}}.\label{user-friendly}
\end{align}

As for our result, plugging in the definition of $H_{\rho}$ with $K=1$ and $\tau=1$, and setting the divergence term to zero ($\gamma=0$), our Theorem 5.7 shows that
\begin{align*}
    W_2(\mu_{k}, \pi) &\lesssim  \left(1-\frac{\eta m}{4}\right)^k \cdot W_2(\mu_{0}, \pi)+\kappa^{1.5}\sqrt{{\eta} d} \cdot \sqrt{m H_0} \\
    &\lesssim \left(1-\frac{\eta m}{4}\right)^k \cdot W_2(\mu_{0}, \pi)+\kappa^{1.5}\bigg(\underbrace{\sqrt{\eta d m\mathcal{D}^2}}_{\text{I}} + \underbrace{\sqrt{\eta d}+\frac{\sigma\sqrt{\eta d}}{\sqrt{m}}}_{\text{II}}\bigg),
\end{align*}
where the crucial term $\text{II}$ is consistent with Eq.~\eqref{user-friendly} in terms of scales, and the first term $\text{I}$, which comes from the initialization, is inevitable (similar to Theorem 1 in [6] or Lemma 3.2 in [7]). Moreover, our divergence term $\frac{\gamma^2}{m^2 d}$ is of the same order as $\frac{\sigma^2}{m^2}$, which further validates our result.


The scale-invariance property is a useful sanity check. However, it is worth noting that incorporating the additional Brownian motion in the context of federated learning often introduces multiple additional terms, which makes deriving a concise upper bound more challenging than what is typically observed in the optimization community.


\textbf{Comparison with a sharper analysis in [3] and discussions on the optimality of local steps}


Our Theorem 5.7 aligns with Theorem 1 in [4] in terms of dimensional dependence, as both achieve optimal results with an $\epsilon$-error in $\Omega(\frac{d}{\epsilon^2})$ iterations, similar to the findings in [3] when treating $W_2(\mu_0, \pi)$ as a constant. However, we acknowledge that our dependence on the condition number $\kappa$ is not as favorable as the one provided in Eq.~(22) of [3]. We believe that their analysis can help us refine our dependence on $\kappa$ and improve our understanding of the optimal number of local steps $K$ (since it depends on $\kappa$).
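
To make the $d/\epsilon^2$ scaling concrete, the following sketch starts from the bound $W_2(\mu_{k}, \pi) \lesssim \left(1-\frac{\eta m}{4}\right)^k W_2(\mu_{0}, \pi)+\kappa^{1.5}\sqrt{\eta d\, m H_0}$ (the case $K=1$, $\tau=1$, $\gamma=0$ of Theorem 5.7). For illustration only, it treats the constant hidden in $\lesssim$ as $1$, splits the target $\epsilon$ evenly between the two terms, and assumes $\epsilon$ is small enough that the resulting step size satisfies $\eta\leq \frac{1}{2L}$:
\begin{align*}
\kappa^{1.5}\sqrt{\eta d\, m H_0}\leq \frac{\epsilon}{2}
&\iff \eta\leq \frac{\epsilon^2}{4\kappa^3 d\, m H_0},\\
\left(1-\frac{\eta m}{4}\right)^k W_2(\mu_{0}, \pi)\leq e^{-\eta m k/4}\, W_2(\mu_{0}, \pi)\leq \frac{\epsilon}{2}
&\Longleftarrow k\geq \frac{4}{\eta m}\log\frac{2 W_2(\mu_{0}, \pi)}{\epsilon}.
\end{align*}
Taking $\eta$ at its largest admissible value gives $k= \frac{16\kappa^3 d H_0}{\epsilon^2}\log\frac{2 W_2(\mu_{0}, \pi)}{\epsilon}$, which scales as $\frac{d}{\epsilon^2}$ up to a logarithmic factor when $\kappa$, $H_0$, and $W_2(\mu_0, \pi)$ are treated as dimension-free constants. The numerical factors $4$ and $16$ are artifacts of these illustrative choices and are not taken from the paper.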

Our suggestions regarding the optimal number of local steps $K$ are subject to change, since they treat the scaled stochastic gradient variance (denoted as $\eta \mathbb{E}\big[\|\bar\theta_{k\eta}^c-\bar\theta_{k\eta}\|_2^2\big]$ in Eq.~24 of our work) as a trivial factor. We note, however, that investigating the stochastic gradient variance is often crucial in federated learning problems with large datasets.

% It is worth noting that the proposed choice on the number of local steps $K$ is only locally optimal, given that the stochastic gradient variance (scaled by $\eta$) $\eta \mathbb{E}\big[\|\bar\theta_{k\eta}^c-\bar\theta_{k\eta}\|_2^2\big]$ in Eq.24 of our work is a significant item. In addition, the choice on $K$ can be further refined considering the extensions based on variance and bias reduction techniques [1].


\textbf{Comparison with Sun et al. (2022) [2]}

Our work focuses predominantly on convex scenarios. For non-convex scenarios, we direct readers to the noteworthy study of Sun et al. (2022) [2], which assumes that the logarithmic Sobolev inequality (LSI) holds. The LSI assumption allows for multi-modal distributions and provides theoretical guarantees for more practical applications. Furthermore, their computational complexity is similar to ours: both works require $\Omega(\frac{d}{\epsilon^2})$ iterations to reach an $\epsilon$-accuracy target in terms of the 2-Wasserstein distance.

Although both [2] and [5] leverage the compression operator to reduce communication costs in federated learning, which may be less communication efficient than the local-step update, these works intriguingly lay the foundation for future studies on Bayesian federated learning in non-convex scenarios based on local-step schemes.


Thank you once again for providing insightful comments that have greatly enhanced the presentation of this paper. If you have any additional questions or require further clarification, we would be delighted to engage in further discussion with you.

% add some thing (computational complexity around our error complexity analysis in later section)


% 1. Relies on the compressed operator (they wrote: Comparing the communication complexity of Langevin-MARINA to that of FA-LD Deng et al. [2021] is more difficult because FA-LD makes communication savings by performing communication rounds only after a number T of local updates)

% 2. allows F to be nonconvex

% 3. similar computational complexity;

% \begin{table}[ht]
% 	\centering
% \begin{tabular}{ |c|c|c|c|c| }
% \hline
% & & & \\
%  Paper & Assumption & Communication Overhead & Criterion & Complexity \\ 
%  & & & \\
%  \hline
%  & & & \\
% Vono 2022 & Strong convexity & Compression & $W_2(\rho_K,\pi)< \varepsilon$ & $K = \tilde{\mathcal O} \left(\frac{d}{\varepsilon^2}\right)$ \\
% & & & \\
% \hline
% & & & \\
% Sun 2022 & {log Sobolev inequality} & Compression & $W_2(\rho_K,\pi)< \varepsilon$ & $K = \tilde{\mathcal O} \left(\frac{d}{\varepsilon^2}\right)$ \\
% & & & \\
% \hline
% & & & \\
% Plassier 2023 & {Strong convexity} & (Stochastic) Local Steps & $W_2(\rho_K,\pi)< \varepsilon$ & $K = \tilde{\mathcal O} \left(\frac{d}{\varepsilon^2}\right)$ \\
% & & & \\
% \hline
% & & & \\
% This paper & Strong convexity & Local Steps  & $W_2(\rho_K,\pi)< \varepsilon$ & $K = \tilde{\mathcal O} \left(\frac{d}{\varepsilon^2}\right)$ \\
% & & & \\
% \hline
% \end{tabular}
% \\
% \caption{to do edited: Sufficient number of iterations of our algorithm and concurrent algorithms to achieve $\varepsilon$ accuracy in 2-Wasserstein distance in dimension $d$. Strong convexity implies log Sobolev inequality.}
% \label{tab:complexity}
% \end{table}%

[1] Federated Averaging Langevin Dynamics: Toward a unified theory and new algorithms. AISTATS, 2023.

[2] Federated Learning with a Sampling Algorithm under Isoperimetry. arXiv:2206.00920v2, 2022.

[3] Analysis of Langevin Monte Carlo via Convex Optimization. JMLR, 2019.

[4] User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 2019. arXiv:1710.00095v3.

[5] QLSD: Quantised Langevin Stochastic Dynamics for Bayesian Federated Learning. AISTATS, 2022.

[6] Underdamped Langevin MCMC: A non-asymptotic analysis. COLT, 2017.

[7] Non-Convex Learning via Stochastic Gradient Langevin Dynamics: A Nonasymptotic Analysis. arXiv:1702.03849v3, 2017.

\end{document}