\ifdefined\isarxiv
\else 
\vspace{-0mm}
\fi
\section{Related work}\label{sec:related}
\ifdefined\isarxiv
\else
\vspace{-0mm}
\fi

\paragraph{Federated Learning}
Current federated learning follows two paradigms. In the first, every client trains the model on its private data and communicates only model parameters. The second uses encryption techniques to guarantee secure communication between clients. In this paper, we focus on the first paradigm \cite{dcm+12,ss15,mmra16,mmr+17,hlsy21}.
A long list of works in the optimization literature establishes provable convergence guarantees for FedAvg-type algorithms \cite{lhy+20,ljz+21,hlsy21,kmr19,yyz19,wts+19,kkm+20}. One line of research \cite{lhy+20,kmr19,yyz19,wts+19,kkm+20} relies on standard assumptions in optimization (such as convexity, smoothness, strong convexity, and bounded gradients). The other line of work \cite{ljz+21,hlsy21} proves convergence in the regime where the model of interest is an over-parameterized neural network (also called the NTK regime \cite{jgh18}). Extensions to general partial device participation and arbitrary communication schemes have been addressed in \cite{Avdyukhin21, Haddadpour19}.


\paragraph{Scalable Monte Carlo methods} SGLD \cite{Welling11} was the first stochastic gradient Monte Carlo method to tackle the scalability issue in big data problems. Since then, variants of stochastic gradient Monte Carlo methods have been proposed to accelerate the simulations by utilizing more general Markov dynamics \cite{yian2015,completeframework,Chen14}, Hessian approximation \cite{Ahn12}, parallel tempering \cite{deng2020}, as well as higher-order numerical schemes \cite{Chen15, Li19,ccbj18,Yian_underdamped,Yian_higher,shen2019randomized}.


\paragraph{Distributed Monte Carlo methods}

Sub-posterior aggregation was initially proposed in \cite{Neiswanger13, wang13, Minsker14} to scale MCMC methods to large datasets. Other parallel MCMC algorithms \cite{Nishihara14, Ahn14_icml, chen16_distributed, Chowdhury18, Li19_v2} improve the efficiency of Monte Carlo computation in distributed or asynchronous systems. Stochastic gradient Monte Carlo methods for decentralized systems were proposed in \cite{gghz20}, and empirical studies of posterior averaging in federated learning were presented in \cite{agxr21, F-SGLD, hw+21}.

\paragraph{Concurrent Work}
Concurrently with our work, QLSD \cite{Maxime2021} also studied the convergence of SGLD in federated settings and proposed a compression operator to alleviate the communication overhead. By contrast, we follow the tradition in the FL community and reduce communication solely by conducting multiple local steps, which balances accuracy against communication cost. Other interesting Bayesian federated learning algorithms can be found in \cite{FedPop, Personalized_FL_Bayes}. Our averaging scheme is deterministic and may be limited when activating all the devices is costly; we refer interested readers to the study of federated averaging Langevin dynamics based on a probabilistic averaging scheme in \cite{Vincent_2022}.





% \paragraph{Notation}

% For any positive integer $n$, we use $[n]$ to denote the set $\{1,2,\cdots,n\}$.
% Let $N$ denote the number of clients.  For each $c \in [N] $, we use $f^c$ and $\nabla f^c$ as the loss function and gradient of the function $f^c$ in client $c$. $\nabla \tilde f^c(\cdot)$ is the \emph{unbiased} stochastic gradient of $\nabla f^c$. In addition, we denote $p_c$ as the weight of the $c$-th client such that $p_c=\frac{n_c}{\sum_{i=1}^{N} n_i}\in(0, 1)$, where $n_c>0$ is the number of data points in the $c$-th client. Let $T_{\epsilon}$ denote the number of global steps to achieve the precision $\epsilon$. Let $K$ denote the number of local steps and hence ${T_{\epsilon}}/{K}$ denotes the number of communications.
