
\section{Simulations}
\label{simulation_local_step}

For each client $c\in[N]$, where $N=50$, we first sample $\theta_c\in\mathbb{R}^2$ from a Gaussian distribution $N(0,\alpha I_2)$. Then, we sample $n_c$ points from  $N(\theta_c,\Sigma)$, where 
$
\Sigma=\left[\begin{matrix}
5 & -2\\
-2 & 1
\end{matrix}\right]
$. Denote each data point by $x_{c,i}$ for $i\in[n_c]$. Thus, the negative log-likelihood of a single point is $l(\theta;x_{c,i})=\frac{1}{2}(\theta-x_{c,i})^{\top}\Sigma^{-1}(\theta-x_{c,i})+\log(2\pi |\Sigma|^{\frac{1}{2}})$, and we have $\ell^c(\theta)=\sum_{i=1}^{n_c}l(\theta;x_{c,i})$, $f(\theta)=\sum_{c=1}^N\ell^c(\theta)$, and $f^c(\theta)=\frac{1}{p_c}\ell^c(\theta)$. We fix the temperature $\tau=1$. The target distribution is $N(u,\frac{1}{n}\Sigma)$ with $u=\frac{1}{n}\sum_{c=1}^N\sum_{i=1}^{n_c}x_{c,i}$ and $n=\sum_{c=1}^N n_c$. We run Algorithms \ref{alg:alg_main_paper_text_independent_noise}, \ref{alg:alg_main_paper_text_different_seeds}, and \ref{alg:alg_main_text_partial_main} and repeat each experiment $R=300$ times. At the $k$-th communication round, we obtain a set of $R$ simulated parameters $\{\theta_{k,j}\}_{j=1}^R$, where $\theta_{k,j}$ denotes the parameter at the $k$-th round in the $j$-th independent run. The underlying distribution $\mu_k$ at round $k$ is approximated by a Gaussian distribution with empirical mean ${u}_{k}=\frac{1}{R}\sum_{j=1}^R\theta_{k,j}$ and empirical covariance matrix ${\Sigma}_{k}=\frac{1}{R-1}\sum_{j=1}^R (\theta_{k,j}-{u}_k)(\theta_{k,j}-{u}_k)^{\top}$. 
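
As a short check of the stated target distribution, assume that the target density is proportional to $\exp(-f(\theta)/\tau)$, the usual convention for Langevin dynamics, and let $n$ denote the total number of data points. Completing the square in $f$ gives
\[
f(\theta)=\frac{n}{2}(\theta-u)^{\top}\Sigma^{-1}(\theta-u)+C,
\]
where $C$ collects the terms that do not depend on $\theta$, and the cross term vanishes because $\sum_{c=1}^N\sum_{i=1}^{n_c}(x_{c,i}-u)=0$. With $\tau=1$, this yields
\[
\exp(-f(\theta))\propto\exp\Big(-\frac{1}{2}(\theta-u)^{\top}\big(\tfrac{1}{n}\Sigma\big)^{-1}(\theta-u)\Big),
\]
which is the density of $N(u,\frac{1}{n}\Sigma)$ up to normalization.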

 \begin{figure*}[htbp]
  \vspace{-0.1in}
    \centering
    \subfigure[Study of $K$]{
    \begin{minipage}[t]{0.19\linewidth}
    \centering
    \label{fig:optimalK}
    \includegraphics[width=1.1in]{figures/simulation/partial_devices/optimalK2.pdf}
    \end{minipage}%
    }%
    \subfigure[Study of $\gamma$]{
    \begin{minipage}[t]{0.19\linewidth}
    \centering
    \label{fig:alpha}
    \includegraphics[width=1.1in]{figures/simulation/partial_devices/alpha_trace2.pdf}
    \end{minipage}%
    }%
    \subfigure[Study of $\rho$]{
    \begin{minipage}[t]{0.19\linewidth}
    \centering
    \label{fig:rho}
    \includegraphics[width=1.1in]{figures/simulation/partial_devices/rho2.pdf}
    \end{minipage}%
    }%
    \subfigure[True density]{
    \begin{minipage}[t]{0.19\linewidth}
    \centering
    \label{fig:true_density}
    \includegraphics[width=1.1in]{figures/simulation/partial_devices/alpha0_true_density4.pdf}
    \end{minipage}%
    }%
    \subfigure[Empirical density]{
    \begin{minipage}[t]{0.19\linewidth}
    \centering
    \label{fig:empirical_density}
    \includegraphics[width=1.1in]{figures/simulation/partial_devices/alpha0_empirical_density4.pdf}
    \end{minipage}%
    }%
  \vskip -0.1in
  \caption{Convergence of FA-LD based on full devices. In Figure \ref{fig:optimalK},  points may coincide.}%; e.g., the points of $\gamma=1\times 10^{8}$ and $\gamma=4\times 10^{11}$ coincide at $K=3000$.}
  \label{figure:full_device}
  \vspace{-0.13in}
\end{figure*}

\begin{figure*}[htbp]
  \vspace{-0.08in}
    \centering
    \subfigure[\scriptsize{Full devices: $S=50$}]{
    \begin{minipage}[t]{0.19\linewidth}
    \centering
    \label{full_device_baseline}
    \includegraphics[width=1.1in]{figures/simulation/partial_devices/FA-LD_50_50.pdf}
    \end{minipage}%
    }%
    \subfigure[\scriptsize{Scheme I: $S=40$}]{
    \begin{minipage}[t]{0.19\linewidth}
    \centering
    \label{partial_device_s1_40}
    \includegraphics[width=1.1in]{figures/simulation/partial_devices/FA-LD_40_50_S1.pdf}
    \end{minipage}%
    }%
    \subfigure[\scriptsize{Scheme II: $S=40$}]{
    \begin{minipage}[t]{0.19\linewidth}
    \centering
    \label{partial_device_s2_40}
    \includegraphics[width=1.1in]{figures/simulation/partial_devices/FA-LD_40_50_S2.pdf}
    \end{minipage}%
    }%
    \subfigure[\scriptsize{Scheme I: $S=30$}]{
    \begin{minipage}[t]{0.19\linewidth}
    \centering
    \label{partial_device_s1_30}
    \includegraphics[width=1.1in]{figures/simulation/partial_devices/FA-LD_30_50_S1.pdf}
    \end{minipage}%
    }%
    \subfigure[\scriptsize{Scheme II: $S=30$}]{
    \begin{minipage}[t]{0.19\linewidth}
    \centering
    \label{partial_device_s2_30}
    \includegraphics[width=1.1in]{figures/simulation/partial_devices/FA-LD_30_50_S2.pdf}
    \end{minipage}%
    }%
%   \vskip -0.1in
  \caption {Convergence of FA-LD based on different device-sampling schemes. }%The full device updates adopt $S=50$ devices; the partial device settings choose $S=40$ and 30 devices, respectively.}
  \label{figure:partial_device}
%   \vspace{-0.13in}
\end{figure*}


\paragraph{Optimal local steps} We study the choice of the number of local steps $K$ for Algorithm \ref{alg:alg_main_paper_text_independent_noise} under different values of $\alpha$ (defined at the beginning of this section), which correspond to different levels of data heterogeneity as measured by $\gamma$. We choose $\alpha=0, 1, 10, 100, 1000$; the corresponding values of $\gamma$ are approximately $1\times 10^{8}$, $4\times 10^{11}$, $4\times 10^{12}$, $4\times 10^{13}$, and $4\times 10^{14}$, respectively. We fix $\eta=10^{-7}$. We evaluate the (log) number of communication rounds needed to reach the accuracy $\epsilon=10^{-3}$ and denote it by $T_{\epsilon}$. As shown in Figure \ref{fig:optimalK}, a small $K$ requires an excessive number of communication rounds; by contrast, a large $K$ introduces a large bias, which in turn also requires more communication. The optimal $K$ that minimizes communication is around 3000 %that achieves the minimal communication rounds under different $\gamma$ 
and \emph{the communication cost can be reduced by a factor of up to 30}. %As $\gamma$ increases, the value of the optimal $K$ also increases naturally. 


\vspace{-2mm}
\paragraph{Data heterogeneity and correlated noise} We study the impact of $\gamma$ on the convergence of Algorithm \ref{alg:alg_main_paper_text_independent_noise} for $\gamma\in\{1\times 10^{8}, 4\times 10^{11}, 4\times 10^{12}, 4\times 10^{13}, 4\times 10^{14}\}$. We set $K=10$. As shown in Figure \ref{fig:alpha}, the $W_2$ distances under different $\gamma$ all converge to a level around $10^{-3}$ after sufficient computation. Nevertheless, a larger $\gamma$ does slow down convergence, which suggests that more balanced data facilitates the computation. In Figure \ref{fig:rho}, we study the impact of $\rho$ on the convergence of Algorithm \ref{alg:alg_main_paper_text_different_seeds}. We choose $K=100$ and $\gamma=10^8$ and observe that a larger correlation slightly accelerates convergence, although it raises privacy concerns.
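
When both distributions are Gaussian, the $W_2$ distance admits a well-known closed form. As a sketch, assume that the reported distances are evaluated on the Gaussian approximation $N(u_k,\Sigma_k)$ of $\mu_k$ (this section does not state so explicitly), and write $\Sigma_{\star}=\frac{1}{n}\Sigma$ for the target covariance, a shorthand introduced only for this display:
\[
W_2^2\big(N(u_k,\Sigma_k),N(u,\Sigma_{\star})\big)=\|u_k-u\|_2^2+\mathrm{Tr}\Big(\Sigma_k+\Sigma_{\star}-2\big(\Sigma_{\star}^{1/2}\Sigma_k\Sigma_{\star}^{1/2}\big)^{1/2}\Big).
\]
The first term measures the error in the mean and the second the mismatch in covariance; both vanish when $u_k=u$ and $\Sigma_k=\Sigma_{\star}$.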

\vspace{-2mm}
\paragraph{Approximate samples} In Figure \ref{fig:empirical_density}, we plot the empirical density of the samples generated by Algorithm \ref{alg:alg_main_paper_text_independent_noise} with $K=10$, $\gamma=10^{8}$, and $\eta=10^{-7}$. For comparison, we show the true density of the target distribution in Figure \ref{fig:true_density}. The empirical density approximates the true density very well, which indicates the potential of FA-LD for simulation in federated settings.

\paragraph{Partial device participation} We study convergence under two popular device-sampling schemes, Scheme I and Scheme II. We fix the number of local steps $K=100$ and the total number of devices $N=50$. We sample $S$ devices and compare different fixed learning rates $\eta$. Results with full device participation are also presented as a baseline for a fair comparison. As shown in Figure \ref{full_device_baseline}, larger learning rates converge faster but lead to larger biases; smaller learning rates, by contrast, consistently yield smaller biases, which is in accordance with Theorem \ref{main_paper_theorem}. However, in partial device scenarios, the bias becomes much less dependent on the learning rate in the long run. We observe in Figures \ref{partial_device_s1_40}, \ref{partial_device_s2_40}, \ref{partial_device_s1_30}, and \ref{partial_device_s2_30} that the bias caused by partial device participation becomes dominant as we decrease the number of sampled devices $S$ for both schemes. Unfortunately, this bias persists even after the algorithms converge, which suggests that the proposed partial device updates may be appropriate only for the early stage of training or for simulation tasks with low accuracy requirements.