%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
% version; also before submission to
% see how the non-anonymous paper
% would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
% ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
\renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors

\usepackage{pifont}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{amsthm}

\usepackage{bm,amsmath,amsthm,amssymb,multicol,enumitem,subfigure}
\usepackage{xargs}
\usepackage{stmaryrd}
\usepackage{natbib}
\usepackage{comment}
% Recommended, but optional, packages for figures and better typesetting:
\usepackage{microtype}
\usepackage{booktabs} % for professional tables
\usepackage{algpseudocode}
\usepackage{algorithm}

\usepackage[T1]{fontenc}
% \usepackage{enumerate}
\usepackage{inputenc}

\usepackage{graphicx} % more modern
\usepackage{subfigure}
\renewcommand*{\thesubfigure}{}
\usepackage{booktabs,balance}
\usepackage{rotating}
\usepackage{boldline}
\usepackage{makecell}
\usepackage{multirow}
\usepackage{balance}



% \usepackage[colorlinks,linkcolor=red,filecolor=blue,citecolor=blue,urlcolor=blue]{hyperref}

% \usepackage{url}
% % \usepackage[algo2e,ruled,noend]{algorithm2e}
% \newcommand\mycommfont[1]{\footnotesize\ttfamily\textcolor{blue}{#1}}
% \SetCommentSty{mycommfont}
% \setlength{\algomargin}{4pt}

\DeclarePairedDelimiter{\ceil}{\lceil}{\rceil}
\DeclarePairedDelimiter{\floor}{\lfloor}{\rfloor}
\newcommand{\pl}{Polyak-\L{}ojasiewicz}
\newcommand{\todoM}[1]{\textcolor{blue}{ToDo (Farzin): #1}}
\newcommand{\todo}[1]{\textcolor{red}{ToDo:~#1}}
\newcommand{\alert}[1]{\textcolor{red}{#1}}
\newcommand{\Var}{\mathrm{Var}}
\newcommand{\E}{\mathrm{E}}

\theoremstyle{plain}
\newtheorem{theo}{Theorem}
\newtheorem{remark}[theo]{Remark}
\newtheorem{proposition}[theo]{Proposition}
\newtheorem{lem}[theo]{Lemma}
\newtheorem{coro}[theo]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theo]{Definition}
\newtheorem{assumption}[theo]{Assumption}


\def\M{\mathcal{M}}
\def\A{\mathcal{A}}
\def\Z{\mathcal{Z}}
\def\S{\mathcal{S}}
\def\D{\mathcal{D}}
\def\R{\mathcal{R}}
\def\P{\mathcal{P}}
\def\K{\mathcal{K}}
\def\E{\mathbb{E}}
\def\F{\mathfrak{F}}
\def\l{\boldsymbol{\ell}}


\newtheorem*{Lemma*}{Lemma}
\newtheorem*{Theorem*}{Theorem}
\newtheorem*{Corollary*}{Corollary}

\newcommand{\eqsp}{\;}
\newcommand{\beq}{\begin{equation}}
\newcommand{\eeq}{\end{equation}}
\newcommand{\eqdef}{\mathrel{\mathop:}=}
\def\EE{\mathbb{E}}
\newcommand{\norm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\pscal}[2]{\left\langle#1\,|\,#2 \right\rangle}
\def\major{\mathsf{M}}
\def\rset{\ensuremath{\mathbb{R}}}
\newcommand{\inter}{\llbracket n \rrbracket}
\newcommand{\interl}{\llbracket L \rrbracket}

\def\tot{\mathsf{h}}

\newcommand{\sign}{\text{sign}}
\newcommand{\ie}{{\em i.e.,~}}

\newcommand{\algo}{\textsc{Fed-LAMB}}






\usetikzlibrary{tikzmark,calc}


%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Fed-LAMB: Layer-wise and Dimension-wise \\Locally Adaptive
	Federated Learning}



\author{Belhal Karimi, Ping Li, Xiaoyun Li\\
	Cognitive Computing Lab\\
	Baidu Research\\
	10900 NE 8th St, Bellevue, WA 98004, USA \\
	\texttt{\{belhal.karimi, pingli98, lixiaoyun996\}@gmail.com}
}



\begin{document}
	\maketitle
	
	\begin{abstract}
		In the emerging paradigm of Federated Learning (FL), a large number of clients, such as mobile devices, are used to train possibly high-dimensional models on their respective data. Combining (\textit{dimension-wise}) adaptive gradient methods (e.g., Adam, AMSGrad)
		with FL has been an active direction, and has been shown to outperform traditional SGD-based FL in many cases. In this paper, we focus on the problem of training federated deep neural networks, and propose a novel FL framework that further introduces \emph{layer-wise} adaptivity to the local model updates to accelerate the convergence of adaptive FL methods. Our framework includes two variants based on two recent locally adaptive federated learning algorithms. Theoretically, we provide a convergence analysis of our layer-wise FL methods, coined Fed-LAMB and Mime-LAMB, which matches the convergence rate of state-of-the-art results in adaptive FL and exhibits linear speedup in terms of the number of workers. Experimental results on various datasets and models, under both IID and non-IID local data settings, show that both Fed-LAMB and Mime-LAMB achieve faster convergence and better generalization performance than various recent adaptive FL methods.\vspace{-0.05in}
	\end{abstract}
	
	
	
	\section{Introduction}\label{sec:introduction}
	
	An increasingly important task when learning models from observed data is training over a large number of clients, which may be either personal devices or distinct entities.
	In the paradigm of Federated Learning (FL)~\citep{konevcny2016federated,mcmahan2017communication}, a central server orchestrates the optimization over those clients under the constraint that the data can neither be gathered nor shared among the clients.
	This setting is computationally efficient, since it exploits more distributed computing resources, and it is practical: individual data holders (e.g., mobile devices) can jointly train a model without leaking private data.
	In this paper, we consider the following optimization problem:
	\begin{equation}\label{eq:opt}
	\min_{\theta} f(\theta) \eqdef \frac{1}{n} \sum_{i=1}^n f_i(\theta)= \frac{1}{n} \sum_{i=1}^n \mathbb E_{\xi\sim \mathcal X_i}[F_i(\theta;\xi)],
	\end{equation}
	where the nonconvex function $f_i$ (e.g., the loss of a deep network) represents the average loss over the local data samples of worker $i \in \inter$, and $\theta \in \mathbb R^d$ is the global model parameter. 
	$\mathcal X_i$ is the data distribution on client $i$. There are two general scenarios of FL~\citep{yang2019federated}: (i) the \textit{cross-silo} setting, where $n$ is small/moderate and the clients can be, e.g., different data servers; (ii) the \textit{cross-device} setting, where $n$ can be large (e.g., millions) and the clients are mobile devices. While \eqref{eq:opt} resembles the objective of standard distributed optimization, the principles and setting of FL differ from the classical distributed paradigm. Two of the main differences are: (i) Local updates: FL allows clients to perform multiple updates on the local models before the global aggregation, which improves the computational resource efficiency and reduces the frequency of communication; (ii) Data heterogeneity: in FL, the local data distributions $\mathcal X_i$ are usually different across workers, hindering the convergence of the global model. Federated learning aims to find a global solution of \eqref{eq:opt} in as few communication rounds as possible. 
	
	One of the standard and most popular frameworks for FL is Fed-SGD~\citep{mcmahan2017communication}: each device performs multiple local Stochastic Gradient Descent (SGD) steps and sends its local model to the server, which averages the received local model parameters and broadcasts the average back to the devices. Moreover, momentum can be added to local SGD training for faster convergence and better learning performance~\citep{yu2019linear}. On the other hand, adaptive gradient methods (e.g., Adam~\citep{kingma2015adam}, AMSGrad~\citep{reddi2019convergence}) have shown great success in many deep learning tasks. For instance, the update rule of Adam at step $t$ reads as
	\begin{equation}\label{rule:adam}
	\begin{aligned}
	\theta_t=\theta_{t-1}-\alpha\frac{m_t}{\sqrt{v_t}},\ \quad &m_t=\beta_1 m_{t-1}+(1-\beta_1) g_t,\\
	& v_t=\beta_2 v_{t-1}+(1-\beta_2) g_t^2,
	\end{aligned}
	\end{equation}
	where $\alpha$ is the learning rate and $g_t$ is the gradient at time $t$. 
	We note that the effective learning rate of Adam is $\alpha/\sqrt{v}$, which is different across dimensions, i.e., \textit{dimension-wise} adaptive. Recently, we have seen growing research efforts in the design of FL frameworks that adopt adaptive gradient methods as the protocols for local model training instead of SGD. 
	Examples include federated AMSGrad (Fed-AMS)~\citep{chen2020toward} and Mime~\citep{karimireddy2020mime} with Adam updates. 
	Specifically, in both methods, in each round the global server not only aggregates the local models, but also broadcasts to the workers a ``global'' second moment estimate to reconcile the dimension-wise adaptive learning rates across the clients. This step can therefore be regarded as a natural mitigation of data heterogeneity, which is a common and important practical scenario that affects the performance of FL algorithms~\citep{li2019federated,liang2019variance,karimireddy2019scaffold}. Adaptive-optimizer-based FL methods have been shown to outperform many SGD-based FL methods on various tasks, making them a promising direction in FL system design.
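
	\noindent\textit{Illustration (a simplified sketch).} To make the dimension-wise adaptivity in \eqref{rule:adam} concrete, assume, purely for illustration, that one coordinate receives a constant nonzero gradient $g_t = g$ and that $m_0 = v_0 = 0$. Unrolling the two recursions then gives
	\begin{equation*}
	m_t = (1-\beta_1^t)\, g, \qquad v_t = (1-\beta_2^t)\, g^2, \qquad \frac{m_t}{\sqrt{v_t}} = \frac{1-\beta_1^t}{\sqrt{1-\beta_2^t}}\,\sign(g),
	\end{equation*}
	so the step taken in this coordinate tends to $\alpha\,\sign(g)$ as $t$ grows, whatever the magnitude of $g$. Under this idealized assumption, coordinates with very different gradient scales move at comparable speeds, which is what ``dimension-wise adaptive'' refers to.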
	
	In this work, we specifically focus on further improving the convergence speed and learning performance of locally adaptive FL algorithms. Our construction is based on introducing a special learning rate schedule into the local training of FL, which has not been proposed in the literature before. For (single-machine) training of deep neural networks using Adam, \citet{you2019large} proposed a \textit{layer-wise} adjusted learning rate scheme called LAMB, in which the effective update $m_t/\sqrt{v_t}$ is further normalized, layer by layer, by the weight norm of the corresponding layer of the deep neural network. \citet{you2019large} proved that LAMB matches the convergence rate of Adam theoretically, and demonstrated the superior performance of LAMB empirically. With these weight-dependent adjusted learning rates, LAMB allows large-batch training, which can in particular speed up training with large datasets and models such as ImageNet~\citep{deng2009imagenet} and BERT~\citep{devlin2019bert}.
	
	\vspace{0.05in}
	\noindent\textbf{Contributions.} Although layer-wise learning rates have been successfully used in (single-machine) model training, one question remains unexplored: can methods like LAMB also be used for local training in federated learning, and can they speed up the convergence of the global model? In this paper, we propose an improved framework for locally adaptive FL algorithms that integrates both \emph{dimension-wise} and \emph{layer-wise} adaptive learning rates in each device's local update. We provide theoretical and empirical justification for the efficacy of such layer-wise adaptivity in local federated training. More specifically, our contributions are summarized as follows:
	
	\begin{itemize}
		\item We develop Fed-LAMB and Mime-LAMB, two instances of our layer-wise adaptive optimization framework for FL, following a layer-wise adaptive strategy to accelerate the training of deep neural networks. 
		
		
		\item We show that our algorithm converges at the rate of $\mathcal{O}\left(\frac{1}{\sqrt{n\tot R}} \right)$ to a stationary point, where $\tot$ is the number of layers of the network, $n$ is the number of clients and $R$ is the number of communication rounds. This matches the convergence rate of LAMB, AMSGrad, as well as the state-of-the-art results in federated learning. The theoretical communication efficiency matches that of Fed-AMS~\citep{chen2020toward}.
		
		\item We empirically compare several recent adaptive FL methods under both homogeneous and heterogeneous data settings on various benchmark datasets. 
		Our results confirm the accelerated empirical convergence of Fed-LAMB and Mime-LAMB over the baseline methods, including Fed-AMS and Mime. 
		In addition, Fed-LAMB and Mime-LAMB can also reach similar, or better, test accuracy than their corresponding baselines.
	\end{itemize}
	
	\begin{figure*}[t]
		\centering
		\includegraphics[width=4.7in]{figure_final/plot1.pdf}
		
		
		\caption{Illustration of the Fed-LAMB framework (Algorithm~\ref{alg:ldams}), with a three-layer network and $\phi(x)=x$ as an example. For device $i$ and each local iteration in round $r$, the adaptive ratio of the $j$-th layer $\psi_{r,i}^j$ is normalized according to $\Vert \theta_{r,i}^j\Vert$, and then used for updating the local model. At the end of each round $r$, client $i$ sends $\theta_{r,i} =  [\theta_{r,i}^{\ell}]_{\ell =1}^{\tot}$ and $v_{r,i}$ to the central server, which transmits the aggregated $\theta$ and $\hat v$ back to the devices to complete a round of training.}
		\label{fig:illustrate}
	\end{figure*}
	
	\section{Background}\label{sec:related}
	
	We summarize some relevant work on adaptive optimization, layer-wise adaptivity and federated learning.
	
	\noindent\textbf{Adaptive gradient methods.}
	Adaptive methods have become the leading approach for many nonconvex optimization tasks.
	These gradient-based optimization algorithms cope with the possibly high nonconvexity of the objective function by adaptively updating each coordinate of their learning rate using past gradients. 
	Commonly used examples include RMSprop~\citep{tieleman2012rmsprop}, Adadelta~\citep{zeiler2012adadelta}, Adam~\citep{kingma2015adam}, Nadam~\citep{dozat2016incorporating} and AMSGrad~\citep{reddi2019convergence}.
	Their popularity stems from their strong performance in training deep neural networks.
	They generally combine the idea of adaptivity from AdaGrad~\citep{duchi2011adaptive,mcmahan2010adaptive}, as explained above, and the idea of momentum from Nesterov's method~\citep{nesterov2003introductory} or the heavy-ball method~\citep{polyak1964some} using past gradients.
	AdaGrad displays superiority over other classical methods when the gradient is sparse~\citep{duchi2011adaptive}. Yet, when applying AdaGrad to train deep neural networks, it is observed that the learning rate might decay too fast. Consequently, \citet{kingma2015adam} developed Adam, whose update rule is presented in \eqref{rule:adam}. 
	A variant, called AMSGrad~\citep{reddi2019convergence}, forces $v$ to be monotone to fix the convergence issue. \citet{loshchilov2019decoupled} proposed AdamW, which combines weight decay with Adam. The convergence and generalization of adaptive methods and their application in decentralized learning are studied in, e.g.,~\citep{zhou2018convergence,chen2019convergence,zhou2020towards,wang2021optimistic, chen2022convergence}, among others. 
	
	\vspace{0.05in}
	\noindent\textbf{Layer-wise Adaptivity.} When training deep networks, the scale of the gradients often differs considerably across network layers. When the same learning rate is used for the whole network, the update might be too conservative for some layers (those with large weights), which may slow down convergence. Based on this observation, \citet{you2018imagenet} proposed LARS, an extension of SGD with layer-wise adjusted scaling, whose performance, however, is not consistent across tasks. Later, \citet{you2019large} proposed LAMB, an analogous layer-wise adaptive variant of Adam. The update rule of LAMB for the $\ell$-th layer of the network can be expressed as
	\begin{align*}
	\theta_t^\ell=\theta_{t-1}^\ell-\frac{\alpha \| \theta_{t-1}^\ell\|}{\|\psi_t^\ell\|}\psi_t^\ell,\ \text{with}\ \psi_t^\ell=m_t^\ell/\sqrt{v_t^\ell},
	\end{align*}
	where $m_t$ and $v_t$ are defined in \eqref{rule:adam}. Intuitively, for the $\ell$-th layer, when the gradient magnitude is too small compared to the scale of the model parameter, we increase the effective learning rate so that the model moves sufficiently far. Theoretically, \citet{you2019large} showed that LAMB achieves the same convergence rate as Adam; empirically, LAMB can converge significantly faster than Adam, allowing the use of large mini-batch sizes with fewer training iterations on large datasets. 
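
	\noindent\textit{Illustration.} As a quick check of this intuition (a direct consequence of the displayed update rule, not an additional result of \citet{you2019large}), taking norms on both sides shows that, whenever $\psi_t^\ell \neq 0$,
	\begin{equation*}
	\|\theta_t^\ell - \theta_{t-1}^\ell\| = \frac{\alpha \|\theta_{t-1}^\ell\|}{\|\psi_t^\ell\|}\,\|\psi_t^\ell\| = \alpha \|\theta_{t-1}^\ell\|,
	\end{equation*}
	that is, every layer moves by the same fraction $\alpha$ of its current weight norm, independently of its gradient scale. With hypothetical values $\alpha = 0.01$, a layer with $\|\theta_{t-1}^\ell\| = 100$ moves by $1$, while a layer with $\|\theta_{t-1}^\ell\| = 1$ moves by $0.01$.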
	
	
	\vspace{0.05in}
	\noindent\textbf{Federated learning.}
	An extension of the classic distributed training paradigm is called Federated Learning (FL)~\citep{konevcny2016federated,mcmahan2017communication} which has seen many applications in various fields~\citep{yang2019federated,leroy2019federated,bonawitz2019towards,niknam2020federated,xu2021federated}. For Fed-SGD (where clients perform SGD-based updates), recent variants and theoretical analysis on the convergence can be found in~\citep{yu2019linear,karimireddy2019scaffold,khaled2020tighter,li2020convergence,woodworth2020is,wang2020slowmo}.
	
	
	Many works have considered adaptive gradient methods in FL. \citet{reddi2021adaptive} proposed Adp-Fed, where the central server applies Adam-type updates and the local clients perform SGD updates. \citet{li2022distributed,li2023analysis} studied distributed and federated adaptive methods under communication compression. 
	\citet{chen2020toward} and \citet{karimireddy2020mime} proposed Fed-AMS and Mime, respectively, which adopt Adam/AMSGrad at the client level. Both works mitigate the influence of data heterogeneity by ``sharing'' the second moment $v$, which controls the effective learning rates (more details will be provided later). Locally adaptive FL has also been applied to decentralized training~\citep{zhao2022communication}. On many tasks, these methods outperform Fed-SGD and other popular methods like SCAFFOLD~\citep{karimireddy2019scaffold} and FedProx~\citep{li2020federatedprox,yuan2022convergence}. \citet{charles2021large} empirically tested an FL method where LARS (i.e., layer-wise SGD)~\citep{you2018imagenet} is applied at the central server in local SGD. This is very different from our work, where the layer-wise adjustment happens locally with AMSGrad as the local optimizer. That is, our local models are trained with \textit{dual} adaptivity.
	
	
	\section{Layer-wise Locally Adaptive Federated Learning}\label{sec:main}
	
	
	In this section, we introduce our proposed FL framework, admitting both \textit{dimension-wise} adaptivity (of adaptive learning rate) and \textit{layer-wise} adaptivity (of layer-wise scaling). We mainly consider AMSGrad~\citep{reddi2019convergence} as the prototype adaptive gradient method. We assume the loss function $f(\cdot)$ is induced by a multi-layer neural network, which includes a broad class of network architectures like MLP, CNN, ResNet and Transformers. 
	
	\vspace{0.05in}
	\noindent\textbf{Notations.} We denote by $\theta$ the vector of parameters taking values in $\rset^p$. 
	Suppose the neural network has $\tot$ layers, each with size $p_\ell$ (thus, $p= \sum_{\ell=1}^\tot p_\ell$). For each layer $\ell \in \llbracket \tot \rrbracket$, denote by $\theta^\ell$ the sub-vector corresponding to the $\ell$-th layer. Let $R$ be the number of communication rounds and $T$ be the number of local iterations per round. Moreover, $\theta_{r,i}^{\ell,t}$ is the model parameter of layer $\ell$ for worker $i$ at round $r$ and local iteration $t$.
	
	
	\begin{algorithm}[tb]
		\caption{ \colorbox{blue!20!white}{Fed-LAMB} and \colorbox{red!20!white}{Mime-LAMB} } \label{alg:ldams}
		\begin{algorithmic}[1]
			%\small
			\State \textbf{Input}: $0< \beta_1, \beta_2 <1$; learning rate $\alpha$; weight decaying rate $\lambda \in [0,1]$; frequency parameter $Z$.
			\State \textbf{Initialize}: $\theta_{0,i} \in \Theta \subseteq \mathbb R^d $; $m^0_{0,i}=\hat v^0_{0,i}=v^0_{0,i} = 0$, $\forall i\in \llbracket n\rrbracket$; $\bar{\theta}_0 =  \frac{1}{n} \sum_{i=1}^n \theta_{0,i}$; $\hat v_0=\epsilon$
			
			\For{$r=1$ to $R$}
			\State Sample a set of clients $D^r$
			\For{parallel for device $i \in D^{r}$}
			\State Set $\theta_{r,i}^{0} = \bar{\theta}_{r-1}$,\ \ $m^{0}_{r,i} = m^T_{r-1,i}$\ ,\ \ $v^{0}_{r,i} = \hat{v}_{r-1}$
			\For{$t=1$ to $T$}
			\State Sample a mini-batch from the local data
			\State Compute stochastic gradient $g^t_{r,i}$ at $\theta_{r,i}^{t-1}$
			\State $m^t_{r,i} = \beta_1 m^{t-1}_{r,i} + (1 - \beta_1) g^t_{r,i}$
			% \State $m^{t}_{r,i}=m^{t}_{r,i} /\left(1-\beta_{1}^{t}\right)$ \label{line:new1}
			\State \colorbox{blue!20!white}{$v^{t}_{r,i} = \beta_2 v^{t-1}_{r,i} + (1 - \beta_2) (g^t_{r,i})^2$ }
			% \State $v^{t}_{r,i}=v^{t}_{r,i} /\left(1-\beta_{2}^{t}\right)$ \label{line:new2}
			\State Compute the ratio  $\psi_{r,i}^t=m^{t}_{r,i}/\sqrt{\hat v_{r-1}}$. \label{line:scale}
			
			\State \label{line:layer} Update local model for each layer $\ell \in \llbracket \tot \rrbracket$: 
			\begin{equation}\label{eq:updatelayer}
			\theta_{r,i}^{\ell,t}=\theta_{r,i}^{\ell,t-1}-\frac{\alpha_{r}\phi(\|\theta_{r,i}^{\ell,t-1}\|)(\psi_{r,i}^{\ell,t}+\lambda \theta_{r,i}^{\ell,t-1})}{\|\psi_{r,i}^{\ell,t}+\lambda \theta_{r,i}^{\ell,t-1}\|} 
			\end{equation} 
			\EndFor
			\State Communicate $\theta_{r,i}^{T} = [\theta_{r,i}^{\ell,T}]_{\ell =1}^{\tot}$ to server
			
			
			\State \colorbox{blue!20!white}{Communicate $v_{r,i}^T$ to server}
			
			\State \colorbox{red!20!white}{Communicate $\nabla f_i(\bar\theta_{r-1})$ using full local data}
			
			\EndFor
			
			\State Server computes $\bar{\theta}_r = \frac{1}{|D^{r}|} \sum_{i \in D^{r}} \theta_{r,i}^{T}$
			
			
			\State \colorbox{blue!20!white}{Server computes $\hat{v}_{r} = \max( \hat{v}_{r-1},\frac{1}{|D^{r}|} \sum_{i \in D^{r}} v^T_{r,i} )$}
			
			\State \colorbox{red!20!white}{Compute $\nabla f(\bar \theta_{r-1})=\frac{1}{|D^r|}\sum_{i\in D^r}\nabla f_i(\bar \theta_{r-1})$}
			
			\State \colorbox{red!20!white}{Compute $v_r = \beta_2 v_{r-1}+(1-\beta_2)\nabla f(\bar \theta_{r-1})^2$}
			
			\State \colorbox{red!20!white}{Update $\hat v_{r}=\max(\hat v_{r-1},v_r)$}
			
			\EndFor
		\end{algorithmic}
		
	\end{algorithm}
	
	
	\vspace{0.05in}
	\noindent\textbf{Algorithm.} In general, our proposed algorithm can be viewed as a novel extension of LAMB to the more complicated federated learning setting. Based on the two recent works on locally adaptive FL mentioned above, we present the framework with two instances, Fed-LAMB and Mime-LAMB, as summarized in Algorithm~\ref{alg:ldams} and depicted in Figure~\ref{fig:illustrate}. We differentiate the steps of these two methods by blue \colorbox{blue!20!white}{(Fed-LAMB)} and red \colorbox{red!20!white}{(Mime-LAMB)} boxes surrounding the text. Both methods use layer-wise adaptive LAMB for local updates (Line~13). The update in \eqref{eq:updatelayer} on local workers can be expressed as
	\begin{align*}
	\theta \leftarrow \theta-\alpha\frac{\phi(\|\theta\|)}{\|\psi+\lambda\theta\|}(\psi+\lambda\theta),
	\end{align*}
	where $\phi(\cdot): \mathbb R_+ \mapsto \mathbb R_+$ is a scaling function (usually chosen to be the identity function in practice) and $\lambda$ is the weight decay rate. This way, the gradients are effectively normalized by the magnitude of the layer weights, forcing the model to move sufficiently far at every layer. Such a normalization effect may accelerate the convergence of the model.
	
	
	The main difference between the two variants, Fed-LAMB and Mime-LAMB, is the way the second moment $\hat v$ is synchronized, i.e., the dimension-wise adaptive learning rate. 
	Both methods maintain a global $\hat v$ at the central server:
	\begin{itemize}
		\item \colorbox{blue!20!white}{Fed-LAMB (Line 20)}: at the end of each round, the $i$-th client communicates its local $v_{i}$; the server updates the global $\hat v$ by taking the maximum of its previous value and the average of $v$ over the participating clients, and sends back the global $\hat v$.
		
		\item \colorbox{red!20!white}{Mime-LAMB (Lines 21--23)}: in each round $r$, each client computes and transmits the gradient at the global model $\bar\theta_{r-1}$ using its full local data; the server updates the global $v$ and $\hat v$ in the same manner as AMSGrad. 
	\end{itemize}
	
	When implementing the algorithms, note that in Mime-LAMB the global $v$ is calculated directly from full-batch gradients (averaged over all clients). As a result, Mime-LAMB needs to compute gradients twice, leading to twice the computational cost of Fed-LAMB. We also note that the local update of Fed-LAMB (Line 13 of Algorithm~\ref{alg:ldams}) incorporates ``decoupled'' weight decay, which is the same weight decay mechanism as in the AdamW algorithm~\citep{loshchilov2019decoupled}.
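
	\noindent\textit{Illustration.} One consequence of the max operations in Lines 20 and 23 of Algorithm~\ref{alg:ldams} can be sketched as follows, assuming that the maximum is taken coordinate-wise and that the initialization $\hat v_0 = \epsilon$ means a positive constant in every coordinate (both are our reading of the algorithm). Coordinate-wise,
	\begin{equation*}
	\epsilon = \hat v_0 \leq \hat v_1 \leq \cdots \leq \hat v_R, \qquad \text{hence} \qquad \frac{1}{\sqrt{\hat v_{r}}} \leq \frac{1}{\sqrt{\hat v_{r-1}}} \leq \frac{1}{\sqrt{\epsilon}}.
	\end{equation*}
	Under these assumptions, the dimension-wise scaling $1/\sqrt{\hat v_{r-1}}$ used in Line 12 never increases from one round to the next and stays bounded by $1/\sqrt{\epsilon}$, for both variants.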
	
	
	\vspace{0.05in}
	\noindent\textbf{Data Heterogeneity:} Conceptually, both approaches aim to alleviate the impact of data heterogeneity by globally reconciling the adaptive learning rates. We call this ``moment sharing''. Therefore, in some sense, Algorithm~\ref{alg:ldams} is naturally capable of balancing the heterogeneity in different local data distributions. Indeed, \citet{chen2020toward} and \citet{karimireddy2020mime} have shown that Fed-AMS and Mime would perform much worse, or even diverge, without aggregating and sharing the second moment $\hat v$ (please refer to the papers for details). Intuitively, synchronizing $\hat v$ keeps all the clients ``on the same pace'', which is crucial for the convergence of locally adaptive FL methods. 
	
	\vspace{0.05in}
	\noindent\textbf{Extension: skip synchronization of $\hat v_t$.}\ In practice, when trained with the same number of rounds $R$ and local iterations $T$, Mime, Fed-LAMB and Mime-LAMB all require communicating two tensors (the local model update and the second moment $v$), while Fed-SGD~\citep{mcmahan2017communication} and Adp-Fed~\citep{reddi2021adaptive} only communicate one local update tensor. Hence, locally adaptive methods in general tend to require more communication. We now discuss a simple implementation trick for our algorithm that reduces this extra cost. Note that, as long as $\hat v_t$ is consistent across clients, we may not need to update and broadcast it in every round. To reduce the extra communication overhead of transmitting $\hat v$, one can reduce the aggregation frequency of $\hat v$ in Algorithm~\ref{alg:ldams} (e.g., synchronize $\hat v$ every $Z$ rounds). It can be shown that this ``skip'' aggregation of the second moment does not affect the convergence rate of our Fed-LAMB (see Theorem~\ref{th:multiple update}). Yet it reduces the communication of $\hat v$ by a factor of $Z$, which to a great extent alleviates the extra communication cost of locally adaptive methods. We will also show empirical evidence for this strategy in our experiments.
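
	\noindent\textit{Illustration (rough accounting).} To quantify the saving, consider only the client-to-server uploads of Fed-LAMB and treat the model and the second moment as two $p$-dimensional vectors (both are simplifying assumptions; we also assume that $Z$ divides $R$). Without skipping, a client uploads $2pR$ scalars over $R$ rounds. When $\hat v$ is synchronized every $Z$ rounds, this becomes
	\begin{equation*}
	\underbrace{pR}_{\text{model}} + \underbrace{pR/Z}_{\text{second moment}} = pR\left(1+\frac{1}{Z}\right).
	\end{equation*}
	For a hypothetical choice $Z=10$, the total is $1.1\,pR$, i.e., $55\%$ of the original $2pR$, which is close to the cost of methods that communicate a single tensor.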
	
	
	
	\section{Theoretical Analysis}\label{sec:theory}
	
	
	In the context of nonconvex stochastic optimization for federated learning, we will make the following standard analytical assumptions.
	
	\begin{assumption}[Smoothness]\label{ass:smooth}
		For all $i \in \inter$ and $\ell \in \llbracket \tot \rrbracket$, the local loss function is $L_\ell$-smooth: $\norm{\nabla f_i (\theta^\ell) - \nabla f_i (\vartheta^\ell)} \leq L_\ell \norm{\theta^\ell-\vartheta^\ell}$.
	\end{assumption}
	
	\begin{assumption}[Unbiased and bounded gradient]\label{ass:boundgrad}
		The stochastic gradient is unbiased for all $r,t,i$: $\EE[g_{r,i}^t] = \nabla f_i(\theta_r^t)$, and bounded: $\norm{g_{r,i}^t} \leq M$.
	\end{assumption}
	
	\begin{assumption}[Bounded variance]\label{ass:var}
		The stochastic gradient satisfies (\emph{locally}) $\EE[|g_{r,i}^j - \nabla f_i(\theta_r)^j|^2] \leq \sigma^2$, and (\emph{globally}) $ \frac{1}{n} \sum_{i=1}^n \|\nabla f_{i}(\theta_r) - \nabla f(\theta_r)\|^2 \leq G^2$.
	\end{assumption}
	
	Assumption~\ref{ass:smooth} and Assumption~\ref{ass:boundgrad} are commonly used in the analysis of adaptive gradient methods~\citep{reddi2019convergence,chen2019convergence,reddi2021adaptive,karimireddy2020mime}. Assumption \ref{ass:var} characterizes the data heterogeneity among local devices, and $G = 0$ when local data are IID. 
	
	As in~\citet{you2019large}, we further make the following assumption on the scaling function $\phi$.
	
	\begin{assumption}[Bounded scaling function] \label{ass:phi}
		There exist constants $\phi_m>0$ and $\phi_M>0$ such that $\phi_m \leq  \phi(a) \leq \phi_M$ for any $a > 0$.
	\end{assumption}
	
	Assumption~\ref{ass:phi} can be satisfied when, for example, we let $\phi(a)=\min\{ a+\zeta, \phi_M \}$ be the identity map plus a small constant $\zeta$ with an upper clipping threshold at some $\phi_M$.
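
	\noindent\textit{Illustration.} As a short verification of this example, assume $0 < \zeta \leq \phi_M$ (an assumption we add for the check). Then, for any $a > 0$,
	\begin{equation*}
	\zeta \leq \min\{ a+\zeta, \phi_M \} \leq \phi_M,
	\end{equation*}
	so Assumption~\ref{ass:phi} holds with $\phi_m = \zeta$. Since the step in \eqref{eq:updatelayer} has norm $\alpha_r \phi(\|\theta_{r,i}^{\ell,t-1}\|)$, each layer then moves by at least $\alpha_r \zeta$ and at most $\alpha_r \phi_M$ per local iteration.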
	
	We now state our main result regarding the convergence rate of the proposed Algorithm~\ref{alg:ldams}. 
	
	\begin{theo}\label{th:multiple update}
		Under Assumptions~\ref{ass:smooth}--\ref{ass:phi}, consider $\{\overline{\theta_r}\}_{r>0}$ obtained from Algorithm~\ref{alg:ldams} with a constant learning rate $\alpha$. Suppose $\lambda = 0$. Then the expected squared norm of the scaled gradient of the global model, at a round chosen uniformly from $1,\ldots, R$, is bounded by
		\begin{align} 
		&\frac{1}{R}\sum_{r=1}^R  \EE\left[ \left\| \frac{\nabla f(\overline{\theta_r})}{\hat v_r^{1/4}}   \right \|^2 \right] \nonumber\\
		&\leq    \sqrt{\frac{M^2 p}{n}}  \frac{ \triangle}{\tot \alpha R}+\frac{4 \alpha^2 \overline{L} M^2 T^2 \phi_M^2 (1-\beta_2)p}{\sqrt{\epsilon}} \notag \\
		&+4\alpha^2 \frac{M^2}{\sqrt{\epsilon}} +      \frac{\phi_M   \sigma^2}{R n} \sqrt{\frac{1 - \beta_2}{M^2 p}  } + 4\alpha^2 \left[ \phi_M^2\sqrt{M^2+p\sigma^2} \right]     \notag\\
		& +4  \frac{\alpha^2 \overline{L}}{\sqrt{\epsilon}}  M^2 T^2 G^2 (1-\beta_2)p +4\alpha \left[\phi_M \frac{\tot \sigma^2}{\sqrt{n}}\right], \label{bound1multiple}
		\end{align}
		where $\triangle=\EE[f(\bar{\theta}_1)]  - \min \limits_{\theta \in \Theta} f(\theta)$ and $\overline{L}=\sum_{\ell=1}^\tot L_\ell$.
	\end{theo}
	\begin{remark}
		Theorem~\ref{th:multiple update} applies to both the Fed-LAMB and Mime-LAMB variants. The appearance of $p$ in the rate arises because the variance bound is assumed on each dimension in Assumption \ref{ass:var}. This dependency on $p$ can be removed when the variance bound in Assumption \ref{ass:var} is assumed globally, which is also common in the optimization literature. Moreover, this result also holds for Algorithm~\ref{alg:ldams} with skip synchronization of $\hat v_t$ as discussed earlier.
	\end{remark}
	
	
	Using the uniform boundedness of the second moment accumulator $\|\hat v_r \|$ (which follows from Assumption~\ref{ass:boundgrad}) and choosing a suitable learning rate that decreases with $R$, we have the following simplified statement.
	
	\begin{coro}\label{coro:main}
		Under the same setting as Theorem~\ref{th:multiple update}, with $\alpha = \mathcal{O}(\frac{1}{ \sqrt{ \tot R}})$, it holds that
		\vspace{-0.1in}
		\begin{align} \label{coro:rate}
		&\hspace{0.2in} \frac{1}{R}\sum_{r=1}^R  \EE\left[ \left\| \nabla f(\overline{\theta_r})   \right \|^2 \right] \nonumber\\
		&\hspace{0.6in} \leq \mathcal{O}\left( \frac{\sqrt p}{\sqrt{n\tot R}}+\frac{\sqrt\tot \sigma^2 }{\sqrt{nR}}  + \frac{G^2T^2p}{R\tot}\right).
		\end{align}
	\end{coro}
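	
	As a rough sanity check of how the three terms in~\eqref{coro:rate} arise, one can substitute $\alpha = c/\sqrt{\tot R}$ into the corresponding terms of~\eqref{bound1multiple}. Here $c>0$ is a generic constant that we introduce only for illustration, and the remaining problem constants ($\triangle$, $M$, $\phi_M$, $\overline{L}$, $\epsilon$, $\beta_2$) are treated as $\mathcal O(1)$; this is a sketch under these assumptions rather than the full proof:
	\begin{align*}
	\sqrt{\frac{M^2 p}{n}}\,\frac{\triangle}{\tot \alpha R} &= \frac{M\triangle}{c}\cdot\frac{\sqrt{p}}{\sqrt{n \tot R}}, \\
	4\alpha\,\phi_M \frac{\tot \sigma^2}{\sqrt{n}} &= 4c\,\phi_M\cdot\frac{\sqrt{\tot}\,\sigma^2}{\sqrt{nR}}, \\
	\frac{4\alpha^2 \overline{L}}{\sqrt{\epsilon}} M^2 T^2 G^2 (1-\beta_2)p &= \frac{4c^2 \overline{L} M^2 (1-\beta_2)}{\sqrt{\epsilon}}\cdot\frac{G^2T^2p}{R\tot}.
	\end{align*}
	For fixed $T$ and $p$, the remaining terms of~\eqref{bound1multiple} decay at least as fast as $1/R$ under the same choice of $\alpha$.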
	
	
	The leading two terms display the dependence of the convergence rate of Fed-LAMB on the initialization and on the local variance of the stochastic gradients (Assumption~\ref{ass:var}). The last term involves the number of local updates $T$ and the global variance $G^2$, which characterizes the data heterogeneity. Next, we provide a detailed discussion and comparison of our result with related prior works.
	
	\vspace{0.05in}
	
	\noindent\textbf{LAMB bound in~\citep{you2019large}: }
	We start by comparing our convergence rate with that of LAMB, Theorem 3 in~\citep{you2019large}. In the single-machine setting, the convergence rate of LAMB is $\mathcal O(\frac{\sqrt{p}}{\sqrt{\tot T}})$, where $T$ is the number of training iterations. Note that the convergence rate of Fed-LAMB differs from that of LAMB in that the convergence criterion is evaluated at the averaged parameters (global model) at the end of each round. In Corollary~\ref{coro:main}, our rate would match LAMB if we take the number of local steps $T=1$. This is also true for any fixed $T$ and sufficiently large $R$. In addition, the $\mathcal O(\frac{1}{\sqrt{nR}})$ rate of Fed-LAMB implies an important \textit{linear speedup} effect: the number of iterations needed for Fed-LAMB to reach a $\delta$-stationary point decreases linearly in $n$, which displays the merit of distributed (federated) learning.
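	
	To make the linear speedup concrete, suppose (as a simplifying assumption for illustration) that $R$ is large enough for the bound in~\eqref{coro:rate} to be dominated by a term of the form $C/\sqrt{nR}$, where $C>0$ is a generic constant independent of $n$ and $R$ that we introduce here, and that a $\delta$-stationary point means the averaged expected squared gradient norm is at most $\delta$. Then
	\begin{align*}
	\frac{C}{\sqrt{nR}} \leq \delta \quad\Longleftrightarrow\quad R \geq \frac{C^2}{n\delta^2},
	\end{align*}
	so that, for example, doubling the number of clients $n$ halves the number of rounds sufficient to guarantee the same $\delta$.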
	
	
	\vspace{0.05in}
	
	\noindent\textbf{Fed-AMS bound in~\citep{chen2020toward}: }
	We now compare our method theoretically with Fed-AMS, the baseline distributed adaptive method developed in~\citep{chen2020toward}. Their results state that when $T\leq \mathcal O(R^{1/3})$, the convergence rate of Fed-AMS is $\mathcal O(\frac{1}{\sqrt{nR}})$. Firstly, when the number of rounds $R$ is sufficiently large, both our rate~\eqref{coro:rate} and the rate of Fed-AMS are dominated by $\mathcal O(\frac{1}{\sqrt{n R}})$, improving the convergence rate of the standard AMSGrad, e.g.,~\citep{zhou2018convergence}, by $\mathcal O(1/\sqrt n)$ (i.e., linear speedup). Secondly, in~\eqref{coro:rate}, the last term containing the number of local updates $T$ is small as long as $T^4\leq \mathcal O(\frac{Rh}{G^2})$. If we further assume $h\simeq T$, then we get the same rate of convergence as Fed-AMS with $T\leq \mathcal{O}(R^{1/3})$ local iterations, identical to the condition of Fed-AMS. Under these analytic settings and conditions, the convergence rate of Fed-LAMB also matches that of many popular federated learning methods in nonconvex optimization, e.g., Fed-SGD~\citep{mcmahan2017communication}, Mime~\citep{karimireddy2020mime} and Adp-Fed~\citep{reddi2021adaptive}. Moreover, when $G$ is small (less data heterogeneity), the bound on $T$ increases, i.e., we can conduct more local updates. This is intuitive: for example, when $G=0$ in the IID data setting, $T$ can be very large.
	
	In summary, Fed-LAMB achieves the same asymptotic convergence rate as Fed-AMS in the federated (distributed) learning setting. Our method also exhibits the favorable linear speedup property with respect to the number of clients in the system. In the experimental study that follows, we show that Fed-LAMB and its variants also provide notable acceleration empirically.
	
	
	
	\section{Experiments}\label{sec:numerical}
	
	In this section, we conduct experiments on benchmark datasets with various network architectures to evaluate the effectiveness of our proposed method in practice. The results confirm its merit in terms of convergence speed: Fed-LAMB and Mime-LAMB reduce the number of rounds, and thus the communication cost, required to reach a similar stationary point (or test accuracy) compared with the baseline methods.
	In many cases, Fed-LAMB also brings notable improvement in generalization over baselines.
	
	
	\noindent\textbf{Methods.} We evaluate the following five FL algorithms, mainly focusing on recent federated optimization approaches based on adaptive gradient methods: 
	\begin{enumerate}
		\item Fed-SGD~\citep{mcmahan2017communication}, standard federated averaging with local SGD updates.
		
		\item Adp-Fed (\emph{Adaptive Federated Optimization}, see Appendix for more details), the federated adaptive algorithm proposed by~\citep{reddi2021adaptive}. Adp-Fed performs local SGD updates. In each round $r$, the changes in local models, $\triangle_i=w_{r,i}^T-w_{r,i}^0$, $i=1,...,n$, are sent to the central server for an aggregated Adam update. 
		
		\item Fed-AMS~\citep{chen2020toward}, locally adaptive AMSGrad algorithm.
		
		\item Mime~\citep{karimireddy2020mime} with AMSGrad, which performs adaptive local updates with central-server-guided global adaptive learning rate.
		
		\item Our proposed Fed-LAMB and Mime-LAMB methods (Algorithm~\ref{alg:ldams}).
		
	\end{enumerate}
	For all the adaptive gradient methods, we set $\beta_1=0.9$,~$\beta_2=0.999$ by default~\citep{reddi2019convergence}. We present the results of $n=50$ clients with $0.5$ participation rate, i.e., we randomly pick half of the clients to be active for training in each round, and the local mini-batch size is set as 128. In each round, the training samples are allocated to the active devices, and one local epoch is completed after all the local devices run one pass over their received samples via mini-batch training. Results with more clients can be found in the Appendix, which give the same conclusions as what we will present below.
	
	We tune the learning rate $\alpha$ for each algorithm over a fine grid. For Adp-Fed, there are two learning rates involved (global and local), both of which are tuned. More tuning details can be found in the Appendix. For Fed-LAMB and Mime-LAMB, the weight decay rate $\lambda$ is tuned from $\{0,0.01,0.1\}$, and $\phi(x)=x$ is the identity mapping. For each run, we report the best test accuracy. The results are averaged over 3 runs, each starting from the same initialization point.
	
	
	\vspace{0.05in}
	\noindent\textbf{Datasets and models.} We experiment with four popular benchmark image classification datasets: MNIST~\citep{lecun1998mnist}, Fashion MNIST (FMNIST)~\citep{xiao2017fashion}, CIFAR-10~\citep{krizhevsky2009learning} and TinyImageNet~\citep{deng2009imagenet}. For MNIST, we apply 1) a simple multi-layer perceptron (MLP), which has one hidden layer containing 200 cells; 2) a convolutional neural network (CNN), which has two max-pooled convolutional layers followed by a dropout layer and two fully-connected layers with 320 and 50 cells, respectively. This CNN is also used for FMNIST. 
	For CIFAR-10 and TinyImageNet, we use ResNet-18~\citep{he2016deep}.
	
	
	
	\subsection{Comparison under IID settings}
	
	In Figure~\ref{fig:iid}, we report the test accuracy of the MLP trained on MNIST, and of the CNN trained on MNIST and FMNIST, where the data are allocated IID among the clients. We test 1 local epoch and 3 local epochs. In all the figures, we observe a clear advantage of Fed-LAMB over the competing methods in terms of convergence speed. In particular, Fed-LAMB reaches the same accuracy with the fewest communication rounds, thus improving the training efficiency. For instance, on MNIST + CNN (1 local epoch), Fed-AMS requires 20 rounds to achieve 90\% accuracy, while Fed-LAMB only takes 5 rounds. This implies a 75\% reduction in the communication cost and training time. Moreover, on MNIST, Fed-LAMB also leads to improved generalization performance, i.e., test accuracy. We can draw the same conclusions with 3 local epochs. A similar comparison holds for Mime-LAMB vs. Mime. In general, Mime-LAMB and Fed-LAMB perform similarly. 
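	
	The 75\% figure follows directly from the two round counts quoted above, under the assumption (ours, for illustration) that every communication round has the same cost for both methods:
	\begin{align*}
	1 - \frac{5}{20} = 0.75.
	\end{align*}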
	
	
	\begin{figure}[t]
		
		\begin{center}
			\mbox{\hspace{-0.15in}
				\includegraphics[width=1.83in]{figure_final/mnist_testerror_mlp_ep1_iid1_mime.pdf}\hspace{-0.15in}
				\includegraphics[width=1.83in]{figure_final/mnist_testerror_mlp_ep3_iid1_mime.pdf}
			}
			\mbox{\hspace{-0.15in}
				\includegraphics[width=1.83in]{figure_final/mnist_testerror_cnn_ep1_iid1_mime.pdf}\hspace{-0.15in}
				\includegraphics[width=1.83in]{figure_final/mnist_testerror_cnn_ep3_iid1_mime.pdf}
			}
			\mbox{\hspace{-0.15in}
				\includegraphics[width=1.83in]{figure_final/fmnist_testerror_cnn_ep1_iid1_mime.pdf}\hspace{-0.15in}
				\includegraphics[width=1.83in]{figure_final/fmnist_testerror_cnn_ep3_iid1_mime.pdf}
			}
		\end{center}
		\caption{\textbf{IID data setting}. Test accuracy against the number of communication rounds. 
		}
		\label{fig:iid}
		\vspace{-0.1in}
	\end{figure}
	
	\begin{figure}[t]
		
		\begin{center}
			\mbox{
				\hspace{-0.15in}\includegraphics[width=1.83in]{figure_final/mnist_testerror_mlp_ep1_iid0_mime.pdf}\hspace{-0.15in}
				\includegraphics[width=1.83in]{figure_final/mnist_testerror_mlp_ep3_iid0_mime.pdf}
			}
			\mbox{\hspace{-0.15in}
				\includegraphics[width=1.83in]{figure_final/mnist_testerror_cnn_ep1_iid0_mime.pdf}
				\hspace{-0.15in}
				\includegraphics[width=1.83in]{figure_final/mnist_testerror_cnn_ep3_iid0_mime.pdf}
			}
			\mbox{\hspace{-0.15in}
				\includegraphics[width=1.83in]{figure_final/fmnist_testerror_cnn_ep1_iid0_mime.pdf}\hspace{-0.15in}
				\includegraphics[width=1.83in]{figure_final/fmnist_testerror_cnn_ep3_iid0_mime.pdf}
			}
		\end{center}
		\caption{\textbf{non-IID data setting.} Test accuracy against the number of communication rounds.}
		\label{fig:noniid}	
		\vspace{-0.1in}
	\end{figure}
	
	
	\begin{table*}[h]
		\centering
		\resizebox{2\columnwidth}{!}{%
			\begin{tabular}{c|cccccc}
				\toprule[1pt]
				& Fed-SGD    & Adp-Fed    & Fed-AMS    & \textbf{Fed-LAMB}   & Mime & \textbf{Mime-LAMB}        \\ \midrule
				CIFAR-10 & 90.75 $\pm$ 0.48  &91.57 $\pm$ 0.38  & 90.93 $\pm$ 0.22 &  \textbf{92.44 $\pm$ 0.53} & 90.94 $\pm$ 0.13 & \textbf{92.00 $\pm$ 0.21}  \\
				TinyImageNet & 67.58 $\pm$ 0.21  &  74.17 $\pm$ 0.43   & 64.86 $\pm$ 0.83& \textbf{76.00 $\pm$ 0.26} & 67.82 $\pm$ 0.24 & \textbf{73.46 $\pm$ 0.25} \\
				\bottomrule[1pt]
			\end{tabular}
		}
		\caption{Test accuracy with ResNet-18 network after 100 communication rounds.}
		\label{tab:acc}
	\end{table*}
	
	
	
	\subsection{Comparison under non-IID settings}
	
	
	
	
	
	
	
	\begin{figure}[h]
		
		\begin{center}
			\mbox{\hspace{-0.15in}
				\includegraphics[width=1.83in]{figure_final/cifar_testerror_resnet18_ep1_client2_iid0_mime.pdf}
				\hspace{-0.15in}
				\includegraphics[width=1.87in]{figure_final/tinyimagenet_testerror_resnet18_ep1_client2_iid0_mime.pdf}
				% \includegraphics[width=0.4\textwidth]{new_figure/tinyimagenet_testerror_resnet18_ep1_client2_iid0.eps}
			}
		\end{center}
		\caption{\textbf{non-IID data.} Test accuracy of ResNet-18 on CIFAR-10 and TinyImageNet.
		}
		\label{fig:noniidresnet18}
	\end{figure}
	
	\begin{figure}[h]
		\begin{center}
			\mbox{\hspace{-0.15in}
				\includegraphics[width=1.85in]{figure_final/mnist_testerror_cnn_ep1_client50_lazyv35_iid0.pdf}\hspace{-0.15in}
				\includegraphics[width=1.85in]{figure_final/fmnist_testerror_cnn_ep1_client50_lazyv35_iid0.pdf}
			}
		\end{center}
		\caption{\textbf{non-IID data.} Fed-LAMB and Mime-LAMB with skip synchronization of $\hat v$: the global $\hat v_t$ is synchronized every $Z=3$ or $5$ rounds.
		}
		\label{fig:lazy}
	\end{figure}
	
	
	In Figure~\ref{fig:noniid}, we provide the results on MNIST and FMNIST with non-IID local data distribution. In particular, in each round of federated training, every local device only receives samples from one or two classes (out of ten). We see that for experiments with 1 local epoch, our proposed Fed-LAMB outperforms all the baseline methods in all cases. Similar to the IID data setting, Fed-LAMB converges faster and achieves higher test accuracy than Fed-SGD and Fed-AMS. The advantage is especially significant for the CNN model, e.g., it improves the accuracy of Fed-SGD and Fed-AMS by more than 10\% on FMNIST at the 50-th round. The other baseline method, Adp-Fed, performs as well as our Fed-LAMB on FMNIST, but worse than the other methods on MNIST. Mime-LAMB also considerably improves over Mime in all the runs; see Figure~\ref{fig:noniid}.
	
	The relative comparison is basically the same for 3 local epochs, but the advantage of Fed-LAMB becomes less significant than what we observed in Figure~\ref{fig:iid} with IID data. One plausible reason is that, when the local data is highly non-IID and more local steps are taken, learning the local models quickly might not always benefit the global model, as the local models target different loss functions.
	
	
	In Figure~\ref{fig:noniidresnet18}, we present the results on the CIFAR-10 and TinyImageNet datasets with ResNet-18. When training these two models, we decrease the learning rate to $1/10$ of its value at the 30-th and 70-th communication rounds. From Figure~\ref{fig:noniidresnet18}, we can draw a similar conclusion as before: the proposed Fed-LAMB is the best method in terms of both convergence speed and generalization accuracy. In particular, on TinyImageNet, we see that Fed-LAMB has a significant advantage over all four baselines without layer-wise acceleration. Although Adp-Fed performs better than Fed-SGD and Fed-AMS, it is considerably worse than Fed-LAMB. We report the test accuracy at the end of training in Table~\ref{tab:acc}. Fed-LAMB achieves the highest accuracy on both datasets. Mime-LAMB also substantially improves over Mime. 
	
	\vspace{0.05in}
	\noindent\textbf{Skip synchronization}. In Figure~\ref{fig:lazy}, we further present the results of methods with skip synchronization of $\hat v$, where the server updates and broadcasts $\hat v$ every $Z=3,5$ rounds, instead of in every single round. This reduces the communication cost of transmitting the second moment $\hat v$ by a factor of 3 or 5. We see that the empirical performance of skip synchronization is similar to that of the standard design; sometimes it may converge even faster. Our results demonstrate the efficacy of this more efficient strategy in practice.
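	
	The schedule described above can be sketched schematically as follows. This is only an illustrative outline: the choice of synchronizing in rounds with $r \bmod Z = 0$ is our assumption, and the local update and aggregation steps are abbreviated (see Algorithm~\ref{alg:ldams} for the actual procedure).
	\begin{algorithmic}[1]
		\For{$r = 1, \ldots, R$}
		\State Active clients run local updates using their current copy of $\hat v$
		\State Server averages the local models into the global model
		\If{$r \bmod Z = 0$} \Comment{assumed synchronization rounds}
		\State Server updates $\hat v$ and broadcasts it to the clients
		\EndIf
		\EndFor
	\end{algorithmic}
	Under this sketch, $\hat v$ is transmitted $\lfloor R/Z \rfloor$ times over $R$ rounds instead of $R$ times, which is the factor-$Z$ reduction mentioned above.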
	
	
	
	\subsection{Summary of empirical findings}
	
	Here, we provide a brief summary of our empirical results. On all the datasets, in terms of both convergence and generalization, the primary comparisons between our proposed methods and their baselines appear evident:
	\begin{align*}
	\textbf{Fed-LAMB$\approx$ Mime-LAMB$>$Fed-AMS$\approx$Mime.} 
	\end{align*}
	The proposed scheme (with its two variants, Fed-LAMB and Mime-LAMB) exhibits faster convergence and better generalization accuracy than recently proposed adaptive FL algorithms. Our results suggest that using layer-wise acceleration in the local training can speed up the training of locally adaptive federated learning and improve its overall performance. Moreover, in practice we may adopt the skip aggregation strategy to further reduce the additional communication required for our proposed approach, without losing utility. As discussed earlier, Mime-LAMB typically requires more gradient computation than Fed-LAMB. Therefore, given its similar performance to Mime-LAMB, the Fed-LAMB protocol might be more efficient and convenient in practical applications.
	
	
	%{\color{red} 
	%\subsection{Take-away from the experiments}
	%From an empirical point of view, we 
	%} 
	
	
	\section{Conclusion}\label{sec:conclusion}
	
	We study a doubly adaptive method in the particular framework of federated learning (FL). Built upon the acceleration effect of layer-wise learning rate scheduling and of state-of-the-art adaptive gradient methods, we derive a locally layer-wise FL framework that performs local updates using adaptive AMSGrad on each worker and periodically averages local models stored on each device. 
	The core of our Fed-LAMB scheme is to speed up local training by adopting layer-wise adaptive learning rates. To our knowledge, this is the first FL algorithm in the literature that possesses both \emph{dimension-wise} adaptivity (by AMSGrad) and \emph{layer-wise} adaptivity (by layer-wise adjusted learning rates). 
	
	We provide a convergence analysis of Fed-LAMB whose rate matches that of many existing methods, with a linear speedup in the number of clients. We also provide a skip aggregation trick to further reduce the communication overhead. Extensive experiments on various datasets and models, under both IID and non-IID data settings, validate that both Fed-LAMB and Mime-LAMB provide faster convergence, which in turn leads to reduced communication and training time to reach a given accuracy. In many cases, our framework also improves the overall performance of federated learning over prior methods. 
	
	Adaptive FL (at central server) with communication compression has been studied in~\citep{li2022distributed,li2023analysis}. In the future, we may also study Fed-LAMB type locally adaptive algorithms with communication compression.
	
	
	\newpage
	\clearpage
	
	\bibliographystyle{plainnat}
	\bibliography{karimi_320}
	
	
\end{document}
