\documentclass[accepted]{uai2022} 
%
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage[round]{natbib} % has a nice set of citation styles and commands
    % \bibliographystyle{plainnat}
    \bibliographystyle{uai_ref}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
% \usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{tablefootnote}
\usepackage{hhline}
\usepackage{dirtytalk}
\usepackage{float}

\usepackage{url}            % simple URL typesetting
% \usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}


\usepackage{array}
\usepackage{amssymb}% http://ctan.org/pkg/amssymb
\usepackage{pifont}
\usepackage{empheq}

% \usepackage[dvipsnames]{xcolor}
% \PassOptionsToPackage{dvipsnames}{xcolor}

% \usepackage[pagebackref=true]{hyperref}
% \PassOptionsToPackage{pagebackref=true}{hyperref}
% \usepackage{cleverref}
\definecolor{blue(pigment)}{rgb}{0.2, 0.2, 0.6}
\definecolor{yaleblue}{rgb}{0.06, 0.3, 0.57}
\definecolor{mediumblue}{rgb}{0.0, 0.0, 0.8}
\definecolor{vegasgold}{rgb}{0.77, 0.7, 0.35}
\usepackage[hyperpageref]{backref}
\hypersetup{
    % pagebackref=true,
    colorlinks=true,
    citecolor=vegasgold,%brown,
    linkcolor=mediumblue,%blue, % red
    filecolor=magenta,      
    urlcolor=blue(pigment),
} 

\usepackage{soul}
% \definecolor{myred}{rgb}{1,0.5,0}
\definecolor{myred}{rgb}{0.9333,0.8039,0.7922}
\sethlcolor{myred}

\newcommand{\hlfancy}[2]{\sethlcolor{#1}\hl{#2}}
\newcommand{\mathcolorbox}[2]{\colorbox{#1}{$#2$}}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
% \newcommand{\squeeze}{\textstyle} % when deployed
\newcommand{\squeeze}{}  % when not in use

\renewcommand*{\backrefalt}[4]{%
    \ifcase #1 {\footnotesize (Not cited.)}%
    \or        {\footnotesize (Cited on page~#2)}%
    \else      {\footnotesize (Cited on pages~#2)}%
    \fi}

\newcommand{\PreserveBackslash}[1]{\let\temp=\\#1\let\\=\temp}
\newcolumntype{C}[1]{>{\PreserveBackslash\centering}p{#1}}
\newcolumntype{R}[1]{>{\PreserveBackslash\raggedleft}p{#1}}
\newcolumntype{L}[1]{>{\PreserveBackslash\raggedright}p{#1}}

\input{commands.tex}

\title{Shifted Compression Framework: Generalizations and Improvements}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
% 
% Important: in case of equal contributions, we strongly recommend to NOT show it in this part of the paper, but rather describe it in the appropriate section at the end of the paper "Author Contribution", where you have more space to describe how each author contributed.
%
% Add authors
% Remember to use the order convention "First/Given name" "Last/Family name", e.g. John Smith, Hanako Yamada, Marco Rossi, Wei Zhang
% \href{mailto:<egor.shulgin@kaust.edu.sa>?Subject=Shifted Compression UAI 2022 paper
\author[1]{\href{https://shulgin-egor.github.io}{Egor Shulgin}{}}
\author[1]{Peter Richtárik}
% Add affiliations after the authors
\affil[1]{%
    King Abdullah University of Science and Technology (KAUST)\\
    Thuwal, Saudi Arabia
}
  
  \begin{document}
\maketitle

\begin{abstract}
  Communication is one of the key bottlenecks in the distributed training of large-scale machine learning models, and lossy compression of exchanged information, such as stochastic gradients or models, is one of the most effective instruments to alleviate this issue. Among the most studied compression techniques is the class of unbiased compression operators with variance bounded by a multiple of the squared norm of the vector we wish to compress. By design, this variance may remain high, and only diminishes if the input vector approaches zero. However, unless the model being trained is overparameterized, there is no a priori reason for the vectors we wish to compress to approach zero during the iterations of classical methods such as distributed compressed {\sf SGD}, which has adverse effects on the convergence speed. To circumvent this issue, several more elaborate and seemingly very different algorithms have been proposed recently. These methods are based on the idea of compressing the {\em difference} between the vector we would normally wish to compress and some auxiliary vector that changes throughout the iterative process. In this work we take a step back, and develop a unified framework for studying such methods, both conceptually and theoretically. Our framework incorporates methods compressing both gradients and models, using unbiased and biased compressors, and sheds light on the construction of the auxiliary vectors. Furthermore, our general framework can lead to the improvement of several existing algorithms, and can produce new algorithms. 
  Finally, we perform several numerical experiments to illustrate and support our theoretical findings.
\end{abstract}

\section{INTRODUCTION}\label{sec:intro}
We consider the distributed optimization problem
\begin{equation} \label{eq:problem}
    \min_{x \in \bR^d}~\sbr{f(x) \eqdef \frac{1}{n} \sum_{i=1}^n f_i(x)}, \tag{$\star$}
\end{equation}
where $n$ is the number of workers/clients and $f_i: \bR^d \to \bR$ is a smooth function representing the loss of the model parametrized by $x \in \bR^d$ for data stored on node $i$. This formulation % is mathematically equivalent to Empirical Risk Minimization (ERM) and 
has become very popular in recent years due to the increasing need for training large-scale machine learning models \citep{goyal2018accurate}.
% rapid increase of training data, which sometimes can not be stored on one machine. 
% Another notable application is Federated Learning \citep{mcmahan2017communication, konecny2017federated, kairouz2021advances}, where each node $i$ represents the client.
% \textbf{Communication bottleneck} 

\textbf{Communication bottleneck.} Compute nodes have to exchange information in a distributed learning process. The size of the sent messages (usually gradients or model updates) can be very large, which creates a significant bottleneck \citep{luo2018phub, peng2019generic, sapio2021switchml} for the whole training procedure. One of the main practical solutions to this problem is lossy \textit{communication compression} \citep{seide20141, konecny2017federated, alistarh2017qsgd}. %of the communicated vectors. 
It suggests applying a (possibly randomized) mapping $\cC: \Rd \to \Rd$ to a vector/matrix/tensor $x$ before it is transmitted, in order to produce a less accurate estimate $\cC(x)$ and thus reduce the number of bits sent in every communication round. 
% bits exchanged
% and  thus required for solving the problem.
% before transmitting estimates $\cQ(x)$ of the vector $x$ less accurate

% \textbf{Related Work}
% Compression, Variance reduction, compressed iterates

% \renewcommand{\thefootnote}{\roman{footnote}}

\begin{table*}[!h]
\setlength{\tabcolsep}{2pt}
% \small
\caption{%Summary of contributions. 
Overview of results for methods obtained as special cases of our general framework \algname{DCGD-SHIFT} (Alg.~\ref{alg:dcgd_shift}).
Iteration complexities are presented in $\tilde{\cO}$-notation to omit $\log 1/\eps$ factors and for the simplified case $\omega_i \equiv \omega, \delta_i \equiv \delta, L_i \equiv L$, $p_i \equiv p$. 
More refined statements are given in the theorems referenced in the last column.
% The last two rows refer to the methods with compressed iterates.
Complexities for \algname{DCGD-SHIFT} and \GDCI are shown in the interpolation regimes: $\nabla f_i(x^\star) = 0 = x^\star - \gamma \nabla f_i(x^\star)$.
}
% For simplicity of demonstration rates are presented for case $\omega_i \equiv \omega$, $L_i \equiv L$ and $p_i \equiv p$.
\centering
\setlength{\tabcolsep}{5pt}
\begin{center}
{\def\arraystretch{1.5}
\begin{tabular}{ccccc} %\label{table:shifts}
% \hline
% \specialrule{1pt}{1pt}{1pt}
\toprule
\textbf{Instance of \algname{DCGD-SHIFT}}  & \textbf{Shift}  & \textbf{Previous}  &  \textbf{Our result}   &  \textbf{Theorem} \\%\footnotemark\\ 
% \specialrule{1.5pt}{0.1pt}{0.1pt}
\midrule
% \hhline{|====|}
% $\begin{array}{cc} \text{\algname{DCGD-FIXED}} \\ \text{New} \end{array}$
\algname{DCGD-FIXED} (this work) & \eqref{eq:fixed_shift} & $-$ & $\kappa \br{1 + \frac{\omega}{n}}$ & \ref{thm:fixed_shift}\\
% \hline
% \algname{DCGD-SHIFT$^\star$}/\algname{STAR}
\algname{DCGD-STAR} (this work) & \eqref{eq:opt_shift} & $-$ & $\kappa \br{1 + \frac{\omega}{n}\br{1-\delta}}$ & \ref{thm:optimal_shift}\\
% \hline
\DIANA \citep{mishchenko2019distributed} & \eqref{eq:diana_shift} & $\max\left\{\kappa\br{1 + \frac{\omega}{n}}, \omega \right\}$ & % \cite{mishchenko2019distributed}
$\max\left\{\kappa\br{1 + \frac{\omega}{n} {\color{cyan}\br{1 - \delta}}}, \omega {\color{cyan}\br{1 - \delta}} \right\}$ & \ref{thm:diana_shift} \\
% \hline
\RDIANA (this work) & \eqref{eq:r_diana_shift} & $-$ & 
    $\max \left\{\kappa\br{1 + \frac{\omega}{n}\br{1-\delta}}, \frac{1}{p}\right\}$ & \ref{thm:r_diana_shift} \\
% \hhline{|===|}
\specialrule{0.2pt}{1pt}{1pt}
\GDCI \citep{khaled2019gradient} & \eqref{eq:distributed_gdci} & $\kappa^{\color{red}2} \br{1 + \frac{\omega}{n}}$ & $\kappa \br{1 + \frac{\omega}{n}}$ & \ref{thm:gdci} \\
% \hline

% \algname{VR-GDCI} \citep{chraibi2019distributed}%~\eqref{thm:vr_gdci} 
% &  & $\max\left\{\kappa^{\color{red}2} \br{1 + \frac{\omega}{n}}, \omega\right\}$ & % \cite{chraibi2019distributed}
% $\max\left\{\kappa\br{1 + \frac{\omega}{n}}, \omega \right\}$ & \ref{thm:vr_gdci}\\

% \hline
% \specialrule{1pt}{1pt}{1pt}
\bottomrule
\end{tabular}
}
\end{center}\label{table:complexities}
\end{table*}

% \textbf{Short review of compressors}

\textbf{Compression operators.}
The topic of gradient compression in distributed learning has been studied extensively in recent years from both practical \citep{xu2020compressed} and theoretical \citep{beznosikov2020biased, safaryan2021uncertainty, albasyoni2020optimal} perspectives. Compression operators are typically divided into two large groups: \textit{unbiased} and \textit{biased} operators. The first group includes methods based on some form of rounding or \textit{quantization}: Random Dithering \citep{goodall1951television, roberts1962picture}, Ternary quantization \citep{wen2017terngrad}, Natural \citep{horvath2019natural}, and Integer \citep{mishchenko2021intsgd} compression. Another popular example is random \textit{sparsification} -- \texttt{Rand-K} \citep{wangni2018gradient, stich2018sparsified, konevcny2018randomized}, which preserves only a subset of the original vector coordinates. These two approaches can also be combined \citep{basu2019qsparse} for even more aggressive compression. There are also many other approaches based on low-rank approximation \citep{vogels2002powergossip, wang2018atomo, safaryan2021fednl}, vector quantization \citep{gandikota2021vqsgd}, etc. % on atomic decomposition \citep{wang2018atomo},
% There is also a line of work on 
The second group, biased compressors, mainly includes greedy sparsification -- \texttt{Top-K} \citep{alistarh2018convergence, stich2018sparsified} and various sign-based quantization methods \citep{seide20141, bernstein2018signsgd, safaryan2021stochastic}. For a more complete review of compression operators, one can refer to the surveys by \citet{xu2020compressed} and \citet{beznosikov2020biased, safaryan2021uncertainty}.

\textbf{Optimization algorithms.}
% It is important to mention that 
Compression operators alone are not sufficient for building a distributed learning system: they must be paired with optimization algorithms. Distributed Compressed Gradient Descent (\DCGD) \citep{khirirat2018distributed} is one of the first theoretically analyzed methods that supports arbitrary unbiased compressors. The issue with \DCGD is that, with a constant step-size, it was proven to converge linearly only to a neighborhood of the optimal point. \DIANA \citep{mishchenko2019distributed} fixed this problem by compressing specially designed gradient differences. Later, \DIANA was generalized \citep{condat2021murana}, combined with variance reduction \citep{horvath2019stochastic}, accelerated \citep{li2020acceleration} in Nesterov's sense \citep{nesterov1983method}, and improved by using smoothness matrices \citep{safaryan2021smoothness} with a properly designed sparsification technique.
% combined with biased compressors \citep{gorbunov2020linearly}.

On the other hand, there are methods working with biased compressors, which require the use of the error-feedback ({\sf EF}) mechanism \citep{seide20141, alistarh2018convergence, stich2019error}. 
Such algorithms were often considered to be better in practice due to the smaller variance of biased updates \citep{beznosikov2020biased}. However, it was recently demonstrated that biased compressors can be incorporated into specially designed unbiased operators, which yields results superior to error feedback \citep{horvath2021better}. In addition, error feedback was recently combined with the \DIANA trick \citep{gorbunov2020linearly}, which led to the first linearly converging method with {\sf EF}.
Later, \citet{condat2022ef} proposed a unified framework for methods with biased and unbiased compressors.
% From algorithmic view these approaches are significantly different because 

\textbf{Compressed iterates.} %Alternative approach to 
Most of the existing literature (including all methods described above) focuses on compression of the gradients, while in applications like Federated Learning \citep{mcmahan2017communication, konecny2017federated, kairouz2021advances}, it is vital to reduce the size of the broadcast model parameters \citep{reisizadeh2020fedpaq}. This demand gives rise to optimization algorithms with compressed iterates. The first attempt to analyze such methods was made by \citet{khaled2019gradient} for Gradient Descent with Compressed Iterates (\GDCI) in the single-node setup. Later, \GDCI was combined with variance reduction for the noise introduced by compression and extended to the much more general setting of distributed fixed-point methods \citep{chraibi2019distributed}. % Instead of gradients model parameters can be compressed.

\textbf{Summary of contributions.} % Contributions
% \blacklozenge
The obtained results are summarized in Table~\ref{table:complexities}, with the improvements over previous works highlighted. The main contributions include:
% The main contributions of this work include:

% \textbf{1. Unified analysis.} 
\textbf{1. Generalizations of existing methods.}
We introduce the concept of a \textit{Shifted Compressor}, which generalizes a common definition of compression operators used in distributed learning. This technique allows us to study various strategies for updating the shifts using both biased and unbiased compressors, and to recover and improve previously known methods such as \DCGD and \DIANA. Additionally, as a byproduct, a new algorithm is obtained: \algname{DCGD-STAR}, which achieves linear convergence to the exact solution if we know the local gradients at the optimum. %While this algorithm is practically useful
% In addition we study 

\textbf{2. Improved rates.}
The notion of a shifted compressor allows us to revisit the existing analysis of distributed methods with \textit{compressed iterates} and improve the guarantees in both cases: with and without variance reduction. The obtained results indicate that algorithms with model compression can have the same complexity as compressed gradient methods.

\textbf{3. New algorithm.}
We present a novel distributed algorithm with compression, called \algname{Randomized DIANA}, with a linear convergence rate to the exact optimum. It has a significantly \textit{simpler analysis} than the original \DIANA method. By examining its experimental performance, we highlight the cases in which it can outperform \DIANA in practice.
% capable of learning the gradients in the optimum

% Obtained results are summarized in Table~\ref{table:complexities} with highlighted improvements over the previous works.
% In Table~\ref{table:complexities} we give a high level overview of existing methods for VIs, and contrast them with our methods and results. 
% are summarized in Table~\ref{table:complexities} and


% \footnotetext{Complexities for non variance-reduced methods are in the interpolation regimes: $\nabla f_i(x^\star) = 0 = x^\star - \gamma \nabla f_i(x^\star)$.}
% \footnotetext{Complexities for \algname{DCGD-SHIFT} and \GDCI are in the interpolation regimes: $\nabla f_i(x^\star) = 0 = x^\star - \gamma \nabla f_i(x^\star)$.}

\section{GENERAL FRAMEWORK} % General Framework

In this section we introduce compression operators and the framework of shifted compressors.

\subsection{Standard Compression}

We first recall some basic definitions.

% \egor{Where to include examples (Rand-K, quantization): here or earlier?}

\begin{definition}[General contractive compressor]  \label{def:general_compressor}
A (possibly) randomized mapping $\cC: \bR^d \to \bR^d$ is a {\bf compression operator} ($\cC \in \bB(\delta)$ for brevity) if for some $\delta \in (0, 1]$ and $\forall x \in \bR^d$
\begin{equation*} \squeeze
    \E{\norm{\cC(x) - x}^2} \leq (1-\delta) \norm{x}^2,
\end{equation*}
% for some $\delta \geq 1$. 
where the expectation is taken with respect to the (possible) randomness of the operator $\cC$. 
% This inequality implies that 
% \begin{equation*}
%     \E{\norm{\cC(x)}^2} \leq (1 + \delta) \norm{x}^2.
% \end{equation*}
\end{definition}

One of the best-known operators from this class is \textit{greedy sparsification} (\texttt{Top-K} for $K \in \{1, \dots, d\}$):
\begin{equation*}
%\textstyle
\squeeze
    \cC_{\texttt{Top-K}}(x) \eqdef \sum\limits_{i=d-K+1}^d x_{(i)} e_{(i)},
\end{equation*}
where the coordinates are ordered by magnitude so that $|x_{(1)}| \leq |x_{(2)}| \leq \cdots \leq |x_{(d)}|$, and $e_{(i)} \in \Rd$ denotes the standard unit basis vector corresponding to the coordinate $x_{(i)}$. This compressor belongs to $\bB\br{K/d}$.
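
To illustrate the constant $K/d$, consider a small worked example (the vector below is chosen purely for illustration). Let $d = 3$, $K = 1$ and $x = (3, -1, 2)^\top$, so that $\cC_{\texttt{Top-K}}(x) = (3, 0, 0)^\top$. Then
\begin{equation*}
    \norm{\cC_{\texttt{Top-K}}(x) - x}^2 = (-1)^2 + 2^2 = 5 \leq \br{1 - \frac{1}{3}} \cdot 14 = \br{1 - \frac{K}{d}} \norm{x}^2.
\end{equation*}
In general, the $d - K$ discarded coordinates are the smallest in magnitude, so the average of their squares does not exceed the average over all coordinates, $\frac{1}{d} \norm{x}^2$. Summing over the discarded coordinates gives $\norm{\cC_{\texttt{Top-K}}(x) - x}^2 \leq \frac{d-K}{d} \norm{x}^2$.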

\begin{definition}[Unbiased compressor] \label{def:unbiased_compressor}
A randomized mapping $\cQ: \bR^d \to \bR^d$ is an {\bf unbiased compression operator} ($\cQ \in \bU(\omega)$ for brevity) if for some $\omega \geq 0$ and $\forall x \in \bR^d$
\begin{equation*}
\begin{aligned} \squeeze
    &(a) \E \cQ(x) = x, \hspace{58pt} \text{  (Unbiasedness)} \\ %\hspace{69pt}
    &(b) \E\norm{\cQ(x) - x}^2 \leq \omega \norm{x}^2 \quad \text{ (Bounded variance)} %\quad
\end{aligned}
\end{equation*}
% \begin{equation*}
%     \E \cQ(x) = x, \qquad \E\norm{\cQ(x) - x}^2 \leq \omega \norm{x}^2
% \end{equation*}
Together, properties (a) and (b) imply that 
\begin{equation} \label{eq:second_moment}
\squeeze
    \E \sqn{\cQ(x)} \leq (1 + \omega) \sqn{x}.
\end{equation}
\end{definition}
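
For completeness, inequality~\eqref{eq:second_moment} can be verified by a standard bias-variance decomposition, which uses both properties (a) and (b):
\begin{equation*}
\begin{aligned}
    \E{\sqn{\cQ(x)}} &= \E{\sqn{\cQ(x) - x}} + 2 \E{\langle \cQ(x) - x, x \rangle} + \sqn{x} \\
    &= \E{\sqn{\cQ(x) - x}} + \sqn{x} \leq (1 + \omega) \sqn{x},
\end{aligned}
\end{equation*}
where the cross term vanishes because $\E{\cQ(x) - x} = 0$.
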
A notable example from this class is the \textit{random sparsification} (\texttt{Rand-K} for $K \in \{1, \dots, d\}$) operator:
% This class includes compressors like 
% $\bullet$ \textit{Rand-K} \textit{sparsification}: 
\begin{equation} \label{eq:rand-k}
%    \textstyle
\squeeze
    \cQ_{\texttt{Rand-K}}(x) \eqdef \frac{d}{K} \sum\limits_{i \in S} x_i e_i,
\end{equation}
% $\cQ(x) = \frac{d}{K} \sum_{i \in S} x_i e_i \in \bU\br{\frac{d}{K}+1}$, 
where $S$ is a random subset of $[d] \eqdef \{1, \dots, d\}$ sampled uniformly from all subsets of $[d]$ of cardinality $K$. \texttt{Rand-K} belongs to $\bU\br{d/K-1}$.
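
The value $\omega = d/K - 1$ can be checked coordinate-wise; the following is a short sketch of the standard calculation. Each index $i \in [d]$ belongs to $S$ with probability $K/d$, in which case the $i$-th coordinate of $\cQ_{\texttt{Rand-K}}(x) - x$ equals $\br{\frac{d}{K} - 1} x_i$; otherwise it equals $-x_i$. Hence
\begin{equation*}
\begin{aligned}
    \E{\sqn{\cQ_{\texttt{Rand-K}}(x) - x}}
    &= \sum_{i=1}^d \sbr{\frac{K}{d} \br{\frac{d}{K} - 1}^2 + \br{1 - \frac{K}{d}}} x_i^2 \\
    &= \br{\frac{d}{K} - 1} \sqn{x},
\end{aligned}
\end{equation*}
so the bound in property (b) holds with equality. Unbiasedness follows in the same way, since the expected value of the $i$-th coordinate is $\frac{K}{d} \cdot \frac{d}{K} x_i = x_i$.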

% $\bullet$ Random \textit{Quantization}: 
% \begin{equation}
%     \sbr{\cQ(x)}_i = \sign(x) \cdot \|x\|_2 \cdot \xi_i(x, s),
%     % \cQ(x) = \sign(x) \cdot\|x\|_{p} \cdot \frac{1}{s} \cdot\left\lfloor s \frac{|x|}{\|x\|_p} + \xi\right\rfloor,
% \end{equation}
% % for random variable $\xi \sim_{u.a.r.} [0, 1]^d$, parameter $p \geq 1$ and $s \in \bN_+$, denoting the levels of the rounding. 
% where

Notice that property (a) from Definition~\ref{def:unbiased_compressor} is \say{uniform} across all vectors $x$, while property (b) is not. Namely, the vector $x = 0$ is treated \emph{in a special way} because $\E \sqn{\cQ(0) - 0} = 0,$
% \begin{equation*}
%     \E \sqn{\cQ(0) - 0} = 0,
% \end{equation*}
which means that the compressed zero vector has \emph{zero variance}. In other words, zero is mapped to itself with probability 1.

\subsection{Compression with Shift}
% Generalization of the class of unbiased compressors.
We can generalize the class of unbiased compressors $\bU(\omega)$ to a class of operators with other (not only 0) \say{special} vectors. Specifically, this class allows for \textbf{shifts} away from the origin, which is formalized in the following definition.

\begin{definition}[Shifted compressor] \label{def:shifted_compressor}
% Let $h \in \bR^d$. 
A randomized mapping $\cQ_h: \bR^d \to \bR^d$ is a \textbf{shifted compression operator} ($\cQ_h \in \bU(\omega; h)$ in short) if there exists $\omega \geq 0$ such that $\forall x \in \bR^d$
% \begin{equation*}
%     \E \cQ_h(x) = x, \qquad \E\norm{\cQ_h(x) - x}^2 \leq \omega \norm{x - h}^2.
% \end{equation*}
\begin{equation*}
\begin{aligned}
\squeeze
    &(a) \E \cQ_h(x) = x, \\
    &(b) \E\norm{\cQ_h(x) - x}^2 \leq \omega \norm{x - h}^2.
\end{aligned}
\end{equation*}
The vector $h \in \bR^d$ is called a \textbf{shift}. Note that the class of unbiased compressors $\bU(\omega)$ coincides with $\bU(\omega; 0)$.
\end{definition}

The next lemma shows that shifts add up and all shifted compression operators $\cQ_h \in \bU(\omega; h)$ arise by a shift of some operator $\cQ_0$ from $\bU(\omega; 0)$.
    
\begin{lemma}[Shifting a shifted compressor] \label{lemma:shifting_shift}
    Let $\cQ_h \in \bU(\omega; h)$ and $v \in \bR^d$. Then the (possibly) randomized mapping $\cQ$ defined by
    \begin{equation*}
    % \textstyle
    \squeeze
        \cQ(x) \eqdef v + \cQ_h (x - v) 
    \end{equation*}
    satisfies $\cQ \in \bU(\omega; h + v)$.
\end{lemma}
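
A short verification of Lemma~\ref{lemma:shifting_shift}, included here as a sketch, follows directly from Definition~\ref{def:shifted_compressor} applied to $\cQ_h$ at the point $x - v$:
\begin{equation*}
\begin{aligned}
    \E{\cQ(x)} &= v + \E{\cQ_h(x - v)} = v + (x - v) = x, \\
    \E{\sqn{\cQ(x) - x}} &= \E{\sqn{\cQ_h(x - v) - (x - v)}} \\
    &\leq \omega \sqn{(x - v) - h} = \omega \sqn{x - (h + v)}.
\end{aligned}
\end{equation*}
In particular, taking $\cQ_0 \in \bU(\omega; 0)$ and $v = h$ shows that the mapping $x \mapsto h + \cQ_0(x - h)$ belongs to $\bU(\omega; h)$.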
% This Lemma allows to transform any unbiased compressor $\cQ_0$ into a shifted one.
% \egor{Maybe include proof (and Representation Theorem) in the Appendix.}
% We are going to be mainly interested in additively shifted compressors for the first part of the paper (compressing the gradients).

% Now we are ready to introduce how this shifted compressor is used in our framework.
The concept of a \emph{shifted compressor} allows us to construct a
shifted compressed \textbf{gradient estimator} $g_h$ from an operator $\cQ_h \in \bU(\omega; h)$, given by %(Unbiasedness and variance)
\begin{equation} \label{eq:shifted_gradient}
    \squeeze
    g_h(x) = \cQ_h\br{\nabla f(x)} = h + \cQ(\nabla f(x) - h),
\end{equation}
% for $\cQ_h \in \bU(\omega; h)$. 
which is the main focus of this work. In particular, we are going to study different mechanisms for choosing this shift vector throughout the optimization process.

\textit{Note:} The estimator~\eqref{eq:shifted_gradient} is unbiased whenever the operator $\cQ$ satisfies $\E\cQ(x) = x$.
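
To make this explicit, for a fixed shift $h$ and $\cQ \in \bU(\omega)$, a direct calculation (a sketch based on Definition~\ref{def:unbiased_compressor}) gives
\begin{equation*}
\begin{aligned}
    \E{g_h(x)} &= h + \E{\cQ\br{\nabla f(x) - h}} = \nabla f(x), \\
    \E{\sqn{g_h(x) - \nabla f(x)}} &= \E{\sqn{\cQ\br{\nabla f(x) - h} - \br{\nabla f(x) - h}}} \\
    &\leq \omega \sqn{\nabla f(x) - h}.
\end{aligned}
\end{equation*}
Thus the variance of the estimator is controlled by the distance between the gradient and the shift, and it vanishes when $h = \nabla f(x)$.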

% This kind of an estimator was first introduced in \algname{SEGA} paper \citep{hanzely2018sega} for a very special case of a randomized sketch operator $\cQ$. Later in \citep{sigma_k}, the stochastic gradient method called \algname{SGD-star} had a similar structure, with $\cC$ chosen as the identity operator, and $h = \nabla f(x^\star)$.
% A very special case of this estimator at first was introduced in \algname{SEGA} paper \citep{hanzely2018sega} of operator $\cQ$ as sketches.

Estimator~\eqref{eq:shifted_gradient} uses an operator $\cQ$ from the class of unbiased compressors $\bU(\omega)$, which are usually easier to analyze but have higher empirical variance than their biased counterparts \citep{beznosikov2020biased}. 
In an attempt to kill two birds with one stone, we can incorporate the (possibly) biased compressor $\cC \in \bB(\delta)$ into $h$ using a similar shift trick:
% make use of $h$ by choosing it in the following way
\begin{equation} \label{eq:structured_shift}
\squeeze
    h = s + \cC(\nabla f(x) - s),
\end{equation}
as $g_h(x)$ allows for virtually any shift vector. This leads to the following estimator\footnote{The resulting estimator is related to the induced compressor \citep{horvath2021better} $\cQ_{ind}(x) = \cC(x) + \cQ\br{x - \cC(x)}$, which belongs to the class $\bU(\omega(1 - \delta))$
for $\cC \in \bB(\delta)$ and $\cQ \in \bU(\omega)$.}
\begin{equation}
\boxed{
\begin{aligned}
    g_h(x) &= h + \cQ\br{\nabla f(x) - h} \\
    &= s + \cC(\nabla f(x) - s) \\&\hspace{18pt} + \cQ\br{\nabla f(x) - s - \cC(\nabla f(x) - s)}.
\end{aligned}
}
\end{equation}
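
The benefit of this construction can be seen from a short variance calculation, given here as a sketch under the assumptions that $s$ is fixed, $\cQ \in \bU(\omega)$, $\cC \in \bB(\delta)$, and the randomness of $\cQ$ is independent of that of $\cC$. Conditioning on $\cC$ and applying property (b) of Definition~\ref{def:unbiased_compressor}, and then taking the expectation over $\cC$ and applying Definition~\ref{def:general_compressor}, we get
\begin{equation*}
\begin{aligned}
    \E{\sqn{g_h(x) - \nabla f(x)}}
    &\leq \omega \E{\sqn{\cC\br{\nabla f(x) - s} - \br{\nabla f(x) - s}}} \\
    &\leq \omega (1 - \delta) \sqn{\nabla f(x) - s}.
\end{aligned}
\end{equation*}
This matches the factor $\omega(1 - \delta)$ of the induced compressor of \citet{horvath2021better}: the biased compressor reduces the variance constant from $\omega$ to $\omega(1 - \delta)$ while the estimator remains unbiased.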

\subsection{The Meta-Algorithm}

Now we are ready to present the general distributed optimization algorithm for solving~\eqref{eq:problem} that employs shifted gradient estimators
\begin{equation*}
    g_h(x) = 
    \frac{1}{n} \sum_{i=1}^n g_{h_i}(x) =  
    \frac{1}{n} \sum_{i=1}^n \sbr{h_i + \cQ_i \br{\nabla f_i(x) - h_i}}.
\end{equation*}
\begin{algorithm}[H]
\begin{algorithmic}[1] \caption{Distributed Compressed Gradient
Descent with Shift (\algname{DCGD-SHIFT})} \label{alg:dcgd_shift}
    \State \textbf{Parameters:} learning rate $\gamma>0$; unbiased compressors $\cQ_1, \ldots, \cQ_n$; initial iterate $x^0 \in \bR^d$, initial local shifts $h_1^0, \ldots, h_n^0 \in \bR^d$ (stored on the $n$ nodes)
    \State \textbf{Initialize:} $h^0 = \frac{1}{n} \sum_{i=1}^n h_i^0$ (stored on the master)
    \For{$k = 0, 1, 2, \ldots$}
    \State Broadcast $x^k$ to all workers
    \For{$i = 1, \ldots, n$} in parallel
    \State Compute local gradient: $\nabla f_i(x^k)$
    \State Compress: $m_i^k = \cQ_i(\nabla f_i(x^k) - h_i^k)$ % Compress shifted local gradient
    \State $\mathcolorbox{myred}{\text{Update the local shift: } h_i^{k+1}}$
    % \hl{Update the local shift: } \hlfancy{myred}{$h_i^{k+1}$}
    % {\color{red} Update the local shift: $h_i^{k+1}$}
    \State Send $m_i^k$ and (possibly) $h_i^{k+1}$ to the master %\label{alg:diana_n:shift}
%    Send message $m_i^k$ and (possibly) the shifts $h_i^{k+1}$ to the master %\label{alg:diana_n:shift}
    \EndFor
    \State Aggregate received messages: $m^k = \frac{1}{n} \sum_{i=1}^n m_i^k$
    \State Compute global estimator: $g^k = h^k + m^k$ % Compute global gradient estimator
    \State Take gradient descent step: $x^{k+1} = x^k - \gamma g^k$
    \State $\mathcolorbox{myred}{\text{Update aggregated shift: } h^{k+1} = \frac{1}{n} \sum_{i=1}^n h_i^{k+1}}$ \label{alg:dcgd_shift:line:master_shift}
    % \hl{Update aggregated shift: $h^{k+1} = \frac{1}{n} \sum_{i=1}^n h_i^{k+1}$}\label{alg:dcgd_shift:line:master_shift}
    % {\color{red}Update aggregated shift: $h^{k+1} = \frac{1}{n} \sum_{i=1}^n h_i^{k+1}$}\label{alg:dcgd_shift:line:master_shift} %  = h^{k} + \alpha m^{k} ~\ref{alg:diana:shift}
    \EndFor
\end{algorithmic}
% \egor{Does it make sense to include compression on the master for maximum generality?}
% \egor{Do I need to include definition of proximal operator or it can be omitted in this work?}
\end{algorithm}
% \egor{Add method description in words.}

In Algorithm~\ref{alg:dcgd_shift}, each worker $i=1, \dots, n$ queries the gradient oracle $\nabla f_i (x^k)$ in iteration $k$. Then, a compression operator is applied to the difference between the local gradient and the local shift, and the result (and possibly the new shift) is sent to the master. The shift is updated on both the master and the workers. After receiving the messages $m_i^k$, the master forms a global gradient estimator $g^k$ and performs a gradient step.
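
As a sanity check, the global estimator is unbiased regardless of how the shifts are chosen. The following sketch assumes that the compressors $\cQ_i$ are unbiased and that the master's shift satisfies $h^k = \frac{1}{n} \sum_{i=1}^n h_i^k$, as maintained in line~\ref{alg:dcgd_shift:line:master_shift}; the expectation is conditional on $x^k$ and the shifts $h_1^k, \ldots, h_n^k$:
\begin{equation*}
\begin{aligned}
    \E{g^k}
    &= h^k + \frac{1}{n} \sum_{i=1}^n \E{\cQ_i\br{\nabla f_i(x^k) - h_i^k}} \\
    &= h^k + \frac{1}{n} \sum_{i=1}^n \br{\nabla f_i(x^k) - h_i^k}
    = \nabla f(x^k).
\end{aligned}
\end{equation*}
If, in addition, $\cQ_i \in \bU(\omega_i)$ and the compressors are independent across workers (an extra assumption made here for illustration), the cross terms vanish and
\begin{equation*}
    \E{\sqn{g^k - \nabla f(x^k)}} \leq \frac{1}{n^2} \sum_{i=1}^n \omega_i \sqn{\nabla f_i(x^k) - h_i^k},
\end{equation*}
which shows that the quality of the estimator depends on how well each local shift tracks the corresponding local gradient.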

Note that this method is not fully defined because it requires a description of the mechanism for updating the shifts $h_i^{k+1}$ (highlighted in \hl{color})
%{\color{red}red}) 
throughout the iteration process on both workers and master. In the next section, we illustrate how the shifts can be chosen and updated.

% In the next section, we provide a general convergence guarantee for this algorithm with fixed shifts 
% \begin{equation} \label{eq:fixed_shift}
%     h_k^i \equiv h^i
% \end{equation} and in the next section discuss how it can be fixed and later made practical.

% How can we construct the estimator to eliminate the neighborhood?

% We ask the question:
% \begin{center}
% \say{\textit{
% How we can learn the optimal shift and are there any better ways to do it? % can we come up with some better approaches
% % What is the best way to learn the shift?
% }}
% \end{center}


% \subsection{Shifting compressed shift}
% \subsection{Updating the shifts}

\section{CHOOSING THE SHIFTS} % Choosing the shifts % Applications (Special Cases)
% Recall the shifted compressed gradient estimator~\eqref{eq:shifted_gradient}. 
% Next we will use the same designation for both simple shift $s$ and structured $h$.
First, in Table~\ref{table:shifts}, we show the generality of our approach by presenting some of the existing and new distributed methods that fall into our framework of \algname{DCGD-SHIFT} with shift updates of the form~\eqref{eq:structured_shift}.
% At first in Table~\ref{table:shifts} we present some of the existing and new distributed methods which fit our general framework of \algname{DCGD-SHIFT} with shift updates of the form~\eqref{eq:structured_shift}.
% Table~\ref{table:shifts} summarizes obtained results in a simplified way for the strongly convex case. %Next we will discuss every shift update in details. %\vspace{5pt}\\
% \footnotetext{Iteration complexity of Algorithm~\ref{alg:dcgd} in line~\ref{alg:dcgd:master_shift}. For simplicity of exposition rates are presented for case $\omega_i \equiv \omega$, $L_i \equiv L$ and $p_i \equiv p$. More refined results are in Appendix.}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Draw connections to \textbf{Induced compressor} \citep{horvath2021better} $\cQ_{ind}(x) \eqdef \cC(x) + \cQ\br{x - \cC(x)}$ belongs to $\bU(\omega(1 - \delta))$ class
% for $\cC \in \bB(\delta)$ and $\cQ \in \bU(\omega)$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \paragraph{The most general gradient estimator structure} % shift
% \begin{equation}
% \boxed{
% \begin{aligned}
%     g^{k+1} &= s^{k+1} + \cQ_{ind}(\nabla f(x^{k+1}) - s^{k+1}) \\
%             &= s^{k+1} + \cC(\nabla f(x^{k+1}) - s^{k+1}) + \cQ\br{\nabla f(x^{k+1}) - s^{k+1} - \cC(\nabla f(x^{k+1}) - s^{k+1})} \\
%             &= h^{k+1} + \cQ\br{\nabla f(x^{k+1}) - h^{k+1}}
% \end{aligned}
% }
% \end{equation}
% with special shift $h^k = s^k + \cC(\nabla f(x^k) - s^k)$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%% \vspace{-10pt}


\begin{table*}[h] %\label{table:shifts}
\setlength{\tabcolsep}{8pt}
\caption{%\small
List of existing and new algorithms that fit our general framework.
\textbf{VR} -- variance reduced method. $\mathcal{O}/ \mathcal{I}$ -- zero/identity operator, 
$\mathcal{B}_{p_i}$ -- Bernoulli\protect\footnotemark~compressor. 
% $\mathcal{B}\textit{e}_{p_i}$-- Bernoulli\footnotemark~compressor. 
\textsc{DGD} refers to Distributed Gradient Descent.}% \footenotemark.
% \footenotetext[2]{$\mathcal{B}\textit{e}(p)(x) = \left\{\begin{array}{ll}\frac{x}{p} & \text { with probability } p \\ 0 & \text { with probability } 1-p\end{array}\right.$}
% Summary of results for studied shifts. %\textbf{VR} corresponds to variance reduction for noise due to compression (for this case it is equivalent to linear convergence to the exact optimum).  
% For simplicity of demonstration rates are presented for case $\omega_i \equiv \omega$, $L_i \equiv L$ and $p_i \equiv p$.}
\centering
{\def\arraystretch{1.5}%\tabrowsep=10pt
% \specialrule{1pt}{1pt}{1pt}
\begin{tabular}{ccc C{0.15\textwidth}C{0.15\textwidth}}
\toprule
%\label{table:shifts}
% {p{0.1\textwidth}p{0.8\textwidth}}
% \hline
& & & \multicolumn{2}{c}{\textbf{Shift} $h_i^{k+1} = s_i^k + \cC_i\br{\nabla f_i (x^k) - s_i^k}$} \\
% \hline
\cmidrule(r){4-5}
\textbf{Method}  &  \textbf{Reference}  &  \textbf{VR}  &  $s_i^k$  &  $\cC_i$ \\
\midrule
% \specialrule{1.5pt}{0.1pt}{0.1pt}
% \hhline{|====|}
\textsc{DCGD} & \citep{khirirat2018distributed} & \xmark & $0$ & $\mathcal{O}$ \\
% \hline
\textsc{DCGD-SHIFT} & (this work) & \xmark & $s_i^0$ & $\mathcal{O}$ \\
% \hline
\textsc{DGD} & (folklore) & \cmark & $0$ & $\mathcal{I}$ \\
% \hline
\textsc{DCGD-STAR} & (this work) & \cmark & $\nabla f_i(x^\star)$ & any $\cC_i \in \bB(\delta)$ \\
%any $\cC_i$ \\ % $\mathcal{O}$
% \hline
\textsc{DIANA} & \citep{mishchenko2019distributed} & \cmark & $h_i^k$ & $\alpha \cQ_i, \ \cQ_i \in \bU(\omega_i)$ \\
% $\alpha \cQ_i$\\
% \hline
\textsc{Rand-DIANA} & (this work) & \cmark & $h_i^k$ & $\mathcal{B}_{p_i}$ \\
% \hline
% \hhline{|===|}
% \specialrule{1.5pt}{1pt}{1pt}
\textsc{GDCI} & \citep{chraibi2019distributed} & \xmark & $x^k / \gamma$ & $\mathcal{O}$ \\
% \hline
% \textsc{VR-GDCI} & \citep{chraibi2019distributed} & \multicolumn{2}{c|}{$h_i^k - \alpha \cdot \gamma \cQ \br{\nabla f_i(x^k) - \sbr{x^k + h_i^k}/\gamma}$} \\
%%%$h_i^k - \alpha \cdot \gamma \cQ \br{\nabla f_i(x^k) - \sbr{x^k + h_i^k}/\gamma}$ \\
%%% $\max \left\{\frac{L}{\mu} + \frac{6\omega}{n} \br{\frac{L_{\max}}{\mu} - 1}, 2(\omega + 1)\right\}$
% \hline
\bottomrule
% \specialrule{1pt}{1pt}{1pt}
\end{tabular}
}
% \egor{Maybe use structured updates $s^k = h^k + \cC(\nabla f(x^k) - h^k)$ to make it look neater?}
% \egor{\algname{VR-GDCI} was obtained as \CGD with special shifted compressor.}
\label{table:shifts}
\end{table*}
% \algname{DCGD-STAR} & [\textbf{New}] & \cmark & $\nabla f_i(x^\star)$ & any $\cC_i \in \bB(\delta)$ \\ % $\mathcal{O}$
% \algname{DIANA} & \citep{mishchenko2019distributed} & \cmark & $h_i^k$ & $\alpha \cQ_i, \ \cQ_i \in \bU(\omega_i)$\\

% \vspace{-3pt}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% \paragraph{Assumptions.} 
The following assumptions are needed to analyze convergence and compare with previous results.
% Next we list assumptions required for analysis of the described method.
% We will use the following assumptions.

% \begin{assumption}[Convexity]
% Function $f: \bR^d \to \bR$ is convex iff
% \begin{equation*}
%     f(x) \geq f(y) + \left\langle\nabla f(y), x - y\right\rangle, \qquad \forall x, y \in \Rd.
% \end{equation*}
% \end{assumption}

\begin{assumption}[Strong convexity]%\footnote{This assumption can be (possibly) relaxed to quasi-strong convexity: $f(x^\star) \geq f(y) + \left\langle\nabla f(y), x^\star - y\right\rangle + \frac{\mu}{2} \sqn{x^\star - y}$, if $f$ has a unique minimizer $x^\star$.}]
Function $f: \bR^d \to \bR$ is $\mu$-strongly convex if
\begin{equation*}
    f(x) \geq f(y) + \left\langle\nabla f(y), x - y\right\rangle + \frac{\mu}{2} \sqn{x - y}, \ \forall x, y \in \Rd.
\end{equation*}
If $\mu = 0$, then the function is convex.
\end{assumption}

\begin{assumption}[Smoothness]
Function $f: \bR^d \to \bR$ is $L$-smooth if
\begin{equation*}
    f(x) \leq f(y) + \left\langle\nabla f(y), x - y\right\rangle + \frac{L}{2} \sqn{x - y}, \ \forall x, y \in \Rd.
\end{equation*}
\end{assumption}
%
% \egor{Maybe introduce assumptions in a more natural way for a more concrete setting (e.g., specialized for $f_i$ and $f$).}
%
% \egor{Should I mention/describe single node case at first to make the exposition easier?}

Now, we can provide a general convergence guarantee for Algorithm~\ref{alg:dcgd_shift} with fixed shifts
\begin{equation} \label{eq:fixed_shift}
    h^k_i \equiv h_i.
\end{equation}

\begin{theorem}[\DCGD with \texttt{fixed\,\algname{SHIFT}}] \label{thm:fixed_shift}
    Assume each $f_i$ is convex and $L_i$-smooth, and $f$ is $L$-smooth and $\mu$-strongly convex. Let $\cQ_i \in \bU(\omega_i)$ be independent unbiased compression operators.
    If the step-size satisfies 
    \begin{equation*}
        \gamma \leq \frac{1}{L + 2\max_i\br{L_i \omega_i/n}},
        % \gamma \leq 1/\sbr{L + 2\max_i\br{L_i \omega_i/n}}.
        % \gamma \leq \frac{1}{L + \max_i\br{L_i \omega_i/n}}, 
        % \gamma \leq \frac{1}{L + \max_i\br{L_i \omega_i(1 - \delta_i)/n}},
        %2\sbr{L + \max_i\br{L_i \omega_i/n}}
    \end{equation*}
    then the iterates of Algorithm~\ref{alg:dcgd_shift} with fixed shifts $h_i^k \equiv h_i$ satisfy
    \begin{equation} \label{eq:fixed_shift_res}
    \begin{aligned}
        \E\sqN{x^k - x^\star} &\leq (1 - \gamma \mu)^k \sqn{x^0 - x^\star} \\ &\quad \ + 
        \frac{2 \gamma}{\mu} \frac{1}{n}\sum_{i=1}^n \frac{\omega_i}{n} \sqN{\nabla f_i (x^\star) - h_i}.
        %\frac{2 \gamma \omega}{\mu} \sqN{\nabla f (x^\star) - h}
        \end{aligned}
    \end{equation}
\end{theorem}
% \textbf{Proof in the Appendix.}
% \egor{I will include the proof in the Appendix.}
% \begin{proof}
%     Follows from CS 331 Lectures Lemma 45. \peter{You need to type the proof and include it in the appendix.}
% \end{proof}
% \vspace{-5pt}
This theorem establishes a linear convergence rate up to a certain oscillation radius, controlled by the average distance of the shift vectors $h_i$ to the optimal local gradients $\nabla f_i(x^\star)$ multiplied by the step-size $\gamma$. This means that in the interpolation/\textbf{overparameterized regime} ($\nabla f_i(x^\star) = 0$ for all $i$), the method reaches the \textbf{exact solution} with zero shifts $h_i^0 = 0$. %Otherwise, for reaching the optimal solution, we have to use decreasing step-sizes $\gamma^k$ and lose linear convergence rate.
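More explicitly, letting $k \to \infty$ in~\eqref{eq:fixed_shift_res} (a short check, using only that $0 < \gamma \mu \leq 1$ under the stated step-size condition), the first term vanishes and only the neighborhood remains:
\begin{equation*}
    \limsup_{k \to \infty} \E\sqN{x^k - x^\star} \leq \frac{2 \gamma}{\mu n^2} \sum_{i=1}^n \omega_i \sqN{\nabla f_i (x^\star) - h_i}.
\end{equation*}
The right-hand side is zero whenever $h_i = \nabla f_i(x^\star)$ for every worker with $\omega_i > 0$.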
%
% This method works perfectly in the interpolation regime: $\nabla f_i(x^\star) = 0$ (for all $i$) if we choose zero shifts $h_i^0 = 0$. Otherwise, if we want to reach the optimal solution, we have to use decreasing step-sizes $\gamma^k$ and lose linear convergence rate.
% for a general case without interpolation.

In the following subsections, we study how the shifts can be formed to guarantee linear convergence to the exact optimum. We start by introducing the practically useless but theoretically insightful \algname{DCGD-STAR}, and then move on to implementable algorithms that learn the optimal shifts.

% Distributed Compressed Gradient Descent with optimal Shifting
\subsection{Optimal Shifts}% Optimally shifted compressed gradients (\algname{DCGD-SHIFT$^\star$}/\algname{STAR})}
% \egor{This can be introduced through the analysis of \CGD like in the CS 331 Lectures. Maybe it can be done in the Appendix?}
Assume, for the sake of argument, that we know the values $\nabla f_i (x^\star)$ for every $i \in [n]$. Then we can construct a sequence of optimally shifted compressed shift updates of the form~\eqref{eq:structured_shift}:
% Works in theory but not in practice.
% \subsubsection{Shifted compressed optimal shift}
% \paragraph{Shifted compressed optimal shift}
\begin{equation} \label{eq:opt_shift}
    h_i^{k+1} = \nabla f_i (x^\star) + \cC_i(\nabla f_i (x^k) - \nabla f_i (x^\star)).
\end{equation}
This is enough to fully specify Algorithm~\ref{alg:dcgd_shift} and obtain the following convergence guarantee: %For this kind of method we have the following convergence guarantee.
\begin{theorem}[\algname{DCGD-STAR}] \label{thm:optimal_shift}
    Assume each $f_i$ is convex and $L_i$-smooth, and $f$ is $L$-smooth and $\mu$-strongly convex. Let $\cQ_i \in \bU(\omega_i), \cC_i \in \bB(\delta_i)$ be independent compression operators.
    If the step-size satisfies 
    \begin{equation} \label{eq:stepsize_optimal}
    % \textstyle 
        % \gamma \leq 1/\sbr{L + \max_i\br{L_i \omega_i(1 - \delta_i)/n}},%^{-1},
        \gamma \leq \frac{1}{L + \max_i\br{L_i \omega_i(1 - \delta_i)/n}},
    \end{equation} %where $\delta_i$ should be interpreted as zero for $\cC_i \equiv 0$.
    then the iterates of \DCGD with \textbf{optimally shifted compressed shift} update~\eqref{eq:opt_shift} satisfy
    \begin{equation*} 
    % \textstyle
        \E\sqN{x^k - x^\star} \leq (1 - \gamma \mu)^k \sqn{x^0 - x^\star}.
    \end{equation*}
\end{theorem}
This is the first algorithm presented here with linear convergence to the exact solution in the general \emph{non-overparameterized case}.
Notice that for zero operators $\cC_i \equiv 0$ we obtain the simplest optimal shift $h_i = \nabla f_i (x^\star)$, and the term $\delta_i$ in~\eqref{eq:stepsize_optimal} should be interpreted as zero.
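As a quick sanity check, the two extreme choices in~\eqref{eq:opt_shift} can be compared side by side (a sketch, assuming the usual convention that the identity operator $\mathcal{I}$ is contractive with parameter $\delta_i = 1$):
\begin{equation*}
\begin{aligned}
    \cC_i \equiv \mathcal{O}: &\quad h_i^{k+1} = \nabla f_i(x^\star), && \gamma \leq \frac{1}{L + \max_i\br{L_i \omega_i/n}}, \\
    \cC_i \equiv \mathcal{I}: &\quad h_i^{k+1} = \nabla f_i(x^k), && \gamma \leq \frac{1}{L}.
\end{aligned}
\end{equation*}
The second line recovers the shift and the step-size of \textsc{DGD} from Table~\ref{table:shifts}, at the price of communicating uncompressed gradients.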

The issue with the described method is that, in general, we do not know the values $h_i^\star \eqdef \nabla f_i(x^\star)$ (unless the problem is overparameterized), which makes the method impractical.
% \egor{Say about communication of auxiliary vectors to the master.}

\footnotetext{$\mathcal{B}_p (x) \eqdef \left\{\begin{array}{ll} x & \text { with probability } p \\ 0 & \text { with probability } 1-p\end{array}\right.$}

\subsection{Learning the Optimal Shifts} \label{section:learn_shift} % \DCGD with learned Shifts
% \algname{DCGD-SHIFT$^\star$} suffers from the main 2 issues
We need to design the sequences $\{h_1^k\}_{k\geq0}, \dots, \{h_n^k\}_{k\geq0}$ in such a way that they all converge to the optimal shifts:
\begin{equation*} %\textstyle
\squeeze
    h_i^k \to \nabla f_i (x^\star) \quad \text{ as } \quad  k \to \infty.
\end{equation*}
However, at the same time, we do not want to send uncompressed vectors from workers to the master. So, the challenge is not only learning the shifts, but doing so in a communication-efficient way. We present two different solutions to this problem in this work.

\subsubsection{\algname{DIANA}-like Trick} % Adjusting \DIANA-trick
Our first approach is based on the celebrated \DIANA \citep{mishchenko2019distributed, horvath2019stochastic} algorithm:
% Recovers original \DIANA 
% \paragraph{Standard \algname{DIANA}}
\begin{equation} \label{eq:diana_shift}
\begin{aligned}
\squeeze
    h_i^{k+1} &= h_i^k + \alpha \big[\cC_i(\nabla f_i(x^k) - h_i^k) \\& \ + 
    \cQ_i\br{\nabla f_i(x^k) - h_i^k - \cC_i(\nabla f_i(x^k) - h_i^k)}\big],
    %\cQ(\nabla f_i (x^k) - h_i^k),
\end{aligned}
\end{equation}
where $\alpha$ is a suitably chosen step-size. For $\cC_i \equiv 0$, it takes the simplified form
\begin{equation} \label{eq:diana_shift_2}
% \textstyle
\squeeze
    h_i^{k+1} = h_i^k + \alpha \cQ_i\br{\nabla f_i(x^k) - h_i^k}.
\end{equation}
% You may ask why this update makes sense? 
This recursion resolves both of the issues raised earlier.
First, the sequence $h_i^k$ indeed converges to the optimal shifts $\nabla f_i(x^\star)$, which is formalized in Theorem~\ref{thm:diana_shift} presented later. Second, the shift on the master $h^{k+1} = 
    \frac{1}{n} \sum_{i=1}^n h_i^{k+1}$ is updated as follows:
\begin{equation*} %\textstyle
\begin{aligned} %\textstyle
    h^{k+1} &= 
    \frac{1}{n} \sum_{i=1}^n \Big\{h_i^k + \alpha \big[\cC_i(\nabla f_i(x^k) - h_i^k) \\&\qquad\ + 
    \cQ_i\br{\nabla f_i(x^k) - h_i^k - \cC_i(\nabla f_i(x^k) - h_i^k)} \big]\Big\} 
    \\&=
    \frac{1}{n} \sum_{i=1}^n h_i^k + \alpha \frac{1}{n} \sum_{i=1}^n \left\{c_i^k + m_i^k\right\} \\&= 
    h^k + \alpha \br{c^k + m^k},
\end{aligned}
\end{equation*}
which requires aggregation of the compressed vectors $c_i^k \eqdef \cC_i(\nabla f_i(x^k) - h_i^k)$ and $m_i^k \eqdef \cQ_i\br{\nabla f_i(x^k) - h_i^k - c_i^k}$ from the workers. In the case of update~\eqref{eq:diana_shift_2}, it is not even needed to send anything in addition to the messages $m_i^k$ required by default in Algorithm~\ref{alg:dcgd_shift}. % are sent by default, as they form the gradient estimate $g^k$.
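For instance, in the simplified case $\cC_i \equiv 0$ we have $c_i^k = 0$, and the master update above reduces to
\begin{equation*}
    h^{k+1} = h^k + \alpha m^k, \qquad m^k \eqdef \frac{1}{n} \sum_{i=1}^n m_i^k = \frac{1}{n} \sum_{i=1}^n \cQ_i\br{\nabla f_i(x^k) - h_i^k},
\end{equation*}
so the only quantity the server needs for the shift update is the average of the messages it already receives.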

Furthermore, the simplified recursion~\eqref{eq:diana_shift_2} can be interpreted as one step of Compressed Gradient Descent (\algname{CGD}) with step-size $\alpha$
applied to the following optimization problem:
\begin{equation*} %\label{eq:}
% \textstyle
    \max_{h_{i} \in \Rd} \left[ \phi_{i}^{k}\left(h_{i}\right) \eqdef -\frac{1}{2}\left\|h_{i}-\nabla f_{i}(x^{k})\right\|^{2} \right],
\end{equation*}
whose objective is in fact $1$-smooth and $1$-strongly concave. In this way, $h_i^{k+1}$ keeps track of the latest local gradient and produces a better estimate than the previous shift $h_i^k$.
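To spell out this interpretation (a short derivation, using only the unbiasedness of $\cQ_i$): the gradient of $\phi_i^k$ is $\nabla \phi_i^k(h_i) = \nabla f_i(x^k) - h_i$, so one compressed gradient ascent step from $h_i^k$ reads
\begin{equation*}
\begin{aligned}
    h_i^{k+1} &= h_i^k + \alpha \cQ_i\br{\nabla \phi_i^k(h_i^k)} = h_i^k + \alpha \cQ_i\br{\nabla f_i(x^k) - h_i^k}, \\
    \E\sbr{h_i^{k+1} \mid x^k, h_i^k} - \nabla f_i(x^k) &= (1 - \alpha) \br{h_i^k - \nabla f_i(x^k)}.
\end{aligned}
\end{equation*}
That is, for $\alpha \in (0, 1]$ the shift moves, in expectation, a fraction $\alpha$ of the way toward the current local gradient.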
% Communication trade-off

% A very important benefit of this method is that it does not need to transmit any additional information apart from compressed difference, because shift update is performed on every iteration and server can do it with the same vector as the one used for constructing the gradient estimator.


Now we present the convergence result for Algorithm~\ref{alg:dcgd_shift} with the shift learning procedure described above.
\begin{theorem}[Generalized \DIANA] \label{thm:diana_shift}
    Assume each $f_i$ is convex and $L_i$-smooth, and $f$ is $\mu$-strongly convex. Let $\cQ_i \in \bU(\omega_i), \cC_i \in \bB(\delta_i)$ be independent compression operators.
    If the step-sizes for all $i$ satisfy
    % \begin{equation*}
    %     \alpha \leq \frac{1}{1+\omega_i(1 - \delta_i)} %\text{ (for all } i)
    %     ,
    %     \gamma \leq \frac{1}{\frac{2}{n} \max_i\br{\omega_i L_i} + (1 + \alpha M) L_{\max}},
    %     % \gamma \leq \frac{1}{\br{1 + 2\omega(1 - \delta)/n + \alpha M}L_{\max}},
    % \end{equation*}
    \begin{equation*}
    \begin{aligned}
        \alpha &\leq \frac{1}{1+\omega_i(1 - \delta_i)} %\text{ (for all } i)
        ,\\
        \gamma &\leq \frac{1}{\frac{2}{n} \max_i\br{\omega_i L_i} + (1 + \alpha M) L_{\max}},
        % \gamma \leq \frac{1}{\br{1 + 2\omega(1 - \delta)/n + \alpha M}L_{\max}},
    \end{aligned}
    \end{equation*}
    where $L_{\max} \eqdef \max_i L_i, M > 2/(n \alpha)$ and $\delta_i$ should be interpreted as zero for $\cC_i \equiv 0$,
    % Then the iterates of \DIANA with shifted compressed shift satisfy
    then the iterates of \DCGD with the \DIANA-like shift update~\eqref{eq:diana_shift} satisfy
    \begin{equation*}
    \E V^k \leq \max \left\{(1-\gamma \mu)^k, \left(1 - \alpha + \frac{2}{n M}\right)^k\right\} V^0,
    \end{equation*}
    where the Lyapunov function $V^k$ is defined by
    \begin{equation*} %\label{eq:lyapunov}
    \textstyle
        V^k \eqdef \left\|x^k - x^\star\right\|^2 + M \gamma^2 \cdot \frac{1}{n}  \sum\limits_{i=1}^{n} \omega_i \left\|h_i^k - \nabla f_i (x^\star)\right\|^2.%\sigma^k.
    \end{equation*}
\end{theorem}
Our result represents an improvement over the original \DIANA in several ways. First, we use much more general shift updates involving $\cC_i$, which allow biased operators to be used for learning the optimal shifts. Second, one can use different compressors $\cQ_i$, which can be particularly beneficial when different workers have different bandwidths/connection speeds to the master. Thus, the slower workers can compress more, and therefore use operators with higher $\omega_i$. At the same time, the opposite makes sense for \say{faster} workers.

\subsubsection{Randomized DIANA (\algname{Rand-DIANA})}

Recall the original problem stated in Section~\ref{section:learn_shift}:
% We need to design the sequences $\{h_1^k\}_{k\geq0}, \dots, \{h_n^k\}_{k\geq0}$ in such a way that all of them converge to the optimal shifts:
\begin{equation*}
    \text{\textbf{design sequences} } \{h_i^k\}_{k\geq0} \ \text{ such that } \ h_i^k \to \nabla f_i (x^\star). 
    %\quad \text{ as } \quad  k \to \infty.
\end{equation*} % you can come up with the simplest possible solution
The simplest possible solution would be just to set $h_i^k$ to $\nabla f_i(x^k)$ because if $x^k \to x^\star$ in the optimization process, then $\nabla f_i(x^k)$ converges to the optimal local shift. However, this approach is not efficient, as workers have to transfer full (uncompressed) vectors $h_i^k = \nabla f_i(x^k)$. Our alternative to the \DIANA solution is to update a reference point $w_i^k$ for calculating the shift $h_i^k = \nabla f_i(w_i^k)$ infrequently (with a small probability $p_i \in (0, 1]$), so that $h_i^k$ needs to be communicated very rarely:
% in order to communicate 
\begin{equation} \label{eq:r_diana_shift}
\begin{aligned} 
    h_i^k &= \nabla f_i (w_i^k), \\
    w_i^{k+1} &= \left\{\begin{array}{ll}
                    x^k & \text{with probability } p_i, \\ 
                    w_i^k & \text{with probability } 1 - p_i.
                  \end{array}\right.
\end{aligned}
\end{equation}
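Equivalently (a short check, using the Bernoulli compressor $\mathcal{B}_{p_i}$ from Table~\ref{table:shifts}), the shifts themselves follow the structured update with $s_i^k = h_i^k$:
\begin{equation*}
    h_i^{k+1} = h_i^k + \mathcal{B}_{p_i}\br{\nabla f_i(x^k) - h_i^k} =
    \left\{\begin{array}{ll}
        \nabla f_i(x^k) & \text{with probability } p_i, \\
        h_i^k & \text{with probability } 1 - p_i,
    \end{array}\right.
\end{equation*}
so a worker refreshes, and hence has to communicate, its shift on average once every $1/p_i$ iterations.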
% \begin{center}
% \begin{alignat*}{3} \label{eq:rand_diana}
%     h_i^k &= \nabla f_i (w_i^k) \\
%     w_i^{k+1} &= \left\{\begin{array}{ll}
%                     x^k \, \text { with probability } p_i \\ 
%                     w_i^k \text { with probability } 1 - p_i
%                   \end{array}\right.
% \end{alignat*}
% \end{center}
This method has a remarkably simpler analysis than \DIANA, yet it still solves the original problem of eliminating the variance introduced by gradient compression. Next, we state the convergence result for \DCGD with shifts updated in a \texttt{randomized} fashion~\eqref{eq:r_diana_shift}. We named it \algname{Randomized-DIANA} (\RDIANA for short) to acknowledge the original method \citep{mishchenko2019distributed}, which was the first to solve this problem.

\begin{theorem}[\algname{Rand-DIANA}] \label{thm:r_diana_shift}
Assume that each $f_i$ is convex and $L_i$-smooth, and $f$ is $\mu$-strongly convex. If the step-size satisfies
\begin{equation*}
    \gamma \leq \frac{1}{\left(1+\frac{2 \omega}{n}\right) L_{\max} + M \max_i(p_i L_i)},
\end{equation*}
where $M > \frac{2 \omega}{n p_{m}}$ and $p_{m} \eqdef \min_i p_i$,
then the iterates of \DCGD with the \algname{Randomized-DIANA} shift update~\eqref{eq:r_diana_shift} satisfy
\begin{equation*}
    \E V^k \leq \max \left\{(1-\gamma \mu)^k, \left(1 - p_{m} + \frac{2\omega}{nM}\right)^k\right\} V^0,
\end{equation*}
where the Lyapunov function $V^k$ is defined by
\begin{equation*}
% \textstyle
    V^k \eqdef \left\|x^k - x^\star\right\|^2 + M \gamma^2 \cdot \frac{1}{n}  \sum\limits_{i=1}^{n} \left\|h_i^k - \nabla f_i (x^\star)\right\|^2.%\sigma^k.
\end{equation*}
\end{theorem}
Through an appropriate choice of the parameters, $M = \frac{4 \omega}{n p_{m}}$ and $p_i \equiv p = \frac{1}{\omega + 1}$ for every $i$, we obtain essentially the same iteration complexity (up to constant factors) as the original \DIANA \citep{horvath2019stochastic}:
\begin{equation*}
% \textstyle
    \max \left\{\frac{1}{\gamma \mu}, \frac{1}{p_{m} - \frac{2 \omega}{n M}}\right\} %\log \frac{1}{\varepsilon}
    = \max \left\{\frac{L_{\max}}{\mu}\br{1 + \frac{\omega}{n}}, \omega + 1\right\}. %\log \frac{1}{\varepsilon}.
\end{equation*}
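For completeness, here is how these parameter choices enter the bounds of Theorem~\ref{thm:r_diana_shift} (a sketch that takes the largest admissible step-size, assumes $\omega > 0$, and tracks the numerical constants that the displayed complexity suppresses). With $p_i \equiv p = p_m$ and $M = \frac{4 \omega}{n p}$,
\begin{equation*}
\begin{aligned}
    M \max_i (p_i L_i) &= \frac{4 \omega}{n} L_{\max},
    & \frac{1}{\gamma \mu} &= \br{1 + \frac{6 \omega}{n}} \frac{L_{\max}}{\mu}, \\
    p_m - \frac{2 \omega}{n M} &= \frac{p}{2},
    & \frac{1}{p_m - \frac{2 \omega}{n M}} &= 2 (\omega + 1).
\end{aligned}
\end{equation*}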


\subsection{Compressing the Iterates} \label{section:compressed_iter}

In this section, we discuss how the shifted compression framework can be applied to obtain improved results in the case where the iterates/models themselves need to be compressed.

Let $\cQ \in \bU(\omega)$.
Consider the following compressor shifted by the vector $x/\gamma$:
\begin{equation*}
    \hat{\cQ}(z) \eqdef \frac{x}{\gamma} + \cQ\br{z - \frac{x}{\gamma}},
\end{equation*}
which clearly belongs to the class $\bU(\omega; x/\gamma)$.
% which has the same variance as $\cQ$.
Since for $\gamma \neq 0$ the compressor $\bar{\cQ}(z) \eqdef - \frac{1}{\gamma} \cdot \cQ\br{-\gamma z}$ belongs to $\bU(\omega)$, we can transform $\hat{\cQ}$ into the operator
\begin{equation*} %\label{eq:shifted_iter}
%\begin{aligned}
    \tilde{\cQ} (z) \eqdef 
    \frac{x}{\gamma} + \bar{\cQ}\br{z - \frac{x}{\gamma}} = 
%    \frac{x}{\gamma} - \frac{1}{\gamma} \cdot \cQ\br{-\gamma \sbr{z-\frac{x}{\gamma}}} \\&= 
    \frac{1}{\gamma} \sbr{x - \cQ(x - \gamma z)},
%\end{aligned}
\end{equation*}
which also belongs to $\bU(\omega; x/\gamma)$ and is helpful for analyzing algorithms with compressed iterates.
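The membership of $\tilde{\cQ}$ can be verified directly (a short check, assuming $\bU(\omega; h)$ denotes the unbiased operators whose variance at $z$ is bounded by $\omega \sqn{z - h}$). Writing $u \eqdef x - \gamma z$,
\begin{equation*}
\begin{aligned}
    \E \tilde{\cQ}(z) &= \frac{1}{\gamma} \sbr{x - \E \cQ(u)} = \frac{1}{\gamma} \sbr{x - u} = z, \\
    \E \sqN{\tilde{\cQ}(z) - z} &= \frac{1}{\gamma^2} \E \sqN{\cQ(u) - u} \leq \frac{\omega}{\gamma^2} \sqn{u} = \omega \sqN{z - \frac{x}{\gamma}}.
\end{aligned}
\end{equation*}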

% \begin{equation*}
%   \E \sqN{\tilde{\cQ} (z) - z} \leq \omega \sqN{z - \frac{x}{\gamma}}
% \end{equation*}

\textbf{Distributed Gradient Descent with Compressed Iterates (\GDCI)} % Single node \GDCI 
was first analyzed by \cite{khaled2019gradient} in the single-node setting and was subsequently relaxed and formulated in a convenient form by \cite{chraibi2019distributed}:
\begin{equation} \label{eq:GDCI_basic}
    x^{k+1} = (1 - \eta) x^k + \eta \cQ\br{x^k - \gamma \nabla f(x^k)}. \tag{\GDCI} %1 node 
\end{equation}
This algorithm can be reformulated using the previously described shifted compressor $\tilde{\cQ} \in \bU(\omega; x^k/\gamma)$ as
\begin{equation*}
\begin{aligned}
    x^{k+1} &= 
    x^k - \br{\eta \gamma} \frac{1}{\gamma} \sbr{x^k - \cQ \br{x^k - \gamma \nabla f(x^k)}}
    \\&= 
    x^k - (\eta \gamma) \tilde{\cQ}^k(\nabla f(x^k)),%g^k
\end{aligned}
\end{equation*}
which for the distributed case takes the form
\begin{equation} \label{eq:distributed_gdci}
    x^{k+1} = 
    (1 - \eta) x^k + \eta \frac{1}{n} \sum_{i=1}^n \cQ_i \br{x^k - \gamma \nabla f_i(x^k)}.
    % x^k - \br{\eta \gamma} \frac{1}{\gamma} \sbr{x^k - \frac{1}{n} \sum_{i=1}^n \cQ_i \br{x^k - \gamma \nabla f_i(x^k)}}.
    % = x^k - (\eta \gamma) \tilde{\cQ}^k(\nabla f(x^k)),%g^k
\end{equation}
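In terms of shifted compressors, and mirroring the single-node reformulation above (a sketch, with $\tilde{\cQ}_i^k(z) \eqdef \frac{1}{\gamma} \sbr{x^k - \cQ_i\br{x^k - \gamma z}}$ introduced here as the per-worker analogue of $\tilde{\cQ}^k$), this update can be written as
\begin{equation*}
    x^{k+1} = x^k - (\eta \gamma) \, \frac{1}{n} \sum_{i=1}^n \tilde{\cQ}_i^k\br{\nabla f_i(x^k)},
\end{equation*}
that is, as a step of \DCGD with step-size $\eta \gamma$ and shifts $x^k / \gamma$.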
% \begin{equation}
% \begin{aligned}
%     x^{k+1} &= (1 - \eta) x^k + \eta \cQ \br{x^k - \gamma \nabla f(x^k)} 
%     \\ &= 
%     x^k - \br{\eta \gamma} \frac{1}{\gamma} \left[x^k - \cQ \br{x^k - \gamma \nabla f(x^k)}\right]
%     \\ &= 
%     x^k - (\eta \gamma) g^k
% \end{aligned}
% \end{equation}
%
% for $g^k = \frac{1}{\gamma} \left[x^k - \cQ \br{x^k - \gamma \nabla f(x^k)} \right] = \tilde{\cQ}^k(\nabla f(x^k))$.
%%%%% \vspace{-4pt}
\begin{figure*}[!h]
\centering
\begin{subfigure}{\textwidth}
    % \includegraphics[width=0.5\linewidth]{plots/distributed/rand_diana-randk_q-0.8.pdf}
    \includegraphics[width=0.5\linewidth]{plots/distributed/rand_diana/ridge-synthetic_spars.pdf}
    \includegraphics[width=0.5\linewidth]{plots/distributed/rand_diana/ridge-synthetic_nat-dith.pdf}
    % \includegraphics[width=0.5\linewidth]{plots/distributed/rand_diana/ridge-synthetic_vary-b.pdf}
    % \includegraphics[width=0.5\linewidth]{plots/distributed/rand_diana/ridge-synthetic_spars-avg_b.pdf}
\end{subfigure}
\caption{Comparison of \DIANA and \algname{Randomized-DIANA}. \textbf{Left plot}: methods equipped with \texttt{Rand-K} for different $q$ values.
\textbf{Right plot}: selected results of a grid search for the \texttt{ND} parameter $s$ over $\{2, \dots, 20\}$.}
\label{fig:r_diana-diana}
\end{figure*}
% \footenotetext{$[k] \eqdef \{2, \dots, k\}$ in this cases}

The essence of this method is the compression of the local workers' iterates $\cQ_i \br{x^k - \gamma \nabla f_i(x^k)}$, their aggregation on the master, and a convex combination with the previous model. Next, we establish linear convergence up to a neighborhood caused by the variance of the compression operators (similarly to \DCGD with fixed shifts; see Theorem~\ref{thm:fixed_shift}).

\begin{theorem}[\GDCI] \label{thm:gdci}
    Assume each $f_i$ is convex and $L_i$-smooth, and $f$ is $L$-smooth and $\mu$-strongly convex. Let $\cQ_i \in \bU(\omega)$ be independent compression operators.
    If the step-sizes satisfy
    \begin{equation*}
        \eta \leq \sbr{\frac{L}{\mu} + \frac{2\omega}{n}\br{\frac{L_{\max}}{\mu} - 1}}^{-1}, %\qquad
        \gamma \leq \frac{1 + 2\eta\omega/n}{\eta\br{L + 2 L_{\max}\omega/n}},
        % \frac{1}{L + \max_i\br{L_i \omega_i(1 - \delta_i)/n}}
    \end{equation*}
    then the iterates of the Distributed \GDCI \eqref{eq:distributed_gdci} 
    satisfy
    \begin{equation} \label{eq:converge_gdci}
    \begin{aligned}
        \E\sqN{x^k - x^\star} &\leq (1 - \eta)^k \sqn{x^0 - x^\star} \\& \ \ + 
        \eta \frac{2 \omega}{n} \frac{1}{n}\sum_{i=1}^n \sqN{x^\star - \gamma \nabla f_i (x^\star)}.
    \end{aligned}
\end{equation}
\end{theorem}
In the interpolation regime ($\nabla f_i(x^\star) = 0 = x^\star - \gamma \nabla f_i(x^\star)$, for every $i$) this result matches the complexity of \DCGD with fixed shifts~\eqref{eq:fixed_shift_res} 
\begin{equation*}
    \tilde{\cO}\br{\kappa \br{1 + \omega/n}}
\end{equation*}
and improves over the original rate of \GDCI by \citet{chraibi2019distributed} analyzed for fixed point problems and specialized for gradient mappings:
\begin{equation*}
    \tilde{\cO}\br{\kappa\max\left\{1, \kappa \omega/n\right\}} \gtrsim \tilde{\cO}\br{\kappa^2 \omega/n}.
\end{equation*}
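To see where the first rate comes from (a sketch, assuming $\kappa$ stands for $L_{\max}/\mu$ and using $L \leq L_{\max}$): with the largest step-size $\eta$ allowed by Theorem~\ref{thm:gdci}, the contraction factor $(1 - \eta)^k$ in~\eqref{eq:converge_gdci} gives
\begin{equation*}
    \frac{1}{\eta} = \frac{L}{\mu} + \frac{2 \omega}{n} \br{\frac{L_{\max}}{\mu} - 1} \leq \frac{L_{\max}}{\mu} \br{1 + \frac{2 \omega}{n}}.
\end{equation*}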
% to lack of space in the main part of the paper
Due to space limitations, the results for {\bf Distributed Variance-Reduced Gradient Descent with Compressed Iterates (\algname{VR-GDCI})}, which eliminates the neighborhood in~\eqref{eq:converge_gdci}, along with detailed proofs of all stated theorems are presented in the Supplementary Material. \label{section:vr_gdci}

\iffalse
\subsubsection{\algname{VR-GDCI}}
Single-node VR-\GDCI has the following update form
\begin{equation}
\begin{aligned}
    x^{k+1} &= x^k - \eta \left(x^k - h^k -\delta^k\right) \\&= 
    x^k - (\eta\gamma) \frac{1}{\gamma} \left(x^k - h^k - \cQ\br{x^k - h^k - \gamma \nabla f(x^k)}\right) \\&= 
    x^k - (\eta\gamma) \tilde{\cQ}^k (\nabla f(x^k)); \\
    h^{k+1} &= 
    h^k - \alpha \cQ(x^k - h^k - \gamma \nabla f(x^k)).
\end{aligned}
\end{equation}

\begin{theorem} \label{thm:vr_gdci}
Let $\Psi^k$ be the following Lyapunov function:
\begin{equation}
    \Psi^k \eqdef \sqn{x^k - x^\star} + \frac{4 \eta^2 \omega}{\alpha} \frac{1}{n} \sum_{i=1}^n \sqN{h_i^k - (x^\star - \gamma \nabla f_i (x^\star))}.
\end{equation}
Assume each $f_i$ is convex and $L_i$-smooth, and $f$ is $L$-smooth and $\mu$-strongly convex. Let $\cQ_i \in \bU(\omega)$ - independent compression operators. 
Choose the step-sizes $\alpha, \eta, \gamma$ such that 
\begin{equation*}
    \alpha \leq \frac{1}{\omega + 1}, \qquad
    \eta = \sbr{\frac{L}{\mu} + \frac{6\omega}{n} \br{\frac{L_{\max}}{\mu} - 1}}^{-1}, \qquad
    \gamma \leq \frac{1 + 6\eta\omega/n}{\eta\br{L + 6 L_{\max}\omega/n}}
\end{equation*}
Then the iterates of Distributed \algname{VR-GDCI} satisfy
\begin{equation} %\label{eq:thm-GDCI-VR-1}
    \E \Psi^k \leq \br{1 - \min\left\{\frac{\alpha}{2}, \eta \right\}}^k \sqn{x^0 - x^\star} \Psi^0.
\end{equation}
\end{theorem}
\paragraph{Iteration complexity:} 
To reach $\varepsilon$ accuracy, the method requires
\begin{equation}
    k \geq \max \left\{\frac{L}{\mu} + \frac{6\omega}{n} \br{\frac{L_{\max}}{\mu} - 1}, 2(\omega + 1)\right\} \log \nicefrac{1}{\eps}
\end{equation}

\fi

%%%%% \vspace{-3pt}
\begin{figure*}[h]
\centering
\begin{subfigure}{\textwidth}
    \includegraphics[width=0.5\linewidth]{plots/distributed/rand_diana/ridge-synthetic_vary-b.pdf}
    \includegraphics[width=0.5\linewidth]{{plots/distributed/rand_diana/ridge-synthetic_sparse-q-0.1_vary-p}.pdf}
    % \includegraphics[width=0.5\linewidth]{plots/distributed/rand_diana/ridge-synthetic_sparse-q-0.5_vary-p.pdf}
    % \includegraphics[width=0.5\linewidth]{plots/distributed/rand_diana/ridge-synthetic_sparse-q-0.7_vary-p.pdf}
\end{subfigure}
\caption{Study of the stability and performance of \RDIANA with varying parameters $b$ and $p$.}
\label{fig:rand_diana}
%%%%% \vspace{-13pt}
\end{figure*}


\section{EXPERIMENTS} \label{section:exp}
% ~\footnote{Data was generated using method \href{https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html}{sklearn.datasets.make\_regression}.}
In this section, we present some of the experimental results obtained. The remainder of the results (including real-world data and other models) are available in the Supplementary Material.
% Section \ref{sec:additional_experiments} of the Supplementary Material. 
To provide evidence that our theory translates into observable predictions, we focus on well-controlled settings that satisfy the assumptions in our work.

Consider a classical ridge-regression optimization problem
% \vspace{-3pt}
% ($n = d = 100$) randomly generated
\begin{equation*} %\label{eq:ridge_problem}
    \min_{x \in \mathbb{R}^d} \left[f(x) \eqdef 
    \frac{1}{2} \| \mathrm{A} x - y \|^2 + \frac{\lambda}{2} \|x\|^2 \right],
\end{equation*}
% \vspace{-3pt}
where $\lambda = 1/m$ and $\mathrm{A} \in \bR^{m \times d}, y \in \bR^m$ are generated using the \href{https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html}{sklearn.datasets.make\_regression} method of the Scikit-learn library \citep{scikit-learn} with default parameters for $m = 100, d = 80$. The resulting data is distributed evenly and uniformly at random among 10 workers.
To compare the selected algorithms, we plot the logarithm of the relative argument error $\log \br{\|x^k-x^\star\|^2 / \|x^0-x^\star\|^2}$ on the vertical axis, while the horizontal axis shows the
number of communicated bits needed to reach a certain error tolerance $\eps$. The entries of the starting point $x^0\in\Rd$ are sampled from the normal distribution $\mathcal{N}(0, 10)$.
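For reference, this objective has closed-form derivatives and a closed-form minimizer (standard facts for ridge regression, stated here with $\mathrm{I}$ denoting the $d \times d$ identity matrix):
\begin{equation*}
    \nabla f(x) = \mathrm{A}^\top (\mathrm{A} x - y) + \lambda x, \quad
    \nabla^2 f(x) = \mathrm{A}^\top \mathrm{A} + \lambda \mathrm{I}, \quad
    x^\star = \br{\mathrm{A}^\top \mathrm{A} + \lambda \mathrm{I}}^{-1} \mathrm{A}^\top y.
\end{equation*}
Hence $f$ is $L$-smooth and $\mu$-strongly convex with $L = \lambda_{\max}(\mathrm{A}^\top \mathrm{A}) + \lambda$ and $\mu = \lambda_{\min}(\mathrm{A}^\top \mathrm{A}) + \lambda \geq \lambda > 0$, and the relative error can be evaluated exactly.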

In our simulations we thoroughly examine the \RDIANA method, which is presented for the first time. Extensive studies of the methods with compressed iterates can be found in the works by \cite{khaled2019gradient, chraibi2019distributed}.

% \subsection{\algname{Rand-DIANA}}
\subsection{\algname{Randomized-DIANA} vs \DIANA} \label{exp:r_diana_diana}
In the first set of experiments, we compare \RDIANA and \DIANA with different compressors $\cQ_i$ ($\cC_i \equiv 0$) and various operator parameters. The results are summarized in Figure~\ref{fig:r_diana-diana}.
% \footnotetext{Rand-K is defined as $\cC(x) = \frac{d}{K} \sum_{i \in S} x_{i} e_{i}$, where $S$ is a random subset of $[d]$ sampled from the uniform distribution on the all subset of $[d]$ with cardinality $K$. Rand-K is unbiased and belongs to $\bU\br{\frac{d}{K}+1}$.}
We write $q \eqdef k/d$ for the fraction of coordinates kept (not zeroed out) by the \texttt{Random sparsification} (\texttt{Rand-K}) operator, and $s$ for the number of levels of the \texttt{Natural Dithering} (\texttt{ND}) compressor \citep{horvath2019natural}. The parameter $p$ of \RDIANA was set to $1/(\omega+1)$ in every run.
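As an illustration, assume the standard variance-based definition of an unbiased compressor, $\mathbb{E}\,\cQ(x) = x$ and $\mathbb{E}\|\cQ(x) - x\|^2 \leq \omega \|x\|^2$ (the definition used in the main text may be parametrized differently). Under this assumption, \texttt{Rand-K} satisfies the bound with $\omega = d/k - 1 = 1/q - 1$, so the choice of $p$ reduces to
\begin{equation*}
    p = \frac{1}{\omega + 1} = \frac{k}{d} = q .
\end{equation*}
For example, $q = 0.1$ would correspond to $\omega = 9$ and $p = 0.1$.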
% According to results summarized in 
% Figure~\ref{fig:r_diana-diana} summarizes obtained results. 

The left plot in Figure~\ref{fig:r_diana-diana} clearly shows that \RDIANA performs better than \DIANA for every value of the \texttt{Rand-K} compressor parameter. It is worth noting that \DIANA performs better for larger $q$, while the opposite holds for \RDIANA.
% The difference is especially remarkable for higher compression. 

From the right plot in Figure~\ref{fig:r_diana-diana}, one can see that \DIANA with \texttt{ND} can be superior to \RDIANA for the optimized parameter $s^\star$. Nevertheless, \RDIANA is highly preferable for very aggressive compression (e.g., $s=2$).

% {\bf Observation: } Figure~\ref{fig:rand_diana} clearly shows better experimental performance of \DIANA in comparison to \algname{Rand-DIANA} with $p = 1/(\omega+1)$ both for Random Sparsification and Natural Dithering compressors.

In the next experimental setup, we investigate the behavior of \RDIANA with respect to its parameters more closely.

\subsection{\algname{Randomized-DIANA} Study}
% In the next experimental set up we take a closer look at \RDIANA behavior.
% In this experimental setup, we closer investigate the behavior of \algname{Randomized-DIANA}.

According to Theorem~\ref{thm:r_diana_shift}, the constant $M$ has to be strictly greater than $M^\prime \eqdef 2 \omega/(n p)$. In the left plot of Figure~\ref{fig:rand_diana}, we set $M = M^\prime \cdot b$ and show that the method becomes less stable, and can even diverge, for smaller values of $b$. However, an overly large $M$ (for $b=1.5$) can lead to stable but slower convergence overall. We conclude that the condition imposed by the theoretical analysis is indeed critical.
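
To give a sense of scale, consider a hypothetical configuration (assumed here for illustration, and not necessarily the one used in the plot): $n = 10$, taking $n$ to be the number of workers, and a compressor with $\omega = 9$ together with $p = 1/(\omega+1)$. Then
\begin{equation*}
    M^\prime = \frac{2\omega}{np} = \frac{2\omega(\omega+1)}{n} = \frac{2 \cdot 9 \cdot 10}{10} = 18,
\end{equation*}
so that $b = 1.5$ corresponds to $M = 27$, while any $b \leq 1$ violates the requirement $M > M^\prime$.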

The right plot in Figure~\ref{fig:rand_diana} examines how the parameter $p$ affects convergence in a high-compression regime ($q=0.1$). The method converges faster for smaller $p$ and can diverge above a certain threshold, similar to the trade-off observed for $M$ in the previous study.
% for overly very small $p$ the method can diverge
% for too small values of $M$ set to $M^\prime \cdot b$ the method can diverge and. The right plot 

We did not conduct additional experiments to show the effect of combining unbiased compressors with biased counterparts, as the benefits of such an approach have already been clearly demonstrated by \cite{horvath2021better} for distributed training of deep neural networks.

% \section{CONCLUSION}
% \newpage

\iffalse
\begin{figure}[H]
\centering
\begin{subfigure}{0.5\textwidth}
    \includegraphics[width=\linewidth]{plots/distributed/rand_diana/ridge-synthetic_vary-b.pdf}
    % \includegraphics[width=0.5\linewidth]{plots/distributed/rand_diana/ridge-synthetic_spars-avg_b.pdf}
    \label{fig:rand_diana_divergence}
    \caption{Divergence of \algname{Rand-DIANA} for small $b$.}
\end{subfigure}
\begin{subfigure}{0.5\textwidth}
    \includegraphics[width=\linewidth]{plots/distributed/rand_diana/ridge-synthetic_spars-avg_b.pdf}
    \label{fig:rand_diana_avg_b}
    \caption{Averaged 10 runs \algname{Rand-DIANA}.}
\end{subfigure}
\caption{Robustness of \algname{Rand-DIANA} for small $q$ with respect to the choice of multiplicative for $M$ factor $b$.}
\label{fig:rand_diana}
\end{figure}

% \iffalse
\begin{figure}[H]
\centering
\begin{subfigure}{\textwidth}
    \includegraphics[width=0.5\linewidth]{plots/distributed/rand_diana/ridge-synthetic_sparse-q-0.1_vary-p.pdf}
    \includegraphics[width=0.5\linewidth]{plots/distributed/rand_diana/ridge-synthetic_sparse-q-0.3_vary-p.pdf}
    % \includegraphics[width=0.5\linewidth]{plots/distributed/rand_diana/ridge-synthetic_sparse-q-0.5_vary-p.pdf}
    % \includegraphics[width=0.5\linewidth]{plots/distributed/rand_diana/ridge-synthetic_sparse-q-0.7_vary-p.pdf}
\end{subfigure}
\caption{Performance of \algname{Rand-DIANA} with Random Sparsification for different values of $p$.}
\label{fig:rand_diana_p}
\end{figure}
\fi

\iffalse
\subsection{\algname{VR-GDCI} vs \DIANA for Single Node}
Next, we employ the same set-up as in Section~\ref{section:exp} for $A\in \bR^{150 \times 100}$ and 10 workers.
\begin{figure*}[h]
\centering
\begin{subfigure}{\textwidth}
    \includegraphics[width=0.5\linewidth]{plots/single_node/iterates/gdci-randk_bits.pdf}
    \includegraphics[width=0.5\linewidth]{plots/single_node/iterates/gdci-nat_bits.pdf}
\end{subfigure}
\caption{\textbf{Left plot}: \DIANA and \algname{VR-GDCI} with different $q$.
\textbf{Right plot}: Grid search for $s$ over $[20]$ of \DIANA and \algname{VR-GDCI} with optimal $s$.}
\label{fig:vr_gdci}
\end{figure*}
% \footenotetext{$[k] \eqdef \{2, \dots, k\}$ in this cases}

{\bf Observation: } The left plot in Figure~\ref{fig:vr_gdci} clearly shows the better experimental performance of \DIANA in comparison to \algname{VR-GDCI} for any level of Sparsification. We used
Natural Dithering compressors for the right plot. Interestingly, the performance of \algname{VR-GDCI} is almost the same for all values of $s$ - the number of dithering levels. Thus, \algname{VR-GDCI} can be significantly better for very small (e.g., $s \in \{2, 3, 4\}$) numbers of dithering levels.
\fi


\begin{acknowledgements} % will be removed in pdf for initial submission,
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
    We would like to thank the anonymous reviewers, Laurent Condat, and Konstantin Mishchenko for their helpful comments and suggestions to improve the manuscript.
    % Briefly acknowledge people and organizations here.

    % \emph{All} acknowledgements go in this section.
\end{acknowledgements}

% \newpage

\bibliography{shulgin_106}

% \appendix
% NOTE: necessary when ptmx or no mathfont class option is given

\end{document}
