\section{Experiments}
\label{sec:experiments}

%\begin{figure}[t]
%\centering
%\includegraphics[width=\linewidth]{UAI/figures/dme_linf.pdf}  
%\caption{Distributed mean estimation for scalar data. See text for details.}
%\label{fig:dme_single_comparison}
%\end{figure}

We evaluate the MVU mechanism in two sets of experiments: distributed mean estimation and federated learning. Our goal is to demonstrate that MVU can attain a better privacy-utility trade-off at low communication budgets than other private compression mechanisms.
Code to reproduce our results is available at \url{https://github.com/facebookresearch/dp_compression}.

\subsection{Distributed mean estimation}
\label{sec:dme}

In distributed mean estimation (DME), a set of $n$ clients each holds a private vector $\bx_i \in \mathbb{R}^d$, and the server would like to privately estimate the mean $\bar{\bx} = \frac{1}{n} \sum_{i=1}^n \bx_i$.

\paragraph{Scalar DME.} We first consider the setting of scalar data, \emph{i.e.}, $d=1$. For a fixed value $x \in [-1,1]$, we set $\bx_i = x$ for all $i=1,\ldots,n$ with $n=100{,}000$, and then privatize each $\bx_i$ before averaging. We measure the squared difference between the private estimate and $\bar{\bx} = x$, which is coincidentally the variance of the mechanism at $x$. The baseline mechanisms that we evaluate against are (unbiased) Bitwise Randomized Response (bRR), (unbiased) Generalized Randomized Response (gRR), the communication-limited local differentially private (CLDP) mechanism~\citep{girgis2021shuffled}, and the Laplace mechanism without any compression. The CLDP mechanism uses a fixed communication budget of $b=1$, whereas for bRR and gRR we set $b=3$, and for MVU we evaluate both $b=1$ and $b=3$. %\mike{Could a reviewer argue that this isn't fair to CLDP since the other methods get more bits? Would it be too crowded to show MVU with both $b=1$ and 3? At least we should remind the reader that CLDP doesn't offer the flexibility to select different bit rates.}
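
As a brief sanity check on this metric, suppose a mechanism, written here as $\mathcal{M}$ (notation introduced only for this sketch), is unbiased and applied independently to each client. Under these assumptions the expected squared error of the averaged estimate is
\[
\mathbb{E}\Big[\Big(\frac{1}{n}\sum_{i=1}^n \mathcal{M}(\bx_i) - x\Big)^2\Big]
= \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}\big(\mathcal{M}(x)\big)
= \frac{\mathrm{Var}\big(\mathcal{M}(x)\big)}{n},
\]
so the measured error tracks the per-client variance at $x$ up to the factor $1/n$, which is shared by all mechanisms being compared.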

Figure \ref{fig:dme_scalar_comparison} plots the variance of each private mechanism against the input value $x$.
%As expected, all mechanisms become more accurate with lower variance as $\epsilon$ increases.
Interestingly, MVU with $b=1$ recovers the CLDP mechanism for $\epsilon=1,3,5$, while MVU with $b=3$ is consistently the lowest-variance private compression mechanism. For larger $\epsilon$, the variances of both gRR and MVU are comparable to, or even slightly lower than, that of the Laplace mechanism, even though their outputs are compressed to only $b=3$ bits.

\begin{figure*}[t]
    \centering
    \begin{subfigure}{.49\textwidth}
      \centering
      \includegraphics[width=\linewidth]{UAI/figures/dme_l1.pdf}  
    \end{subfigure}
    \begin{subfigure}{.49\textwidth}
      \centering
      \includegraphics[width=\linewidth]{UAI/figures/dme_l2.pdf}  
      \end{subfigure}
    \caption{Distributed mean estimation for $n=10{,}000$ data vectors with $L_1$- (left) and $L_2$-sensitivity (right). Error bars represent standard deviation across $10$ repeated runs with different private vectors. Methods that are $(\epsilon,\delta)$-DP use the same value of $\delta=1/(n+1)$. The MVU mechanism can attain an MSE close to that of the Laplace and Gaussian mechanisms while compressing the output to only $b=3$ bits per coordinate.}
    \label{fig:dme_comparison}
\end{figure*}

\paragraph{Vector DME.} We next look at vector data with $d=128$ and $n=10{,}000$. We draw the sensitive vectors from two distinct distributions\footnote{We intentionally avoid zero-mean distributions since some of the private mechanisms converge to the all-zero vector as $\epsilon \rightarrow 0$.}: (i) uniformly at random from $[0,1]^d$, followed by normalization to unit $L_1$-norm; and (ii) uniformly over the spherical sector $\mathbb{S}^{d-1} \cap \mathbb{R}_{\geq 0}^d$. In these settings, the vectors $\bx_i$ have $L_1$- and $L_2$-sensitivity of 1, respectively.
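
One way to realize these two distributions, given only as a sketch since the sampling procedure is not spelled out here, uses auxiliary vectors $\mathbf{u}$ and $\mathbf{g}$ (notation assumed for this illustration):
\[
\mathrm{(i)}\ \ \bx_i = \frac{\mathbf{u}}{\|\mathbf{u}\|_1},\ \ \mathbf{u} \sim \mathrm{Unif}\big([0,1]^d\big);
\qquad
\mathrm{(ii)}\ \ \bx_i = \frac{|\mathbf{g}|}{\|\mathbf{g}\|_2},\ \ \mathbf{g} \sim \mathcal{N}(0, I_d),
\]
where $|\mathbf{g}|$ denotes the coordinate-wise absolute value. By construction $\|\bx_i\|_1 = 1$ in (i) and $\|\bx_i\|_2 = 1$ in (ii); in (ii), the rotational and sign symmetry of the Gaussian makes $\bx_i$ uniform over the non-negative sector of the sphere.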

For baselines, we consider the CLDP mechanism~\citep{girgis2021shuffled}, the Skellam mechanism~\citep{agarwal2021skellam}, the Laplace mechanism (for setting (i) only), and the Gaussian mechanism (for setting (ii) only). Both the Skellam and the Gaussian mechanisms are $(\epsilon,\delta)$-DP for $\delta > 0$. For a given $\epsilon>0$, we set $\delta = 1/(n+1)$ and choose the noise parameter $\mu$ for the Skellam mechanism using the optimal RDP conversion, and the noise parameter $\sigma$ for the Gaussian mechanism using the analytical conversion in \citet{balle2018improving}.
For the communication budget, we set $b=3$ for MVU and $b=16$ for Skellam (which requires a large $b$ in order to prevent truncation error). The CLDP mechanism does not allow flexible selection of the communication budget; instead, it outputs a \emph{total} of $\log_2(d) + 1 = 8$ bits in the $L_1$-sensitivity setting, and $b = \log_2(d) + 1 = 8$ bits \emph{per coordinate} in the $L_2$-sensitivity setting. See Appendix \ref{sec:experiment_details} for a more detailed explanation.
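
To make the comparison concrete, the budgets stated above translate into the following per-vector communication for $d=128$ (a simple tally that ignores any protocol overhead):
\[
\begin{array}{ll}
\mathrm{MVU}: & 3 \times 128 = 384 \ \mathrm{bits}, \\
\mathrm{Skellam}: & 16 \times 128 = 2048 \ \mathrm{bits}, \\
\mathrm{CLDP}\ (L_2): & 8 \times 128 = 1024 \ \mathrm{bits}, \\
\mathrm{CLDP}\ (L_1): & \log_2(128) + 1 = 8 \ \mathrm{bits}.
\end{array}
\]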
%of hyperparameter choice.

Figure \ref{fig:dme_comparison} shows the mean squared error (MSE) for privately estimating $\bar{\bx}$ across different values of $\epsilon$. In the left plot, corresponding to the $L_1$-sensitivity setting, MVU attains an MSE close to that of the Laplace mechanism at a greatly reduced budget of $b=3$ bits per coordinate. In comparison, CLDP and Skellam attain an MSE that is more than an order of magnitude higher than that of Laplace.

The right plot corresponds to $L_2$-sensitivity. Here, the MVU mechanism (dark green line) is significantly less competitive than the baselines. This is because the $L_2$-metric DP constraint forces the rows of the sampling probability matrix $P$ to be near-identical, so $P$ is near-singular and does not admit a well-conditioned unbiased solution.
To address this problem, we instead optimize the MVU mechanism to satisfy $L_1$-metric DP, use the R\'{e}nyi accounting in Section \ref{sec:composition} to compute its RDP guarantee, and then apply an RDP-to-DP conversion to obtain an $(\epsilon,\delta)$-DP guarantee at $\delta=1/(n+1)$. The light green line shows the performance of this $L_1$-metric DP mechanism, which slightly outperforms both CLDP and Skellam at a much lower communication budget of $b=3$. These results demonstrate that the MVU mechanism attains a better utility-compression trade-off for vector data as well.
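
For reference, the classical RDP-to-DP conversion illustrates this last step; tighter conversions exist, and the symbols $\alpha$ and $\epsilon_\alpha$ are assumed here purely for the sketch. A mechanism satisfying $(\alpha, \epsilon_\alpha)$-RDP for some order $\alpha > 1$ also satisfies $(\epsilon,\delta)$-DP with
\[
\epsilon = \epsilon_\alpha + \frac{\log(1/\delta)}{\alpha - 1},
\]
and one reports the smallest such $\epsilon$ over the available orders $\alpha$. With $\delta = 1/(n+1)$, the second term becomes $\log(n+1)/(\alpha-1)$.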

\begin{figure*}[t]
    \centering
    \begin{subfigure}{.49\textwidth}
      \centering
      \includegraphics[width=\linewidth]{UAI/figures/dpsgd_mnist_linear.pdf}  
    \end{subfigure}
    \begin{subfigure}{.49\textwidth}
      \centering
      \includegraphics[width=\linewidth]{UAI/figures/dpsgd_cifar_linear.pdf}  
      \end{subfigure} 
\caption{DP-SGD training with the Gaussian mechanism, stochastic signSGD, and the MVU mechanism on MNIST (left) and CIFAR-10 (right). Each point corresponds to a single hyperparameter setting, and the dashed line shows the Pareto frontier of the privacy-utility trade-off. The MVU mechanism outperforms signSGD at the same communication budget of $b=1$.}
\label{fig:dpsgd_comparison}
\end{figure*}


\subsection{Private SGD}
\label{sec:fl}

Federated learning (FL)~\citep{mcmahan2017communication} often employs DP to protect the privacy of the clients' updates. We next evaluate the MVU mechanism for this use case and show that it can serve as a drop-in replacement for the Gaussian mechanism in FL protocols, providing similar DP guarantees for the client update while reducing communication.

In detail, for MNIST and CIFAR-10~\citep{krizhevsky2009learning}, we train a linear classifier on top of features extracted by a scattering network~\citep{oyallon2015deep} similar to the one used in \citet{tramer2020differentially}; see Appendix \ref{sec:experiment_details} for details.
The base private learning algorithm is DP-SGD with Gaussian gradient perturbation~\citep{abadi2016deep} and R\'{e}nyi-DP accounting. The private compression mechanisms we compare are the MVU mechanism with budget $b=1$ and stochastic signSGD~\citep{jin2020stochastic}, a specialized private gradient compression scheme for federated SGD that applies the Gaussian mechanism and outputs its coordinate-wise sign. As in the distributed mean estimation experiment with $L_2$-sensitivity, we optimize the MVU mechanism to satisfy $L_1$-metric DP and then compute its R\'{e}nyi privacy guarantee as in Section \ref{sec:composition}.
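
In symbols, the description of stochastic signSGD above can be sketched as follows, where $\mathbf{g} \in \mathbb{R}^d$ denotes a clipped gradient and $\sigma$ the Gaussian noise scale (both are notational assumptions made for this illustration):
\[
\tilde{\mathbf{g}} = \mathrm{sign}(\mathbf{g} + \mathbf{z}), \qquad \mathbf{z} \sim \mathcal{N}(0, \sigma^2 I_d),
\]
so that each coordinate is released as a single bit with $\Pr[\tilde{g}_j = +1] = \Phi(g_j/\sigma)$, where $\Phi$ is the standard normal CDF.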

Figure \ref{fig:dpsgd_comparison} shows the privacy-utility trade-off curves. We sweep over a grid of hyperparameters (see Appendix \ref{sec:experiment_details} for details) for each mechanism and plot the resulting $\epsilon$ and test accuracy as a point in the scatter plot. The dashed line is the Pareto frontier of the optimal privacy-utility trade-off. The results show that the MVU mechanism outperforms signSGD, a gradient compression mechanism designed specifically for federated learning, at nearly all privacy budgets with the same communication cost of \emph{one bit per coordinate}. We include an additional result for a small convolutional network in Appendix \ref{sec:experiment_details}, where we observe similar findings.
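
To be explicit about the frontier, write $(\epsilon_j, a_j)$ for the privacy cost and test accuracy of the $j$-th hyperparameter setting (notation introduced here for illustration only). Under the usual definition, setting $j$ lies on the Pareto frontier when there is no other setting $k$ that is at least as private and at least as accurate, that is, no $k$ with
\[
\epsilon_k \le \epsilon_j, \qquad a_k \ge a_j, \qquad (\epsilon_k, a_k) \neq (\epsilon_j, a_j).
\]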
% \mike{In the scatter$+$linear experiment, if we run MVU with $b=2$, how much better is the tradeoff compared to $b=1$? If $b=2$ is nearly as good as Gaussian, that could be interesting, showing that if you can tolerate twice the bandwidth, you can nearly match the infinite-bit Gaussian mechanism (at least, at lower $\epsilon$).} \cg{Not noticeably better. I think we need central DP level of noise before unbiasedness becomes critical for model performance.} \mike{Got it, makes sense. Don't need to worry about it now, I would be curious if there's any communication budget for MVU that gets performance closer to the Gaussian mechanism in that setting.}
