\section{Experimental Details}
\label{sec:experiment_details}

\subsection{Vector dithering}

The optimization program in the MVU mechanism operates on numbers on a discrete grid, which are obtained by dithering. In the scalar case, we apply the standard dithering procedure to $x \in [0, 1]$. For vectors, we dither each coordinate independently. While this yields an unbiased output, it may increase the norm of the vector. Lemma \ref{lem:vector_dithering} bounds this increase with high probability.
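
As a concrete illustration, suppose the standard procedure is stochastic rounding to the two neighboring grid points (this is an assumption made for the example; the symbols $\lambda$, $a$, and $\Delta$ are chosen to match the notation of the proof of Lemma \ref{lem:vector_dithering}). Writing $x = \lambda + a$, where $\lambda$ is the largest grid point $\leq x$, $\Delta$ is the grid spacing, and $a \in [0, \Delta)$, we would have
\[ \Dither(x) = \begin{cases} \lambda + \Delta & \text{w.p. } a/\Delta, \\ \lambda & \text{w.p. } 1 - a/\Delta, \end{cases} \qquad \bbE[\Dither(x)] = x, \qquad \mathrm{Var}(\Dither(x)) = a(\Delta - a) \leq \Delta^2/4. \]
For instance, with a hypothetical grid of five points on $[0,1]$ (so $\Delta = 0.25$), the input $x = 0.6$ has $\lambda = 0.5$ and $a = 0.1$; it is rounded up to $0.75$ with probability $0.4$ and down to $0.5$ with probability $0.6$, giving mean $0.5 + 0.4 \cdot 0.25 = 0.6$ and variance $0.1 \cdot 0.15 = 0.015 \leq \Delta^2/4 \approx 0.0156$.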

\begin{lemma} \label{lem:vector_dithering}
Let $v$ be a vector such that $\|v\| \leq 1$ and $v_i \in [-1, 1]$ for each coordinate $i$. Let $v'$ be the vector obtained by dithering each coordinate of $v$ to a grid of size $B$ (so that the difference between any two grid points is $\Delta = \frac{2}{B-1}$). Then, with probability $\geq 1 - \delta$,
\[ \|v'\|^2 \leq \|v\|^2 +  \sqrt{2} \|v\| \Delta \log(4/\delta) + d \Delta^2/4 + \sqrt{2 d} \Delta \log (4/\delta).   \]
\end{lemma}

\begin{proof}
Let $\Delta = \frac{2}{B-1}$ be the difference between any two adjacent grid points. For a coordinate $i$, let $v_i = \lambda_i + a_i$, where $\lambda_i$ is the largest grid point that is $\leq v_i$ and $a_i \in [0, \Delta)$. We also let $v'_i = \lambda_i + Z_i$; by the dithering algorithm, $Z_i \in \{0, \Delta\}$ with $\bbE[Z_i] = a_i$. Moreover, since $Z_i$ takes values in an interval of length $\Delta$, $\mathrm{Var}(Z_i) \leq \frac{\Delta^2}{4}$.

Next, observe that $\|v'\|^2 = \sum_{i} (\lambda_i + Z_i)^2 = \sum_i (\lambda_i^2 + 2 \lambda_i Z_i + Z_i^2)$, and similarly $\|v\|^2 = \sum_i (\lambda_i^2 + 2 \lambda_i a_i + a_i^2)$.
Subtracting the two expressions gives:
\[ \|v'\|^2 - \|v\|^2 = \sum_{i}(Z_i^2 - a_i^2) + \sum_{i} 2 \lambda_i (Z_i - a_i) \]
We next bound these terms one by one.
To bound the second term, we observe that $\bbE[Z_i] = a_i$ and apply Hoeffding's inequality. This gives us: 
\[ \Pr\left(\sum_i \lambda_i Z_i \geq \sum_i \lambda_i a_i + t\right) \leq 2 \exp\left(-\frac{t^2}{2 \Delta^2 \sum_i \lambda_i^2}\right). \]
Plugging in $t = \sqrt{2 \sum_i \lambda_i^2}\, \Delta \log(4/\delta)$ makes the right-hand side $\leq \delta/2$. To bound the first term, we again use Hoeffding's inequality:
\[ \Pr\left(\sum_i Z_i^2 \geq \sum_i \bbE[Z_i^2] + t\right) \leq 2 \exp\left(-\frac{t^2}{2 d \Delta^2}\right). \]
Plugging in $t = \sqrt{2 d}\, \Delta \log (4/\delta)$ makes the right-hand side $\leq \delta/2$. Therefore, by a union bound, with probability $\geq 1 - \delta$,
\begin{align*}
    \|v'\|^2 &\leq \|v\|^2 + \sqrt{2 \sum_i \lambda_i^2}\, \Delta \log(4/\delta) + \sum_i (\bbE[Z_i^2] - a_i^2) \\ &\quad + \sqrt{2 d}\, \Delta \log (4/\delta).
\end{align*}
Observe that $\bbE[Z_i^2] - a_i^2 = \mathrm{Var}(Z_i) \leq \Delta^2/4$; additionally, $\sum_i \lambda_i^2 \leq \|v\|^2$. Therefore, we get:
\[ \|v'\|^2 \leq \|v\|^2 + \sqrt{2} \|v\| \Delta \log(4/\delta) + d \Delta^2/4 + \sqrt{2 d} \Delta \log (4/\delta). \]

The lemma follows. 
\end{proof}
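
To give a sense of scale, the bound of Lemma \ref{lem:vector_dithering} can be evaluated at hypothetical values chosen only for illustration: $\|v\| = 1$, $d = 128$, $B = 512$ (so $\Delta = 2/511 \approx 3.9 \times 10^{-3}$), and $\delta = 0.01$, taking $\log$ to be the natural logarithm (an assumption). Then $\log(4/\delta) \approx 5.99$ and
\[ \|v'\|^2 \leq \underbrace{1}_{\|v\|^2} + \underbrace{0.033}_{\sqrt{2} \|v\| \Delta \log(4/\delta)} + \underbrace{0.0005}_{d \Delta^2 / 4} + \underbrace{0.375}_{\sqrt{2d} \Delta \log(4/\delta)} \approx 1.41, \]
so that $\|v'\| \lesssim 1.19$. In this example the term $\sqrt{2d}\, \Delta \log(4/\delta)$ dominates the increase, which suggests that the grid must become finer as the dimension grows in order to keep the norm inflation small.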

In practice, given an \emph{a priori} norm bound $\| v \| \leq R$ for all input vectors $v$, we estimate a scaling factor $\gamma \in [0, 1]$ and apply dithering to the input $\gamma v$ so that $\| \Dither(\gamma v) \| \leq R$ with high probability. This can be done by choosing a confidence level $\delta > 0$ and solving for $\sup \{\gamma \in [0,1] : \| \Dither(\gamma v) \| \leq R \text{ w.p. } \geq 1-\delta\}$ via binary search. Since dithering is randomized, we can then perform rejection sampling until the condition $\| \Dither(\gamma v) \| \leq R$ is met. Doing so incurs a small bias that is insignificant in practical applications. We leave the design of more sophisticated vector dithering techniques that simultaneously preserve unbiasedness and the norm bound for future work.
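
As a rough sanity check on the range in which $\gamma$ should fall (a sketch that follows from coordinate-wise unbiasedness, not a description of our implementation), take $R = 1$ and the grid of Lemma \ref{lem:vector_dithering}. Since each coordinate of $\Dither(\gamma v)$ has mean $\gamma v_i$ and variance at most $\Delta^2/4$,
\[ \bbE \|\Dither(\gamma v)\|^2 = \gamma^2 \|v\|^2 + \sum_i \mathrm{Var}(\Dither(\gamma v)_i) \leq \gamma^2 \|v\|^2 + d \Delta^2 / 4. \]
Hence, for $\|v\| = 1$, the expected squared norm is at most $1$ whenever $\gamma^2 \leq 1 - d \Delta^2/4$. With the hypothetical values $d = 128$ and $\Delta = 2/511$, this gives $\gamma \leq \sqrt{1 - 0.00049} \approx 0.9998$. This controls only the expected squared norm; the high-probability condition used in the binary search is generally more demanding.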

\subsection{Connection between DP and compression}

We highlight an interesting effect of differentially private noise on the required communication budget. Figure \ref{fig:p_samples_comparison} shows the optimized sampling probability matrix $P$ for the MVU mechanism with a fixed input quantization level $\bin=5$ and various values of $\bout$. As $\bout$ increases, the overall structure of the matrix $P$ remains nearly the same but becomes more refined. Moreover, the bottom right plot shows that the marginal improvement in MSE diminishes as $\bout$ increases. This observation suggests that, for a given $\epsilon$, increasing the communication budget eventually stops improving aggregation accuracy: the information in the data is obscured by the DP mechanism and can therefore be communicated with fewer bits.

\subsection{Distributed mean estimation}

For the vector distributed mean estimation experiment in Section \ref{sec:dme}, the different private compression mechanisms used different values of the communication budget $b$. We justify the choice of $b$ as follows.

\paragraph{$L_1$-sensitivity setting.} CLDP outputs a \emph{total} of $\log_2(d) + 1 = 8$ bits, which is lower than that of both Skellam and MVU and cannot be tuned. Skellam performs truncation to the range $\{-2^{b-1}, \ldots, 2^{b-1}-1\}$ after perturbing the quantized input with Skellam noise, and hence requires a value of $b$ that is large enough to prevent truncation error. We intentionally afforded Skellam a large budget of $b=16$ so that truncation error rarely occurs, and show that even in this setting MVU can outperform Skellam in terms of estimation MSE. For MVU, we chose $\bin=9$, which is the minimum value required to avoid a large quantization error, and $b=\bout=3$.

\paragraph{$L_2$-sensitivity setting.} CLDP uses a communication budget of $b=\log_2(d)+1=8$ \emph{per coordinate} and is not tunable. We used the same $b=16$ budget for Skellam as in the $L_1$-sensitivity setting. For MVU, we chose $\bin=5$ and $b=\bout=3$ for both the $L_1$- and $L_2$-metric DP versions, which results in a communication budget that is lower than both CLDP and Skellam. For the $L_1$-metric DP version, we found that optimizing MVU to satisfy $(\epsilon/2)$-metric DP with respect to the $L_1$ metric results in an $(\epsilon',\delta)$-DP mechanism with $\epsilon' \approx \epsilon$ and $\delta=1/(n+1)$ after optimal RDP conversion.
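
As a back-of-the-envelope comparison, assume $d = 128$ (the value implied by $\log_2(d) + 1 = 8$) and that $b$ counts bits per coordinate for Skellam and MVU. Under these assumptions, the total number of bits communicated per vector would be
\begin{align*}
    \text{CLDP ($L_1$ setting):} \quad & \log_2(d) + 1 = 8, \\
    \text{CLDP ($L_2$ setting):} \quad & 8 \cdot 128 = 1024, \\
    \text{Skellam:} \quad & 16 \cdot 128 = 2048, \\
    \text{MVU:} \quad & 3 \cdot 128 = 384.
\end{align*}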

\begin{table*}[t]
    \begin{minipage}{.5\linewidth}
      \centering
        \resizebox{\textwidth}{!}{
        \begin{tabular}{ll}
            \toprule
            \textbf{Layer} & \textbf{Parameters} \\
            \midrule
            ScatterNet & Scale $J=2$, $L=8$ angles, depth 2 \\
            GroupNorm~\citep{wu2018group} & 6 groups of 24 channels each \\
            Fully connected & 10 units \\
            \bottomrule
        \end{tabular}
        }
        \caption{Architecture for scatter + linear model.}
        \label{tab:scatter_linear}
    \end{minipage}%
    \begin{minipage}{.5\linewidth}
      \centering
        \resizebox{\textwidth}{!}{
        \begin{tabular}{ll}
            \toprule
            \textbf{Layer} & \textbf{Parameters} \\
            \midrule
            Convolution $+\tanh$ & 16 filters of $8 \times 8$, stride 2, padding 2 \\
            Average pooling & $2 \times 2$, stride 1 \\
            Convolution $+\tanh$ & 32 filters of $4 \times 4$, stride 2, padding 0 \\
            Average pooling & $2 \times 2$, stride 1 \\
            Fully connected $+\tanh$ & 32 units \\
            Fully connected $+\tanh$ & 10 units \\
            \bottomrule
        \end{tabular}
        }
        \caption{Architecture for convolutional network model.}
        \label{tab:convnet}
    \end{minipage} 
\end{table*}

\begin{table*}[t]
    \begin{minipage}{.47\linewidth}
    \centering
    \resizebox{\textwidth}{!}{
    \begin{tabular}{ll}
        \toprule
        \textbf{Hyperparameter} & \textbf{Values} \\
        \midrule
        Batch size & $600$ \\
        Momentum & $0.5$ \\
        \# Iterations $T$ & $500,1000,2000,3000,5000$ \\
        Noise multiplier $\sigma$ for Gaussian and signSGD & $0.5, 1, 2, 3, 5$ \\
        $L_1$-metric DP parameter $\epsilon$ for MVU & $0.25, 0.5, 0.75, 1, 2, 3, 5$ \\
        Step size $\rho$ & $0.01, 0.03, 0.1$ \\
        Gradient norm clip $C$ & $0.25, 0.5, 1, 2, 4, 8$ \\
        \bottomrule
    \end{tabular}
    }
    \caption{Hyperparameters for DP-SGD on MNIST.}
    \label{tab:hyp_mnist}
    \end{minipage}%
    \begin{minipage}{.53\linewidth}
    \centering
    \resizebox{\textwidth}{!}{
    \begin{tabular}{ll}
        \toprule
        \textbf{Hyperparameter} & \textbf{Values} \\
        \midrule
        Batch size & $500$ \\
        Momentum & $0.5$ \\
        \# Iterations $T$ & $1000,2000,3000,5000,10000,15000$ \\
        Noise multiplier $\sigma$ for Gaussian and signSGD & $0.5, 1, 2, 3, 5$ \\
        $L_1$-metric DP parameter $\epsilon$ for MVU & $0.25, 0.5, 0.75, 1, 2, 3, 5$ \\
        Step size $\rho$ & $0.01, 0.03, 0.1$ \\
        Gradient norm clip $C$ & $0.25, 0.5, 1, 2, 4, 8$ \\
        \bottomrule
    \end{tabular}
    }
    \caption{Hyperparameters for DP-SGD on CIFAR-10.}
    \label{tab:hyp_cifar}
    \end{minipage}
\end{table*}

\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{UAI/figures/dpsgd_mnist_cnn.pdf}  
\caption{DP-SGD training of a small convolutional network on MNIST with the Gaussian mechanism, stochastic signSGD, and the MVU mechanism. Each point corresponds to a single hyperparameter setting, and the dashed line shows the Pareto frontier of the privacy-utility trade-off.}
\label{fig:dpsgd_comparison_convnet}
\end{figure}

\subsection{Private SGD}

In Section \ref{sec:fl}, we trained a linear model on top of features extracted by a scattering network\footnote{We used the Kymatio library \url{https://github.com/kymatio/kymatio} to implement the scattering transform.} on the MNIST dataset. In addition, we consider a convolutional network with $\tanh$ activation, which has been found to be more suitable for DP-SGD~\citep{papernot2020tempered}. We give the architecture details of both models in Tables \ref{tab:scatter_linear} and \ref{tab:convnet}.

\paragraph{Hyperparameters.} DP-SGD has several hyperparameters, and we exhaustively test all combinations of settings to produce the scatter plots in Figures \ref{fig:dpsgd_comparison} and \ref{fig:dpsgd_comparison_convnet}. Tables \ref{tab:hyp_mnist} and \ref{tab:hyp_cifar} give the values that we considered for each hyperparameter.
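
To indicate the size of this search, assume that the grid is the full Cartesian product of the values in Tables \ref{tab:hyp_mnist} and \ref{tab:hyp_cifar}, with the noise multiplier $\sigma$ varied only for the Gaussian mechanism and signSGD and the parameter $\epsilon$ varied only for MVU. Under this assumption, the number of settings per mechanism, counted as (\# values of $T$) $\times$ (\# values of $\sigma$ or $\epsilon$) $\times$ (\# values of $\rho$) $\times$ (\# values of $C$), would be
\begin{align*}
    \text{MNIST, Gaussian or signSGD:} \quad & 5 \cdot 5 \cdot 3 \cdot 6 = 450, \\
    \text{MNIST, MVU:} \quad & 5 \cdot 7 \cdot 3 \cdot 6 = 630, \\
    \text{CIFAR-10, Gaussian or signSGD:} \quad & 6 \cdot 5 \cdot 3 \cdot 6 = 540, \\
    \text{CIFAR-10, MVU:} \quad & 6 \cdot 7 \cdot 3 \cdot 6 = 756.
\end{align*}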

\paragraph{Result for convolutional network.} Figure \ref{fig:dpsgd_comparison_convnet} compares DP-SGD training with the Gaussian mechanism, stochastic signSGD, and the MVU mechanism with $b=1$. The experimental setting is identical to that of Figure \ref{fig:dpsgd_comparison}, except that the model is a small convolutional network trained end-to-end. As before, MVU recovers the performance of signSGD at an equal communication budget of $b=1$.