\section{Learning To Invert: Learning-based Gradient Inversion Attacks}
\subsection{Problem Set-Up}
\textbf{Motivation.} The threat of gradient inversion attacks has prompted prior work to employ defense mechanisms to mitigate this privacy risk in FL~\citep{zhu2019deep, jeon2021gradient}. Intuitively, such defenses reduce the amount of information the gradient contains about the training sample, either by perturbing the gradient with noise~\citep{abadi2016deep} or by compressing it~\citep{aji2017sparse, bernstein2018signsgd}, making recovery much more difficult. However, doing so also reduces the amount of information a sample can provide for training the global model, and hence has a negative impact on the model's performance. This is certainly true for principled defenses based on differential privacy~\citep{dwork2006calibrating} such as gradient perturbation~\citep{abadi2016deep}. In contrast, defenses based on gradient compression seemingly provide a much better privacy-utility trade-off, effectively preventing the attack and reducing communication costs with only a minor reduction in model performance~\citep{zhu2019deep}.
% In practical FL applications, gradient compression~\citep{aji2017sparse, bernstein2018signsgd} also leads to 

The empirical success of existing defenses seemingly diminishes the threat of gradient inversion attacks in FL.
% , and hence enjoy widespread use in practical FL applications.
% especially since gradient compression~\citep{aji2017sparse, bernstein2018signsgd} is already commonplace in practical FL applications to reduce communication cost.
However, we argue that optimization-based attacks underestimate the power of the adversary: If the adversary has access to an auxiliary dataset $\calD_{\rm aux}$, they can train a \emph{gradient inversion model} to recover $\calD_{\rm aux}$ from its gradients computed on the global model.
%This approach---which we call \emph{Learning To Invert} (LTI)---is highly adaptable to the learning task, and any defense applied to the gradient can be treated as data augmentation for the gradient inversion model.
As we will establish later, this greatly empowers the adversary, exposing considerable risks to federated learning.

%Previously, this threat may have been underestimated, because with optimization-based methods it is hard to leverage such side information. The adversary has to incorporate their knowledge of the data through handcrafted regularizers~\citep{geiping2020inverting} or modifications to the loss function~\citep{deng2021tag}. We claim that it seems more natural to assume that the adversary can use the data to \textit{learn} a model for gradient inversion.

\textbf{Threat model.} 
We consider the setting where the adversary is an \textit{honest-but-curious} server, who executes the learning protocol faithfully but aims to extract private training data from the observed gradients. 
Hence, in each FL iteration, the adversary knows the model weights $\bw$ and the aggregated gradients.
Moreover, we assume the adversary has an auxiliary dataset $\calD_{\rm aux}$, which could be in-distribution or a mixture of in-distribution and out-of-distribution data.
This assumption is similar to the setting in \citet{jeon2021gradient}, which assumes a generative model trained on in-distribution data, and is common in the study of other privacy attacks such as membership inference~\citep{shokri2017membership}.

In this paper, we focus on attacking the defense mechanisms ($\mathrm{DM}$) considered in prior work~\citep{zhu2019deep, jeon2021gradient}. Thus, we assume the adversary receives the aggregated gradients $\sum_{i=1}^B \mathrm{DM}\left[\nabla_{\bw}\ell(f_{\bw}(\bx_i), y_i)\right]$ under one of the following $\mathrm{DM}$ settings:
\begin{enumerate}[leftmargin=*,nosep]
    \item \emph{Gradient without defense.} The gradient before the aggregation is the original gradient without any defense. Most previous papers focus on this common setting.
    \item \emph{Sign compression}~\citep{bernstein2018signsgd} applies an element-wise sign function to the gradient before aggregation, which compresses the gradient to \emph{one bit per dimension}.
    \item \emph{Gradient pruning with pruning rate $\alpha$}~\citep{aji2017sparse} zeroes out the bottom $\alpha$ fraction of coordinates of $\nabla_{\bw}\ell(f_{\bw}(\bx), y)$ in terms of absolute value, which effectively compresses the gradient to $(1-\alpha) m$ dimensions, where $m$ denotes the model size.
    \item \textit{Gradient perturbation with Gaussian standard deviation $\sigma$}~\citep{abadi2016deep} is a differentially private mechanism used commonly for training private models. A Gaussian random vector $\calN(\mathbf{0}, \sigma^2I)$ is added to the gradient, which one can show achieves $\epsilon$-local differential privacy~\citep{kasiviswanathan2011can} with $\epsilon = O(1/\sigma)$.
\end{enumerate}
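
The defended, aggregated gradient seen by the adversary can be illustrated with a minimal pure-Python sketch. It is not the implementation used in the experiments: the function names (\texttt{sign\_compress}, \texttt{prune}, \texttt{perturb}, \texttt{aggregate}) are hypothetical, gradients are represented as flat lists of floats, and \texttt{alpha} is assumed to be the fraction of coordinates that is zeroed out, so that $(1-\alpha)m$ coordinates survive.

```python
import random

def sign_compress(grad):
    """Element-wise sign: each coordinate becomes -1, 0, or +1."""
    return [(g > 0) - (g < 0) for g in grad]

def prune(grad, alpha):
    """Zero out the alpha fraction of coordinates with smallest magnitude.

    Assumption: alpha is the fraction removed, so (1 - alpha) * m survive.
    """
    num_pruned = int(alpha * len(grad))
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]))
    pruned = set(order[:num_pruned])
    return [0.0 if i in pruned else g for i, g in enumerate(grad)]

def perturb(grad, sigma, rng):
    """Add independent N(0, sigma^2) noise to every coordinate."""
    return [g + rng.gauss(0.0, sigma) for g in grad]

def aggregate(per_sample_grads, defense):
    """Apply the defense to each per-sample gradient, then sum over the batch."""
    defended = [defense(g) for g in per_sample_grads]
    return [sum(column) for column in zip(*defended)]
```

For example, \texttt{aggregate(grads, lambda g: prune(g, 0.9))} would model gradient pruning at rate $0.9$, and \texttt{aggregate(grads, lambda g: g)} the setting without defense.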


% for each sample $(\bx,y)$ in the batch.

% Optimization-based gradient inversion attacks require great effort from the attacker when they are developed for different kinds of data. There are three aspects of effort:
% \begin{enumerate}[leftmargin=*,nosep]
%     \item It takes effort to manually design a suitable objective function to encode the gradient match, i.e. the distance between the true data gradient and the gradient from the reconstructed data. Different objective functions lead to very different attack result. For example, \citep{geiping2020inverting} shows that the cosine distance is much better than the euclidean distance.
%     \item The encoding of data prior knowledge needs careful design. \citep{geiping2020inverting, yin2021see, jeon2021gradient} carefully encode the image prior to their objective functions, by the total variation regularization, batch normalization stats and generative models. \citep{deng2021tag} optimizes the word embedding instead of the discrete word token for language tasks.
%     \item The optimization process may have issues such as getting stuck in local minima, divergence, etc. For instance, the optimization in \citep{zhu2019deep} is very unstable and needs several restart to get a good solution in the end.
% \end{enumerate}
% Our learning-based gradient inversion attack introduced next bypasses all these inconvenience with the supervision of the auxilary dataset $\calD_{\rm aux}$. 

\subsection{Learning to invert (LTI)}
\label{sec: method}
\textbf{Definition of the learning problem.} 
Knowing the model weights and the defense mechanism $\mathrm{DM}$, the adversary can generate the gradient $\mathrm{grad}_{S_B}^{\rm DM}=\sum_{i=1}^B\mathrm{DM}\left[\nabla_{\bw}\ell(f_{\bw}(\bx^{\rm aux}_i), y^{\rm aux}_i)\right]$ for any batch of samples $S_B = \{(\bx^{\rm aux}_1, y^{\rm aux}_1), \ldots, (\bx^{\rm aux}_B, y^{\rm aux}_B)\}$ drawn from the auxiliary dataset.
This allows the adversary to learn a \emph{gradient inversion model} $g_{\theta}: \bbR^m\to \bbR^{B\times d}$, parameterized by $\theta$, where $d$ denotes the data dimension, to predict the batch of data points $S_B$ from the aggregated gradient $\mathrm{grad}_{S_B}^{\rm DM}$.
The learning goal is to minimize the reconstruction error $\ell^{attack}$ of $g_{\theta}$ on the auxiliary dataset $\calD_{\rm aux}$:
\begin{equation}
    \label{eq:obj}
%    \min_{\theta}\sum_{(\bx^{\rm aux}, y^{\rm aux})\in\calD^{\rm aux}}\ell^{ attack}\left(g_{\theta}\left(\mathrm{grad}\right), (\bx^{\rm aux}, y^{\rm aux}) \right),
    \min_{\theta}\bbE_{S_B\sim\calD_{\rm aux}}\ell^{ attack}\left(g_{\theta}\left(\mathrm{grad}_{S_B}^{\rm DM}\right), S_B \right).
    \end{equation}
% where $S_B\sim\calD^{\rm aux}$ means each data point in $S_B$ is independently uniformly sampled from the auxiliary dataset $\calD^{\rm aux}$.

We hereby explain the choice of the loss function $\ell^{attack}$ and the inversion model $g_{\theta}$.
Since $g_{\theta}$ needs to reconstruct data in batches, $\ell^{attack}$ should be permutation invariant with respect to the order of the samples in $S_B$.
A common solution \citep{zhang2019deep} is to define $\ell^{attack}$ as:
\begin{align}
\label{eq:obj_single}
	&\ell^{attack}\left(g_{\theta}\left(\mathrm{grad}_{S_B}^{\rm DM}\right), S_B \right)\nonumber \\
	&= \min_{\pi}\sum_{i=1}^B \ell^{attack}_{single}
	\left(g_{\theta}\left(\mathrm{grad}_{S_B}^{\rm DM}\right)_i, \left(\bx^{\rm aux}_{\pi(i)}, y^{\rm aux}_{\pi(i)}\right)\right),
\end{align}
where the minimization is over all permutations $\pi$ of $\{1, \ldots, B\}$.
$\ell^{attack}_{single}$ is the loss function for a single pair of prediction and target data.
In practice, $\ell^{attack}_{single}$ can be a cross-entropy loss for discrete inputs or an $L_2$ loss for continuous-valued inputs.
As for the choice of the inversion model $g_{\theta}$, we empirically find that a multi-layer perceptron (MLP)~\citep{bishop1995neural} is sufficiently effective for the tasks in our experiments.
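
The permutation-invariant loss described above can be sketched as follows. This is only an illustration under stated assumptions: the names \texttt{l2\_single} and \texttt{attack\_loss} are hypothetical, predictions and targets are plain lists of floats, and the minimum over $\pi$ is found by brute-force enumeration, which is feasible only for small batch sizes $B$.

```python
from itertools import permutations

def l2_single(pred, target):
    """Squared L2 distance between one predicted sample and one target sample."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def attack_loss(preds, targets, single_loss=l2_single):
    """Minimum total loss over all assignments of predictions to targets.

    Enumerates all B! permutations, so this is a sketch for small B only.
    """
    batch = len(preds)
    return min(
        sum(single_loss(preds[i], targets[pi[i]]) for i in range(batch))
        for pi in permutations(range(batch))
    )
```

Because the minimum is taken over every matching, reordering the targets leaves the loss unchanged. For larger batches, a polynomial-time assignment solver such as the Hungarian algorithm would be the usual substitute for the enumeration.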

\textbf{Comparison to optimization-based attacks.} LTI is simpler than optimization-based methods and generalizes more readily across settings, in two respects.
First, LTI has no explicit term encoding a data prior; it learns the properties of the data from the auxiliary dataset.
In contrast, optimization-based attacks usually encode the data prior manually in their objective functions, e.g., the total variation term used by most optimization-based attacks to reconstruct image samples.
Second, LTI needs no careful adaptation to different defense mechanisms: the adversary simply applies $\mathrm{DM}$ when generating the gradients used to train $g_{\theta}$.
% Secondly, when the gradient is applied any defense mechanism, we don't need any further careful adaptation of the learning problem. 
In optimization-based attacks, by contrast, each FL defense mechanism requires a carefully designed objective function for gradient matching.
In \autoref{sec:exp}, we will show that our simple approach is surprisingly effective at circumventing existing defenses for both language and vision data.

%In practice, $\ell^{attack}$ can be cross-entropy for discrete inputs or mean-squared loss (for continuous-valued inputs), and we empirically find that a multi-layer perceptron (MLP)~\citep{bishop1995neural} for $g_{\theta}$ is sufficiently effective. Importantly, when a defense mechanism (e.g., gradient perturbation or gradient compression) is applied, we can apply the same transformation to $\nabla_{\bw}\ell(f_{\bw}(\bx^{\rm aux}), y^{\rm aux})$ to augment the training data for $g_\theta$ to carry out an \emph{adaptive attack}. We will show in \autoref{sec:exp} that this simple approach is surprisingly effective at circumventing existing defenses.
%As we demonstrate in \autoref{sec:exp}, this surprisingly simple approach is actually more versatile and powerful than existing optimization-based attacks. It can naturally handle many different data types and can even adapt to existing compression-based defenses, which can be readily applied to the auxiliary data (or generated gradients) during training. 
% With the learned model, the attacker can infer the private data $(\bx, y)$ by $g_{\theta}\left(\nabla_{\bw}\ell(f_{\bw}(\bx^{\rm aux}), y^{\rm aux})\right)$. The potential advantages of the learning-based gradient inversion are summarized as
% \begin{enumerate}[leftmargin=*,nosep]
% \item The model will learn to extract the useful information from gradient instead of manually designing the loss for gradient matching. 
% \item It naturally handles different type of data, such as discrete or continuous, by setting the loss function $\ell^{\rm att}$ as cross entropy loss or mean squared error, while the discrete data is much more unfriendly than the continuous data when designing optimization-based methods. 
% \item It inexplicitly learns the data domain so that when the exchanged gradient is heavily compressed or noisy, it can still have an accurate prediction located in the data domain. 
% Differently, the optimization-based method likely ends by some reconstructed data that produces mostly the same compressed or noisy gradient, but is far away from the data domain.  
% \end{enumerate}

\textbf{Dimensionality reduction for large models $f_{\bw}$.} One potential problem for LTI is that the gradients $\sum_{i=1}^B\mathrm{DM}\left[\nabla_{\bw}\ell(f_{\bw}(\bx^{\rm aux}_i), y^{\rm aux}_i)\right]$ can be extremely high-dimensional. For example, ResNet20~\citep{he2016deep} for vision tasks has $270K$ parameters and BERT~\citep{devlin-etal-2019-bert} for language tasks has approximately $110M$ trainable parameters. Such high-dimensional input to the model $g_\theta$ can lead to memory issues, as the first layer of the MLP would have $110M \times h$ parameters, where $h$ denotes the size of the first hidden layer.

%We will then meet the memory issue when design the architecture of $g_{\theta}$, because any first linear layer will cause $11M\times h$ million parameters, where the size of hidden neuron $h$ is necessary to be larger than $100$.

To address this issue, we use feature hashing~\citep{weinberger2009feature} to reduce the dimensionality of the input gradient.
In feature hashing, each gradient dimension $i\in[m]$ is randomly assigned to one of $k$ bins, where $k$ is much smaller than the gradient dimension $m$; we denote this assignment by $r(i)\in[k]$.
We then sum up all gradient values in each bin, producing a compressed feature vector of size $k$. % is obtained for the inversion model $g_{\theta}$.
In other words, we project the aggregated gradient $\sum_{i=1}^B\mathrm{DM}\left[\nabla_{\bw}\ell(f_{\bw}(\bx^{\rm aux}_i), y^{\rm aux}_i)\right]$ to $P\left(\sum_{i=1}^B\mathrm{DM}\left[\nabla_{\bw}\ell(f_{\bw}(\bx^{\rm aux}_i), y^{\rm aux}_i)\right]\right)$ using the random projection matrix $P$ given by:
\[
P \in \{0, 1\}^{k\times m} \quad \text{s.t.} \quad \forall i\in[m]:~P_{r(i), i}=1 \text{ and } P_{j, i}=0~~\forall j\neq r(i).
\]
%Notice that because $r(i)$ is sampled, $P$ is a random matrix.
%Once $P$ is sampled, we fix it during the training and inference.
%The projection matrix $P$ can be viewed as a first fixed layer in the neural network that reduces the dimensional with random projections.
By this definition, $P$ is a sparse matrix with $m$ nonzero elements and can therefore be stored in a memory-efficient way.
% If $r(i)$ is implemented with a pseudo-uniform hashing function, $P$ does not need to be stored in memory.
In this way, the memory footprint of $g_\theta$ depends on $k$ rather than on the gradient dimension $m$.
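
The projection by $P$ can be sketched in a few lines of pure Python. This sketch rests on several assumptions: the names \texttt{make\_bin\_assignment} and \texttt{hash\_gradient} are hypothetical, bins are indexed from $0$ rather than $1$, and the assignment $r$ is drawn with a seeded pseudo-random generator. Only the list of $m$ bin indices is stored, which corresponds to the $m$ nonzero entries of $P$.

```python
import random

def make_bin_assignment(m, k, seed=0):
    """Sample r(i) in {0, ..., k-1} for each of the m gradient coordinates."""
    rng = random.Random(seed)
    return [rng.randrange(k) for _ in range(m)]

def hash_gradient(grad, assignment, k):
    """Compute P @ grad by summing the gradient values that fall into each bin."""
    features = [0.0] * k
    for value, bin_index in zip(grad, assignment):
        features[bin_index] += value
    return features
```

The assignment would be sampled once and then reused for every gradient, during both training and inference of $g_\theta$, so that each input coordinate of $g_\theta$ always aggregates the same gradient dimensions.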

%\chuan{What about removing the first/last layers?}

%\textbf{Dependence on in-distribution data.} Since training the inversion model $g_\theta$ requires access to an auxiliary dataset $\calD^{\rm aux}$, a natural question to ask is \emph{how well can the model generalize when $\calD^{\rm aux}$ does not accurately reflect a client's actual training data?} Such a setting can arise either if the client has outlier samples or if public in-distribution data is scarce.