\section{Experiment}
\label{sec:exp}
We evaluate LTI on both vision and language tasks.
The evaluation results demonstrate that it vastly outperforms prior gradient inversion attacks, especially when \emph{gradient defenses are applied}. 
Moreover, we show that LTI performs surprisingly well even when the auxiliary data is out-of-distribution, which makes LTI more applicable in realistic scenarios\footnote{Our code is released at \url{https://github.com/wrh14/Learning_to_Invert}.}.
%the success of LTI is robust to both size and distribution shift of the auxiliary dataset $\calD_{\rm aux}$.

%\subsection{Experiment Setup}
%
%\subsubsection{Defense Mechanisms}
%We consider the following defense mechanisms evaluated in prior work~\citep{zhu2019deep, jeon2021gradient}:
%\begin{itemize}[leftmargin=*,nosep]
%    % \item \emph{None.} The gradient shared between the server and clients is the full gradient without any defense. This is the most common setting that previous papers focus on.
%    \item \emph{Gradient aggregation with $B$ clients.}~\citep{bonawitz2016practical} uses a secure protocol to aggregate the gradients from clients, and only sends the averaged gradients to the server.
%    \item \emph{Sign compression}~\citep{bernstein2018signsgd} applies a element-wise sign function to the gradients, and compress the gradient to \emph{one bit per element}.
%    \item \emph{Gradient pruning with pruning rate $\alpha$}~\citep{aji2017sparse} zeroes out the bottom $1-\alpha$ fraction of coordinates of $\nabla_{\bw}\ell(f_{\bw}(\bx), y)$ in terms of absolute value, which effectively compresses the gradient to $(1-\alpha) m$ dimensions.
%    \item \textit{Gradient perturbation with Gaussian standard deviation $\sigma$}~\citep{abadi2016deep} is a differentially private mechanism used commonly for training private models with SGD. An i.i.d. Gaussian random vector $\calN(0, \sigma^2)$ is added to the gradient, which one can show achieves $\epsilon$-local differential privacy~\citep{kasiviswanathan2011can} with $\epsilon = O(1/\sigma)$.
%\end{itemize}
%Under the 4 gradient settings above, we are going to make our experiment on both the vision data and the language data. 
%The data of these two tasks are very different: the images are continuous, while the language sequences lie in the discrete token space.
%Significant results on both vision and language data will show the generality our method cross different data and tasks.

\subsection{Evaluation on Vision Task}
\label{sec:exp_vision}
\paragraph{Federated learning tasks.} To evaluate LTI on vision tasks, we experiment with image classification on CIFAR10~\citep{krizhevsky2009learning}, using the cross-entropy loss as the training loss.
The \textit{original test split of CIFAR10} is used for FL training.
To test generality, we run the attacks against two different architectures for the FL model $f_{\bw}$: LeNet~\citep{lecun1998gradient} and ResNet20~\citep{he2016deep}, with $\sim 15$K and $\sim 270$K parameters, respectively.

\paragraph{Defense mechanisms set-up.} The adversary receives the gradient aggregated from $B=1$ or $B=4$ clients under one of four settings: no defense, sign compression, gradient pruning ($\alpha=0.99$), or Gaussian perturbation ($\sigma=0.1$).
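
The following is one way to formalize these settings. It is a sketch rather than the exact implementation: the symbols $\mathbf{u}$, $\tilde{\mathbf{u}}$, $\tau_\alpha$, and $\xi_j$ are introduced here for illustration, and we assume that the defense is applied to the aggregated gradient and that $\alpha$ denotes the fraction of pruned coordinates. Let $\mathbf{u} = \frac{1}{B}\sum_{i=1}^{B} \nabla_{\bw}\ell(f_{\bw}(\bx_i), y_i)$ be the $m$-dimensional aggregated gradient and let $\tau_\alpha$ be the $\alpha$-quantile of $\{|u_j|\}_{j=1}^{m}$. The adversary observes $\tilde{\mathbf{u}}$ with coordinates
\[
\begin{array}{ll}
\mbox{sign compression:} & \tilde{u}_j = \mathrm{sign}(u_j), \\
\mbox{gradient pruning:} & \tilde{u}_j = u_j \cdot \mathbf{1}[\,|u_j| \geq \tau_\alpha\,], \\
\mbox{Gaussian perturbation:} & \tilde{u}_j = u_j + \xi_j, \quad \xi_j \sim \mathcal{N}(0, \sigma^2).
\end{array}
\]
Under this reading, for LeNet with $m \approx 15$K and $\alpha = 0.99$, only about $150$ coordinates of the pruned gradient remain nonzero, while sign compression retains a single bit per coordinate.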

\paragraph{Baselines.}
We compare our method with two gradient inversion attack baseline methods: \emph{Inverting Gradients} (IG; \citet{geiping2020inverting}), a representative optimization-based method with limited data prior, and \emph{Gradient Inversion with Generative Image Prior} (GI-GIP; \citet{jeon2021gradient}), the state-of-the-art optimization-based method that uses a generative model to encode the data prior. We make minor modifications to these attacks to adapt them to various defenses; see appendix for details. The threat model of LTI is most similar to GI-GIP since both use an auxiliary dataset to encode the data prior.

\paragraph{Set-up of LTI.} We introduce the training set-up of LTI. %Notice that this is different from the training set-up of federate learning tasks.
\begin{itemize}[leftmargin=*,nosep]
    \item \textit{Auxiliary dataset.} We use the \textit{original train split of CIFAR10} as the auxiliary dataset of the adversary. Notice that under this set-up, the auxiliary dataset is different from the dataset that the FL tasks are trained on, i.e. the one from which the aggregated gradients are computed.
    \item \textit{Inversion model architecture.} Our inversion model $g_{\theta}$ is a three-layer MLP with hidden size 3K or 10K, depending on memory constraints.
    The MLP takes the flattened gradient vector as input and outputs a $B\times 3072$-dimensional vector representing the flattened images. 
    Because ResNet20 is large, we use feature hashing (see \autoref{sec: method}) to reduce the target model gradient to $50\%$ of its original dimensionality before feeding it to the inversion model.
    \item \textit{Training details.} The training objective $\ell^{attack}_{single}$ in \autoref{eq:obj_single} is the mean squared error (MSE) between the output vector of the MLP and the flattened ground-truth image. We train $g_\theta$ with the Adam~\citep{kingma2014adam} optimizer for $200$ epochs with batch size $256$. The initial learning rate is $10^{-4}$ and is dropped to $10^{-5}$ after $150$ epochs.
    \item \textit{Computation cost.} Our experiments are conducted using NVIDIA GeForce RTX 2080 GPUs and each training run takes about 1.5 hours.
\end{itemize}
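
To make the memory constraint concrete, the following back-of-the-envelope count may help. It is a rough sketch that assumes fp32 weights and counts only the first (largest) layer of $g_\theta$; the pairing of hidden sizes with FL models below is an illustrative assumption, not a reported configuration. Each CIFAR10 image has $3 \times 32 \times 32 = 3072$ entries, which gives the $B \times 3072$ output dimension. With feature hashing at $50\%$, the ResNet20 gradient of $\sim 270$K coordinates becomes an input of $\sim 135$K coordinates, so with hidden size 3K the first layer alone has
\[
135\mathrm{K} \times 3\mathrm{K} \approx 4.05 \times 10^{8} \ \mbox{weights} \approx 1.6\ \mbox{GB in fp32},
\]
whereas hidden size 10K would require about $1.35 \times 10^{9}$ weights ($\approx 5.4$ GB) before accounting for gradients and optimizer state. For LeNet, an unhashed input of $\sim 15$K coordinates with hidden size 10K gives about $1.5 \times 10^{8}$ first-layer weights ($\approx 0.6$ GB).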

\paragraph{Evaluation methodology.}
We evaluate LTI and the aforementioned baselines on $1{,}000$ random images from the CIFAR10 test split. 
To measure reconstruction quality, we use four common metrics: 1. \emph{Mean squared error} (MSE) measures the average pixel-wise (squared) distance between the reconstructed image and the ground truth image. 2. \emph{Peak signal-to-noise ratio} (PSNR) measures the ratio, on a logarithmic scale, between the squared maximum image pixel value and MSE. 3. \emph{Learned perceptual image patch similarity} (LPIPS) measures distance in the feature space of a VGG~\citep{simonyan2014very} model trained on ImageNet. 4. \emph{Structural similarity index measure} (SSIM) measures the perceived change in structural information.
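
For reference, the first two metrics can be written as follows. These are the standard definitions; the dimension $d$, the reconstruction $\hat{\bx}$, and the pixel range $[0,1]$ are notational assumptions introduced here. For a ground-truth image $\bx \in [0,1]^d$ and a reconstruction $\hat{\bx}$,
\[
\mathrm{MSE}(\hat{\bx}, \bx) = \frac{1}{d}\,\|\hat{\bx} - \bx\|_2^2,
\qquad
\mathrm{PSNR}(\hat{\bx}, \bx) = 10 \log_{10} \frac{1}{\mathrm{MSE}(\hat{\bx}, \bx)}.
\]
Under this convention, a single image reconstructed with MSE $0.1$ has a PSNR of $10$ dB, and one reconstructed with MSE $0.01$ has a PSNR of $20$ dB. Lower MSE and LPIPS, and higher PSNR and SSIM, indicate a better reconstruction.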

% \begin{table*}[t!]
% \centering
%     \subfloat[The FL model $f_{\bw}$ is LeNet.]{
%     \centering	
%     \resizebox{0.95\linewidth}{!}{
%     \begin{tabular}{c|cccc|cccc}
%     \toprule
%      & \multicolumn{4}{c|}{$B=1$}& \multicolumn{4}{c}{$B=4$}\\
%      \midrule
%         \textbf{Defense}& \textbf{None} & \textbf{Sign Comp.} & \textbf{Grad. Prun.} & \textbf{Gauss. Pert.} & \textbf{None} & \textbf{Sign Comp.} & \textbf{Grad. Prun.} & \textbf{Gauss. Pert.}\\ 
%         \midrule
%         IG         & 0.022          & 0.116          & 0.138          & 0.150          & 0.105          & 0.265          & 0.169          & 0.206          \\
% GI-GIP      & \textbf{0.001} & 0.091          & 0.043          & 0.124          & \textbf{0.009} & 0.082          & 0.180          & 0.157          \\
% LTI (Ours) & 0.004          & \textbf{0.014} & \textbf{0.029} & \textbf{0.012} & 0.015          & \textbf{0.023} & \textbf{0.031} & \textbf{0.026}\\
%      \bottomrule
%     \end{tabular}%
%     }
%     }
%     \\
%     \subfloat[The FL model $f_{\bw}$ is ResNet20.]{
%     \centering	
%     \resizebox{0.95\linewidth}{!}{
%     \begin{tabular}{c|cccc|cccc}
%     \toprule
%      & \multicolumn{4}{c|}{$B=1$}& \multicolumn{4}{c}{$B=4$}\\
%      \midrule
%         \textbf{Defense}& \textbf{None} & \textbf{Sign Comp.} & \textbf{Grad. Prun.} & \textbf{Gauss. Pert.} & \textbf{None} & \textbf{Sign Comp.} & \textbf{Grad. Prun.} & \textbf{Gauss. Pert.}\\ 
%         \midrule
%         IG         & 0.120 & 0.154          & 0.171          & 0.133          & 0.125          & 0.272          & 0.195          & 0.123          \\
% GI-GIP      & 0.062                         & 0.099          & 0.238          & 0.233          & 0.086          & 0.236          & 0.231          & 0.229          \\
% LTI (Ours) & \textbf{0.018}                & \textbf{0.013} & \textbf{0.023} & \textbf{0.021} & \textbf{0.038} & \textbf{0.035} & \textbf{0.038} & \textbf{0.039}\\
        
%      \bottomrule
%     \end{tabular}%
%     }
%     }
%     \caption{MSE for baselines (IG and GI-GIP) and our method LTI on CIFAR10. Table (a) and (b) show the results for two different $f_{\bw}$ architectures LeNet and ResNet20 respectively. See text for details} %.} %Bring it back when needing more texts:  As shown in the table, neither IG nor GIGIP work well when the defense mechanism is applied, while our method has the power to break the privacy protection from the compression and randomness. The tables for PSNR and LPIPS in the appendix show the similar conclusion.
%     \label{tab:res_cv}
%     \vspace{-2ex}
% \end{table*}

\begin{table*}[t!]
\centering
\caption{MSE for baselines (IG and GI-GIP) and our method LTI on CIFAR10. Neither IG nor GI-GIP works well when a defense mechanism is applied, whereas LTI still reconstructs the images despite the compression and randomness introduced by the defenses.}
    \label{tab:res_cv}

    \resizebox{\linewidth}{!}{
    \begin{tabular}{c|c|cccc|cccc}
    \toprule
     \multirow{2}{*}{\textbf{FL model}} & \multirow{2}{*}{\textbf{Methods}} & \multicolumn{4}{c|}{$B=1$}& \multicolumn{4}{c}{$B=4$}\\
     \cmidrule{3-10}
         & & \textbf{None} & \textbf{Sign Comp.} & \textbf{Grad. Prun.} & \textbf{Gauss. Pert.} & \textbf{None} & \textbf{Sign Comp.} & \textbf{Grad. Prun.} & \textbf{Gauss. Pert.}\\ 
        \midrule
        \multirow{3}{*}{LeNet} & IG         & 0.022          & 0.116          & 0.138          & 0.150          & 0.105          & 0.265          & 0.169          & 0.206          \\
 & GI-GIP      & \textbf{0.001} & 0.091          & 0.043          & 0.124          & \textbf{0.009} & 0.082          & 0.180          & 0.157          \\
 & LTI (Ours) & 0.004          & \textbf{0.014} & \textbf{0.029} & \textbf{0.012} & 0.015          & \textbf{0.023} & \textbf{0.031} & \textbf{0.026}\\
 \midrule
\multirow{3}{*}{ResNet20}  &  IG         & 0.120 & 0.154          & 0.171          & 0.133          & 0.125          & 0.272          & 0.195          & 0.123          \\
 & GI-GIP      & 0.062                         & 0.099          & 0.238          & 0.233          & 0.086          & 0.236          & 0.231          & 0.229          \\
 & LTI (Ours) & \textbf{0.018}                & \textbf{0.013} & \textbf{0.023} & \textbf{0.021} & \textbf{0.038} & \textbf{0.035} & \textbf{0.038} & \textbf{0.039}\\
     \bottomrule
    \end{tabular}%
    }
     % \vspace{-2ex}
\end{table*}


\begin{figure*}[!t]
    \centering
    \includegraphics[width=\linewidth]{sections/figs/image_examples.pdf}
    \caption{Comparison of LTI with IG and GI-GIP for reconstructing 4 random images in CIFAR10 when the FL model is LeNet and $B=1$. Under sign compression, only LTI can partially reconstruct the images to recover the object of interest whereas both IG and GI-GIP fail to do so on most samples.}
    \label{fig:img_example}
\end{figure*}

\subsubsection{Main Results}

\paragraph{Quantitative evaluation.} 
\autoref{tab:res_cv} gives quantitative comparisons in terms of MSE for IG, GI-GIP, and LTI against various defense mechanisms on CIFAR10; tables for PSNR, LPIPS, and SSIM are in the appendix due to space limits. When no defense mechanism is applied, GI-GIP achieves the best performance. This is not surprising because GI-GIP explicitly encodes the image prior in an image generator, which is more tailored to image data than LTI.
However, when the gradient is protected by a defense mechanism, a setting that remains underexplored, both IG and GI-GIP perform considerably worse, with MSE close to or above $0.1$ in most cases. By comparison, LTI outperforms both baselines significantly and consistently across all three defense mechanisms.
%\autoref{tab:res_cv} gives quantitative comparisons in the metric of MSE for IG, GI-GIP, and LTI against various defense mechanisms on CIFAR10; Tables of PSNR and LPIPS are in the appendix due to the space. When no defense is applied, GI-GIP achieves the best performance according to all three metrics, whereas LTI performs almost equally well in terms of MSE and close to that of IG in terms of PSNR and LPIPS. However, when the gradient is augmented with a defense mechanism, both IG and GI-GIP have considerably worse performance with MSE close to $0.1$. By comparison, LTI outperforms both baselines significantly and consistently across all three defense mechanisms.
%Although GI-GIP achieves the best performance when the attacker has the full gradients, all optimization-based baselines turn to be ineffective when the defense strategies are applied.
%Our method (LTI) shows the significant improvement comparing with the baselines on these gradient defense settings.
For example, under gradient perturbation with $\sigma=0.1$, which prior work believed to be sufficient for preventing gradient inversion attacks~\citep{zhu2019deep, jeon2021gradient}, the MSE of LTI can be as low as $0.012$. Our result therefore provides considerable additional insight into the level of empirical privacy achieved by DP-SGD~\citep{abadi2016deep}, and suggests that the theoretical privacy leakage as predicted by DP $\epsilon$ may be tighter than previously thought.
These results validate that LTI adapts well to various settings and can serve as a strong baseline for exposing vulnerabilities in these underexplored settings.

\paragraph{Qualitative evaluation.} \autoref{fig:img_example} shows 4 random CIFAR10 test samples and their reconstructions under different defense mechanisms when the FL model is LeNet and $B=1$. Without any defense in place, all three methods recover a considerable amount of semantic information about the object of interest, with both GI-GIP and LTI faithfully reconstructing the training sample. Under the sign compression defense, IG completely fails to reconstruct all 4 samples, while GI-GIP only successfully reconstructs the second image. In contrast, LTI is able to recover the semantic information in all 4 samples. Results for gradient pruning and gradient perturbation yield similar conclusions. More examples are given in the appendix.

\subsubsection{Ablation Studies for Auxiliary Dataset}
\label{sec:ablation_cv}

Since LTI learns to invert gradients using the auxiliary dataset, its performance depends on the quantity and quality of data available to the adversary. We perform ablation studies to better understand this dependence by changing the auxiliary dataset size and its distribution. All ablation studies are conducted in the setting where the FL model is LeNet and $B=1$.

\paragraph{Varying the auxiliary dataset size.}
We randomly subsample the CIFAR10 training set to construct auxiliary datasets of size $\{500, 5000, 15000, 25000, 35000, 45000, 50000\}$ and evaluate the performance of LTI under various defenses. \autoref{fig:ablation}(a) plots reconstruction MSE as a function of the auxiliary dataset size, which is monotonically decreasing as expected. Moreover, with just $5{,}000$ samples for training the inversion model (second point in each curve), the performance is nearly as good as when training using the full CIFAR10 training set.
Notably, even with the auxiliary dataset size as small as $500$, the reconstruction MSE is \emph{still lower than that of IG and GI-GIP} in \autoref{tab:res_cv}.
Corresponding figures for PSNR, LPIPS, and SSIM in the appendix show similar findings.

\begin{figure*}[t!]
    \centering
    \includegraphics[width=0.8\linewidth]{sections/figs/ablation_vision.pdf}
    \caption{Ablation studies on size and distribution of the auxiliary dataset $\calD_{\rm aux}$. Under both severe data size limitation (left) and data distribution shift ($\beta=0.01$; right), LTI is able to outperform both baselines in \autoref{tab:res_cv} when a defense is applied.}
    \label{fig:ablation}
\end{figure*}

\paragraph{Varying the auxiliary data distribution.}
Although a large set of in-distribution data may be unavailable in practice, the adversary may still collect out-of-distribution samples for the auxiliary dataset. This is beneficial for the adversary since a model trained on out-of-distribution samples may transfer its knowledge to in-distribution data as well. To simulate this scenario, we divide CIFAR10 into two halves with disjoint classes and construct the auxiliary dataset by combining a $\beta$ fraction of samples from the first half and a $1-\beta$ fraction of samples from the second half for $\beta \in \{0, 0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1\}$.
The target model $f_{\bw}$ is trained only on samples from the first half, and hence the auxiliary set has the exact same distribution as the target model's data when $\beta = 1$ and only has out-of-distribution data when $\beta=0$.
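
As a worked example of this mixture, assume the split is taken over the CIFAR10 training set, so that each half holds $5$ classes $\times\ 5{,}000 = 25{,}000$ images; the symbols $\calD_1$ and $\calD_2$ for the two halves are introduced here only for illustration. The auxiliary set then contains
\[
|\calD_{\rm aux}| = \beta\,|\calD_1| + (1-\beta)\,|\calD_2| = 25{,}000
\]
samples regardless of $\beta$, of which $\beta \cdot 25{,}000$ are in-distribution. For instance, $\beta = 0.01$ yields $250$ in-distribution images and $24{,}750$ out-of-distribution images.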

\autoref{fig:ablation}(b) shows reconstruction MSE as a function of $\beta$. We make the following observations: %; corresponding figures for PSNR, LPIPS and SSIM are given in the appendix
\begin{enumerate}[leftmargin=*,nosep]
    \item Even if the auxiliary dataset only contains $250$ in-distribution samples ($\beta = 0.01$; second point in each curve), MSE of the inversion model is \emph{still lower than that of the best baseline} in \autoref{tab:res_cv}. For example, with the sign compression defense, LTI attains an MSE of $\leq 0.02$, which is much lower than the MSE of $0.116$ for IG and $0.091$ for GI-GIP. 
    \item When the auxiliary dataset contains only out-of-distribution data ($\beta=0$), the inversion model has a very high reconstruction MSE. In the next paragraph, we will propose a data augmentation method to improve the out-of-distribution generalization.
\end{enumerate}

\begin{table}
\centering
\caption{MSE of LTI when the auxiliary dataset is out-of-distribution (FL model is LeNet, $B=1$). LTI-OOD outperforms GI-GIP under sign compression and Gaussian perturbation and is comparable under gradient pruning.}
	\label{tab:ood}
\resizebox{\linewidth}{!}{
	\begin{tabular}{c|cccc}
	\toprule
		& None & Sign Comp. & Grad. Prun. & Gauss. Pert.\\
	\midrule
		LTI-OOD & 0.015 & 0.036 & 0.045 & 0.029\\
		GI-GIP & 0.001 & 0.091 & 0.043 & 0.124\\
	\bottomrule
	\end{tabular}
	}
	
\end{table}

\paragraph{Out-of-distribution (OOD) auxiliary data.} We further consider an auxiliary dataset that contains only out-of-distribution data. Suppose the auxiliary data consists of images from the second half of the CIFAR10 classes and the target model $f_{\bw}$ is trained only on images from the first half (i.e., the $\beta=0$ setting of the data distribution study). Instead of running LTI with only the out-of-distribution data, we augment the auxiliary dataset with the following steps:
\begin{enumerate}[leftmargin=*,nosep]
	\item Convert OOD data into the frequency domain by the discrete cosine transform (DCT).
	\item Compute the mean and variance of OOD data in the DCT space.
	\item Sample new data from a Gaussian with the mean and variance computed in step 2.
	\item Convert new data back to the original image space.
\end{enumerate}
We then train LTI on the OOD data together with the augmented data from the steps above, and refer to this method as LTI-OOD.
Table \ref{tab:ood} presents its MSE.
Compared with the baselines in Table \ref{tab:res_cv}, LTI-OOD is comparable to or better than them when a defense mechanism is applied.
LTI-OOD is worse than GI-GIP when no defense mechanism is applied, but this is expected: GI-GIP uses in-distribution data, which is a stronger data assumption than that of LTI-OOD.

To better understand this data augmentation, we also test a variant that fits the Gaussian in the original image space instead; in this case, the MSE increases from 0.015 to 0.045 when no defense is applied.
We hypothesize that fitting a Gaussian in the DCT domain preserves the frequency characteristics of natural images, so that the sampled distribution is closer to the target image distribution.
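
The four augmentation steps can be summarized as follows. This is a sketch under stated assumptions: $T$ denotes the DCT, $n$ is the number of OOD images, the square is taken element-wise, and the DCT coefficients are treated as independent (a diagonal Gaussian); the text above specifies only a mean and a variance, and does not state how the DCT is applied across channels.
\[
z_i = T(\bx_i), \qquad
\mu = \frac{1}{n}\sum_{i=1}^{n} z_i, \qquad
s^2 = \frac{1}{n}\sum_{i=1}^{n} (z_i - \mu)^2, \qquad
\tilde{z} \sim \mathcal{N}\big(\mu, \mathrm{diag}(s^2)\big), \qquad
\tilde{\bx} = T^{-1}(\tilde{z}).
\]
The image-space variant described above corresponds to replacing $T$ with the identity map.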

\subsection{Evaluation on Language Task}
\label{sec:exp_language}

\paragraph{Federated learning tasks.} For the evaluation on language data, we consider two common language tasks: text classifier training and causal language model training\footnote{We follow the task setup and code in \url{https://github.com/JonasGeiping/breaching}.}. 

For text classification, the classifier $f_{\bw}$ is the BERT model~\citep{devlin-etal-2019-bert} with a \emph{frozen token embedding layer}. Freezing the token embedding layer is a common technique for language model fine-tuning~\citep{sun2019fine}, which also has privacy benefits since direct privacy leakage from the gradient magnitude of the token embedding layer can be prevented~\citep{fowl2022decepticons, gupta2022recovering}.
As a result, the trainable model contains about $86$M parameters.
The BERT classifier is trained on the CoLA~\citep{warstadt2018neural} dataset using the cross-entropy loss.

For causal language modeling, the language model $f_{\bw}$ is a three-layer transformer~\citep{vaswani2017attention} with a \emph{frozen token embedding layer}. The trainable model contains about $1.1$M parameters.
We train the language model on WikiText~\citep{merity2016pointer}, where each training sample is limited to $L=16$ tokens and the language model is trained to predict the next token $\bx_l$ given $\bx_{:l-1}$ for $l=1,\ldots,L$ using the cross-entropy loss.
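
Written out, one plausible form of this objective is given below; the averaging over positions and the symbol $p_{\bw}$ for the model's predictive distribution are assumptions made for illustration:
\[
\ell(f_{\bw}(\bx)) = -\frac{1}{L} \sum_{l=1}^{L} \log p_{\bw}\big(\bx_l \mid \bx_{:l-1}\big).
\]
In this form, every input token also serves as a prediction target, so the labels of the loss coincide with the input sequence itself.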


We use the \emph{original test split of the CoLA / WikiText dataset} as the dataset for FL training, i.e., the dataset on which the attacks are tested.

\paragraph{Defense mechanisms set-up.} With $B=1$, the adversary receives the gradient under one of four settings: no defense, sign compression, gradient pruning ($\alpha=0.99$), or Gaussian perturbation ($\sigma=0.001$ for the text classifier training task and $\sigma=0.01$ for the causal language model training task).

\paragraph{Baseline.}
We compare LTI with TAG~\citep{deng2021tag}---the state-of-the-art language model gradient inversion attack without utilizing the token embedding layer gradient\footnote{We do not compare against a more recent attack by \citet{gupta2022recovering} since it crucially depends on access to the token embedding layer gradient.}. 
The objective function for TAG is a slight modification of \autoref{eq:opt_objective} that uses both the $\ell_2$ and $\ell_1$ distance between the observed gradient and the gradient of dummy data. We also modify TAG slightly to adapt it to different defenses; see appendix for details.
%As \citet{gupta2022recovering} relies on the gradient of word embedding, it is not applicable here.
%See the set-up details in the appendix.
%Because the world embedding in our task setting is fixed, 

\paragraph{Set-up of LTI.} We follow the setup below for training the gradient inversion model $g_\theta$.
\begin{itemize}[leftmargin=*,nosep]
    \item \textit{Auxiliary dataset.} We use $8{,}551$ samples from the train split of CoLA or $\sim 1.8\times 10^5$ samples from the train split of WikiText as the auxiliary dataset. 
    %In addition, we introduce a weaker variant of our attack that only assumes knowledge of the \emph{marginal token distribution} for the language model training data. Instead of using the WikiText train split as auxiliary data, we sample random tokens according to the marginal token distribution to generate \emph{pseudo-data} for training the inversion model. We show that this variant, which we denote LTI-P, can even outperform LTI with in-distribution auxiliary data due to access to infinite training data.
    \item \textit{Inversion model architecture.}
    For both FL tasks, we train a two-layer MLP with ReLU activation, first hidden-layer size $600$, and second hidden-layer size $1{,}000$. The inversion model outputs $L$ probability vectors, each with size equal to the vocabulary size ($\sim 50{,}000$), and we train it using the cross-entropy loss to predict the $L$ tokens given the target model gradient.
    We use feature hashing (see \autoref{sec: method}) to reduce the target model gradient to $1\%$ of its original dimensionality when $f_{\bw}$ is BERT and to $10\%$ when $f_{\bw}$ is the three-layer transformer, and use the result as input to the inversion model.
    \item \textit{Training details.} We use Adam~\citep{kingma2014adam} to train the inversion model over $100$ epochs with batch size $64$. Learning rates are selected separately for each defense from $\{10^{-3}, 10^{-4}, 10^{-5}\}$.
    \item \textit{Computation cost.} Our experiments are conducted using NVIDIA GeForce RTX 3090 GPUs and each training run takes about 3 hours.
\end{itemize}

\paragraph{Evaluation methodology.} We evaluate LTI and the TAG baseline on $1{,}000$ samples from each task. To measure the quality of the text recovered by each attack, we use four metrics: \emph{Accuracy$(\%)$} measures the average token-wise zero-one accuracy, while \emph{Rouge-1$(\%)$}, \emph{Rouge-2$(\%)$}, and \emph{Rouge-L$(\%)$} measure the overlap of unigrams, bigrams, and the longest common subsequence, respectively, between the ground truth and the reconstructed text. 
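
To illustrate how these metrics can diverge, consider a hypothetical example constructed here (not taken from the experiments), using whitespace tokenization and recall-based Rouge scores. Token-wise accuracy for a reconstruction $\hat{\bx}$ of a length-$L$ sequence $\bx$ is
\[
\mathrm{Acc}(\hat{\bx}, \bx) = \frac{1}{L} \sum_{l=1}^{L} \mathbf{1}[\hat{\bx}_l = \bx_l].
\]
Let the ground truth be ``the cat sat on the mat'' and the reconstruction be ``the cat on the mat sat''. Only the first two positions agree, so the accuracy is $2/6 \approx 33.3\%$. All six ground-truth unigrams appear in the reconstruction, so Rouge-1 is $100\%$. Three of the five ground-truth bigrams (``the cat'', ``on the'', ``the mat'') are recovered, so Rouge-2 is $60\%$. The longest common subsequence, ``the cat on the mat'', has length $5$, so Rouge-L is $5/6 \approx 83.3\%$. Since both sequences have the same length, precision and recall coincide in this example. A reconstruction can therefore score high on Rouge-1 yet low on accuracy when the correct tokens are recovered at the wrong positions.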

We also inspect the texts reconstructed by both TAG and LTI to see how well the semantic meaning is recovered and to analyze the types of reconstruction error; this analysis is deferred to the appendix.

\begin{table*}[t!]
\centering
\caption{Results for gradient inversion attack on two language tasks. The overall trend is remarkably consistent: in all 4 metrics, LTI significantly outperforms TAG across different settings (7 out of 8). This shows that our method is easily adapted and is able to achieve great attack performance.}
    \label{tab:res_nlp}
	\subfloat[{\normalsize Text classifier training on CoLA dataset.}]{
    \centering
    % \resizebox{0.78\linewidth}{!}{
    \begin{tabular}{c|cccc|cccc}
    \toprule
        \textbf{Defense}& \multicolumn{4}{c|}{\textbf{None}} & \multicolumn{4}{c}{\textbf{Sign Compression}} \\
        \midrule
         \textbf{Method}  & Acc.  & Rouge-1 & Rouge-2 & Rouge-L & Acc.  & Rouge-1 & Rouge-2 & Rouge-L  \\
        \midrule
          TAG &  $      8.38$ & $     51.23$ & $      6.88$ & $     29.35$&$      1.62$ & $      8.81$ & $      0.00$ & $      8.09$  \\
          LTI (Ours) &  $     61.87$&$     65.23$&$     44.46$&$     63.34$& $     63.89$&$     69.92$&$     49.79$&$     67.86$  \\
          \midrule
          LTI-OOD (Ours)&	$     52.03$&$     45.86$&$     29.46$&$     45.79$&$     50.77$&$     49.07$&$     30.86$&$     48.80$ \\
     \midrule
     \midrule
         \textbf{Defense}&  \multicolumn{4}{c|}{\textbf{Gradient Pruning} ($\alpha=0.99$)} & \multicolumn{4}{c}{\textbf{Gaussian Perturbation} ($\sigma=0.001$)} \\
        \midrule
         \textbf{Method}  & Acc.  & Rouge-1 & Rouge-2 & Rouge-L & Acc.  & Rouge-1 & Rouge-2 & Rouge-L  \\
        \midrule
         TAG &  $      5.69$ & $     43.30$ & $      6.90$ & $     26.96$&$      5.12$ & $     33.85$ & $      2.94$ & $     22.01$ \\
         LTI (Ours) & $     58.93$&$     60.12$&$     37.96$&$     58.17$&$     53.96$&$     53.09$&$     32.41$&$     52.35$\\
         \midrule 
         LTI-OOD (Ours) & $     38.68$&$     35.66$&$     23.11$&$     35.46$& $     37.96$&$     33.75$&$     21.85$&$     33.55$\\
     \bottomrule
    \end{tabular}%
    % }
	}
	\\
	\subfloat[{\normalsize Causal language model training on WikiText dataset.}]{
    \centering
    % \resizebox{0.78\linewidth}{!}{
    \begin{tabular}{c|cccc|cccc}
    \toprule
        \textbf{Defense}& \multicolumn{4}{c|}{\textbf{None}} & \multicolumn{4}{c}{\textbf{Sign Compression}} \\
        \midrule
        \textbf{Method}  & Acc.  & Rouge-1 & Rouge-2 & Rouge-L & Acc.  & Rouge-1 & Rouge-2 & Rouge-L  \\
        \midrule
         TAG &   $     74.13$ & $     71.92$ & $     50.64$ & $     68.46$& $    100.00$ & $    100.00$ & $    100.00$ & $    100.00$   \\
         LTI (Ours) &  $     89.61$&$     86.91$&$     80.68$&$     86.90$& $71.15$&$     64.35$&$     45.40$&$     64.29$  \\
         \midrule 
         LTI-OOD (Ours) &   $     91.14$&$     89.43$&$     85.11$&$     89.41$& $     88.06$&$     84.66$&$     76.46$&$     84.64$  \\
     \midrule
     \midrule
        \textbf{Defense}&  \multicolumn{4}{c|}{\textbf{Gradient Pruning} ($\alpha=0.99$)} & \multicolumn{4}{c}{\textbf{Gaussian Perturbation} ($\sigma=0.01$)} \\
        \midrule
        \textbf{Method}  & Acc.  & Rouge-1 & Rouge-2 & Rouge-L & Acc.  & Rouge-1 & Rouge-2 & Rouge-L  \\
        \midrule
        TAG & $     34.34$ & $     48.50$ & $     10.21$ & $     35.60$ & $     64.34$ & $     66.19$ & $     37.86$ & $     59.55$  \\
        LTI (Ours) & $     70.80$&$     64.24$&$     45.79$&$     64.15$& $     82.49$&$     78.75$&$     67.06$&$     78.71$\\
        \midrule 
        LTI-OOD (Ours) & $     86.19$&$     82.56$&$     73.04$&$     82.50$& $     90.25$&$     87.39$&$     81.94$&$     87.34$\\
     \bottomrule
    \end{tabular}%
    % }
	}
    
    % \vspace{-2ex}
\end{table*}

%\begin{table*}[t!]
%    \centering
%%    \resizebox{\linewidth}{!}{
%    \begin{tabular}{c|cccc|cccc}
%    \toprule
%        \textbf{Defense}& \multicolumn{4}{c|}{\textbf{None}} & \multicolumn{4}{c}{\textbf{Sign Compression}} \\
%        \midrule
%        \textbf{Method}  & Acc.  & Rouge-1 & Rouge-2 & Rouge-L & Acc.  & Rouge-1 & Rouge-2 & Rouge-L  \\
%        \midrule
%         TAG &   $     74.13$ & $     71.92$ & $     50.64$ & $     68.46$& $      0.00$ & $      0.06$ & $      0.00$ & $      0.06$   \\
%         LTI (Ours) &  $     89.61$&$     86.13$&$     79.53$&$     86.11$& $     71.15$&$     63.17$&$     43.51$&$     63.11$  \\
%         LTI-P (Ours) &   $     91.14$&$     89.43$&$     85.11$&$     89.41$& $     88.06$&$     84.66$&$     76.46$&$     84.64$  \\
%     \midrule
%     \midrule
%        \textbf{Defense}&  \multicolumn{4}{c|}{\textbf{Gradient Pruning} ($\alpha=0.99$)} & \multicolumn{4}{c}{\textbf{Gaussian Perturbation} ($\sigma=0.01$)} \\
%        \midrule
%        \textbf{Method}  & Acc.  & Rouge-1 & Rouge-2 & Rouge-L & Acc.  & Rouge-1 & Rouge-2 & Rouge-L  \\
%        \midrule
%        TAG & $     34.34$ & $     48.50$ & $     10.21$ & $     35.60$ & $     64.34$ & $     66.19$ & $     37.86$ & $     59.55$  \\
%        LTI (Ours) & $     66.79$&$     58.31$&$     37.58$&$     58.21$& $     82.08$&$     76.55$&$     63.38$&$     76.52$\\
%        LTI-P (Ours) & $     86.19$&$     82.56$&$     73.04$&$     82.50$& $     90.25$&$     87.39$&$     81.94$&$     87.34$\\
%     \bottomrule
%    \end{tabular}%
%%    }
%    \caption{Results for gradient inversion attack on text data. Both LTI and LTI-P significantly outperform TAG cross different settings in all 4 metrics, where LTI-P achieves the best result with only access to the marginal token distribution for generating the auxiliary dataset.}%Bring them back for more texts.None of methods work for the sign compression of gradients. Our learning-based gradient inversion method outperforms the TAG cross different settings at all metrics.}
%    \label{tab:res_nlp}
%    \vspace{-2ex}
%\end{table*}

%\begin{figure*}[!t]
%    \centering
%    \includegraphics[width=0.95\linewidth]{sections/figs/text_examples.pdf}
%    \vspace{-1ex}
%    \caption{Ground truth text and their reconstructions for 3 random samples from the WikiText test set. LTI-P significantly outperforms TAG both with and without defenses, especially under sign compression where TAG fails to recover any token while LTI-P is capable of recovering almost half of the tokens in each sample.} %The tokens matched to the true text at the same positions are highlighted in blue. TAG predicts many tokens that don't appear in the true text and also sometimes predicts the correct tokens but in the wrong order. In contrast, our method predicts more correct tokens and they are all in the correct positions.
%    \label{fig:text_example}
%    \vspace{-3ex}
%\end{figure*}

\paragraph{Results.} 
\autoref{tab:res_nlp} compares LTI and TAG quantitatively under various defenses. The overall trend is consistent: LTI significantly outperforms TAG on all four metrics in 7 out of 8 settings. This shows that our method adapts readily to discrete language data and to different defenses while still achieving strong attack performance.

One observation is that the accuracy of the inverted texts is overall much higher when the FL task is causal language model training than when it is text classifier training. We hypothesize that this is because, in causal language modeling, the labels in the cross-entropy loss are the input sequence itself, and prior work \citep{yin2021see, zhao2020idlg} shows how easy it is to reconstruct the labels.

Another observation is that, although TAG performs relatively poorly in most settings, it achieves perfect accuracy under sign compression when the FL task is causal language model training. At first glance, this perfect accuracy looks suspicious. After careful checking, our explanation is as follows: if we view the objective function used under sign compression as a special objective function for the setting where the adversary receives the full gradient, the result simply suggests that this special objective is coincidentally better than the one designed for the full gradient. Nevertheless, this phenomenon does not generalize to TAG on the other FL task. This demonstrates that optimization-based methods are very sensitive to the design of the objective function.

\paragraph{Out-of-distribution (OOD) auxiliary data.} Instead of assuming that the adversary has in-distribution auxiliary texts, we relax the assumption so that the adversary only knows the word frequencies. We can then independently sample a token for each position in the sentence to obtain a set of pseudo data. The pseudo data are out-of-distribution because they lose the dependency between different positions of a sentence. We train LTI on the pseudo data and call the resulting attack LTI-OOD.
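
As a minimal sketch of this sampling step, in notation introduced here only for illustration (the vocabulary $V$, the token counts $c(v)$, and the sentence length $T$ are assumptions of this sketch, as is estimating the word frequency from raw counts), each pseudo sentence $\tilde{x} = (\tilde{x}_1, \dots, \tilde{x}_T)$ can be viewed as a draw from the product of unigram marginals:
\[
\hat{p}(v) = \frac{c(v)}{\sum_{v' \in V} c(v')}, \qquad
\Pr(\tilde{x}) = \prod_{t=1}^{T} \hat{p}(\tilde{x}_t).
\]
Under this model the tokens at different positions are independent by construction, which is precisely the dependency that real sentences have and the pseudo data lack.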

The results of LTI-OOD are presented in Table~\ref{tab:res_nlp}. LTI outperforms TAG on both the CoLA and WikiText datasets on most metrics under all gradient defense settings. Moreover, we observe that LTI-OOD even outperforms LTI on the WikiText dataset. We hope that these promising OOD results motivate further exploration of the OOD generalization of LTI in future work.

%\autoref{tab:res_nlp} shows quantitative comparison between LTI (and its variant LTI-P) and TAG against various defenses. The overall trend is remarkably consistent: LTI and LTI-P outperform TAG in all four metrics for all defense settings, with LTI-P achieving state-of-the-art recovery accuracy by far. This result suggests that knowledge of the marginal token distribution encodes enough data prior for LTI-P to train the inversion model, and having access to infinite training data allows it to better generalize to the test set compared to LTI. In practice, it is very plausible that the marginal token distribution is known to the adversary, and hence LTI-P serves as a surprisingly simple and effective baseline for gradient inversion in NLP.

%\autoref{fig:text_example} shows 3 random test samples from WikiText and their reconstructions using LTI-P and TAG, with tokens that are correctly reconstructed highlighted in blue. Without any defense, both TAG and LTI-P yield reasonably accurate reconstructions, with LTI-P faithfully reconstructing all but 1-2 tokens. 
%With the sign compression defense applied, TAG fails to recover \emph{any} token correctly, whereas LTI-P can faithfully recover almost half of the tokens in each sample. Results for gradient pruning and gradient perturbation yield similar conclusions, with TAG recovering a larger but still relatively insignificant set of tokens. Additional samples are given in the appendix.
