%\setcounter{theorem}{0}
%\setcounter{lemma}{0}
%\setcounter{assumption}{0}
% \setcounter{equation}{0}

\section{Modifications for Baseline Methods}
\paragraph{Vision baselines.} 
IG and GI-GIP optimize the dummy data by minimizing the cosine distance between the received gradient and the gradient of the dummy data.
However, reusing this objective unchanged is not appropriate when defense mechanisms are applied.

For the \emph{sign compression} defense, this loss function does not optimize the correct objective since the dummy data's gradient is \emph{not} a vector with $\pm 1$ entries but rather a real-valued vector with the same sign.
When $B=1$, we can simply replace the cosine distance with the loss $\sum_{i=1}^m\left(\ell_{\rm sign}^i\right)^2$, where
\begin{equation}
    \label{eq:sign_loss}
    \ell_{\rm sign}^i =\max\left\{-\nabla_{\bw_i} \ell(f_{\bw}(\tilde{\bx}), \tilde{y})\cdot \mathrm{Sign}\left( \nabla_{\bw_i} \ell(f_{\bw}(\bx), y)\right), 0\right\}.
\end{equation}
One sanity check for this loss is that when $\nabla_{\bw_i} \ell(f_{\bw}(\tilde{\bx}), \tilde{y})$ has the same sign as that of $\nabla_{\bw_i} \ell(f_{\bw}(\bx), y)$, the minimum loss value of $0$ is achieved.
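As a small worked instance of \autoref{eq:sign_loss} (the numbers are purely illustrative and are not taken from any experiment), take $m=3$, suppose the dummy data gradient has entries $(0.5,\,-0.2,\,0.1)$, and suppose the received signs are $(+1,\,+1,\,-1)$. Then
\[
    \ell_{\rm sign}^1 = \max\{-0.5,\,0\} = 0, \qquad
    \ell_{\rm sign}^2 = \max\{0.2,\,0\} = 0.2, \qquad
    \ell_{\rm sign}^3 = \max\{0.1,\,0\} = 0.1,
\]
so that $\sum_{i=1}^{3}\left(\ell_{\rm sign}^i\right)^2 = 0.04 + 0.01 = 0.05$. Only the two coordinates whose signs disagree contribute, and each is penalized according to its magnitude.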
When $B>1$, the above objective can no longer be applied, because the adversary only receives the average of the per-sample sign-compressed gradients and does not know the gradient sign of each individual sample.
Because the derivative of the sign operation is zero almost everywhere, we cannot simply compute the average of the sign gradients of the dummy data and reuse the cosine distance as the objective function.
However, the $\tanh$ function approximates the sign operation and is differentiable.
Thus, our solution is to apply $\tanh$ to the gradient of each dummy sample, average the results, and minimize the cosine distance between this average and the received gradient.
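One way to write this surrogate explicitly is sketched below. The notation is introduced here only for illustration: $\mathbf{g}$ denotes the received (averaged sign) gradient, $(\tilde{\bx}_j,\tilde{y}_j)$ the $j$-th dummy sample, and $\tau>0$ a hypothetical sharpness parameter that the description above does not specify ($\tau=1$ recovers plain $\tanh$). The cosine distance is assumed to take the usual form of one minus the cosine similarity.
\[
    \tilde{\mathbf{g}} = \frac{1}{B}\sum_{j=1}^{B}\tanh\bigl(\tau\,\nabla_{\bw}\ell(f_{\bw}(\tilde{\bx}_j),\tilde{y}_j)\bigr),
    \qquad
    \mathcal{L}_{\tanh} = 1-\frac{\langle \tilde{\mathbf{g}},\mathbf{g}\rangle}{\|\tilde{\mathbf{g}}\|_2\,\|\mathbf{g}\|_2}.
\]
Here $\tanh$ is applied entrywise. For larger $\tau$ it matches $\mathrm{Sign}$ more closely on nonzero entries, while for any finite $\tau$ its derivative stays nonzero, so the objective remains usable for gradient-based optimization of the dummy data.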

For the \emph{gradient pruning} defense, minimizing the cosine distance between the dummy data gradient and the pruned ground truth gradient forces the dummy gradient toward $0$ in every pruned dimension, even though $0$ is not the true value of the ground truth gradient there.
Therefore, we compute the cosine distance only over the non-zero dimensions of the pruned gradient.
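Written out under assumed notation (introduced here for illustration only), let $\mathbf{g}$ be the received pruned gradient, $\tilde{\mathbf{g}}$ the dummy data gradient, and $S=\{i : g_i\neq 0\}$ the set of retained coordinates. Taking the cosine distance to be one minus the cosine similarity, the masked objective can be sketched as
\[
    \mathcal{L}_{\rm prune} = 1-\frac{\sum_{i\in S}\tilde{g}_i\, g_i}{\sqrt{\sum_{i\in S}\tilde{g}_i^{2}}\;\sqrt{\sum_{i\in S} g_i^{2}}}.
\]
Coordinates outside $S$ enter neither the inner product nor the norms, so the dummy gradient is left unconstrained there instead of being pushed toward $0$.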

\paragraph{Language baselines.} For TAG, we find that the loss function also needs to be modified slightly to accommodate the \emph{sign compression} and \emph{gradient pruning} defenses:
\begin{itemize}[leftmargin=*,nosep]
    \item \textit{Sign compression.} Similar to the vision baselines, the $\ell_2$ and $\ell_1$ distances between the dummy data gradient and the ground truth gradient sign do not optimize the correct objective. When $B=1$, we can simply replace $\|\cdot\|_2^2$ and $\|\cdot\|_1$ by $\sum_{i=1}^m \left(\ell_{\rm sign}^i\right)^2$ and $\sum_{i=1}^m\ell_{\rm sign}^i$, respectively, where $\ell_{\rm sign}^i$ is defined in \autoref{eq:sign_loss}. When $B>1$, we make the same $\tanh$-based modification as for the vision baselines.
    \item \textit{Gradient pruning.} We make the same modification to TAG as in the vision baselines.
\end{itemize}
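
As an illustration of the $B=1$ sign compression case, assume that the original TAG objective combines the two distances as $\|\cdot\|_2^2+\alpha\|\cdot\|_1$ with a weighting coefficient $\alpha$ (our shorthand here; the exact weighting is the one used by TAG). Under that assumption, the modified objective would read
\[
    \sum_{i=1}^m \left(\ell_{\rm sign}^i\right)^2 + \alpha\sum_{i=1}^m \ell_{\rm sign}^i ,
\]
with $\ell_{\rm sign}^i$ as in \autoref{eq:sign_loss}. Since every $\ell_{\rm sign}^i$ is non-negative, the second sum is exactly the $\ell_1$ norm of the vector of per-coordinate sign violations, and the first sum is its squared $\ell_2$ norm.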

\section{Additional Quantitative Evaluation}
In the vision experiments, we evaluate the gradient inversion attacks using four metrics: MSE, PSNR, LPIPS, and SSIM. The main text reports the results for MSE; \autoref{tab:res_cv_psnr}, \autoref{tab:res_cv_lpips}, and \autoref{tab:res_cv_ssim} report the results for PSNR, LPIPS, and SSIM, respectively. Consistent with the trends in the MSE table, LTI performs best whenever a defense mechanism is applied.
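For reference, PSNR is related to MSE through the standard definition below, where $\mathrm{MAX}$ denotes the maximum possible pixel value (e.g., $\mathrm{MAX}=1$ for images scaled to $[0,1]$, which is an assumption about the preprocessing that this appendix does not state):
\[
    \mathrm{PSNR} = 10\log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right).
\]
For instance, with $\mathrm{MAX}=1$, a reconstruction with $\mathrm{MSE}=0.01$ has a PSNR of $10\log_{10}(100)=20$ dB. When PSNR is computed per image and then averaged, the reported value generally differs from the PSNR of the averaged MSE.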
\begin{table*}[t!]
\centering
    \resizebox{\linewidth}{!}{
    \begin{tabular}{c|c|cccc|cccc}
    \toprule
     \multirow{2}{*}{\textbf{FL model}}  & \multirow{2}{*}{\textbf{Methods}} & \multicolumn{4}{c|}{$B=1$}& \multicolumn{4}{c}{$B=4$}\\
     \cmidrule{3-10}
         & & \textbf{None} & \textbf{Sign Comp.} & \textbf{Grad. Prun.} & \textbf{Gauss. Pert.} & \textbf{None} & \textbf{Sign Comp.} & \textbf{Grad. Prun.} & \textbf{Gauss. Pert.}\\ 
        \midrule
        \multirow{3}{*}{LeNet} & IG & 22.290 & 9.981 & 8.807 & 8.349 & 10.102 & 5.808 & 8.175 & 6.891\\ 
         & GI-GIP   & \textbf{33.374} & 13.574 & 14.356 & 9.383 &  \textbf{23.891} & 10.953 & 7.606 & 8.347\\ 
         & LTI (Ours) & 24.837 &  \textbf{18.986} &  \textbf{15.897} &  \textbf{20.249} & 19.491 &  \textbf{16.991} &  \textbf{15.643} &  \textbf{16.619}\\
         \midrule
         \multirow{3}{*}{ResNet20} & IG & 9.285 & 8.416 & 7.722 & 8.934 & 9.171 & 5.675 & 7.207 & 9.225\\ 
         & GI-GIP   & 12.609  & 10.391  & 6.286  & 6.461 & 11.064 & 6.532 & 6.562 & 6.622\\ 
         & LTI (Ours) & \textbf{18.007} & \textbf{19.435} & \textbf{16.957} & \textbf{17.367} & \textbf{12.593} & \textbf{12.290} & \textbf{12.530} & \textbf{12.613}\\
     \bottomrule
    \end{tabular}%
    }
    \caption{PSNR (higher is better) for the baselines (IG and GI-GIP) and our method LTI on CIFAR10. Bold marks the best result in each column for each FL model.}
    \label{tab:res_cv_psnr}
    \vspace{-2ex}
\end{table*}

\begin{table*}[t!]
\centering
    \resizebox{\linewidth}{!}{%
    \begin{tabular}{c|c|cccc|cccc}
    \toprule
      \multirow{2}{*}{\textbf{FL model}}  & \multirow{2}{*}{\textbf{Methods}} & \multicolumn{4}{c|}{$B=1$}& \multicolumn{4}{c}{$B=4$}\\
     \cmidrule{3-10}
         & & \textbf{None} & \textbf{Sign Comp.} & \textbf{Grad. Prun.} & \textbf{Gauss. Pert.} & \textbf{None} & \textbf{Sign Comp.} & \textbf{Grad. Prun.} & \textbf{Gauss. Pert.}\\ 
        \midrule
        \multirow{3}{*}{LeNet} & IG         & 0.263 & 0.677          & 0.675          & 0.653          & 0.615          & 0.712          & 0.690          & 0.691          \\
& GI-GIP     & \textbf{0.033}                & 0.471          & 0.474          & 0.568          & \textbf{0.212} & 0.586          & 0.695          & 0.678          \\
& LTI (Ours) & 0.221                & \textbf{0.396} & \textbf{0.472} & \textbf{0.370} & 0.391 & \textbf{0.467} & \textbf{0.489} & \textbf{0.470}\\
    \midrule
    \multirow{3}{*}{ResNet20} & IG         & 0.655 & 0.678          & 0.688          & 0.660          & 0.658          & 0.714          & 0.704          & 0.656          \\
& GI-GIP     & 0.557                & 0.650          & 0.706          & 0.701          & \textbf{0.586} & 0.671          & 0.714          & 0.712          \\
& LTI (Ours) & \textbf{0.524}                & \textbf{0.431} & \textbf{0.541} & \textbf{0.529} & 0.628 & \textbf{0.580} & \textbf{0.609} & \textbf{0.620}\\
     \bottomrule
    \end{tabular}%
    }
    \caption{LPIPS (lower is better) for the baselines (IG and GI-GIP) and our method LTI on CIFAR10. Bold marks the best result in each column for each FL model.}
    \label{tab:res_cv_lpips}
    \vspace{-2ex}
\end{table*}


\begin{table*}[t!]
\centering
    \resizebox{\linewidth}{!}{%
    \begin{tabular}{c|c|cccc|cccc}
    \toprule
      \multirow{2}{*}{\textbf{FL model}}  & \multirow{2}{*}{\textbf{Methods}} & \multicolumn{4}{c|}{$B=1$}& \multicolumn{4}{c}{$B=4$}\\
     \cmidrule{3-10}
         & & \textbf{None} & \textbf{Sign Comp.} & \textbf{Grad. Prun.} & \textbf{Gauss. Pert.} & \textbf{None} & \textbf{Sign Comp.} & \textbf{Grad. Prun.} & \textbf{Gauss. Pert.}\\ 
        \midrule
        \multirow{3}{*}{LeNet} & IG         & 0.711         & 	0.060         & 	0.052         & 	0.149         & 	0.020         & 	0.018         & 	0.025     & 	0.058\\
& GI-GIP     & \textbf{0.970}		&	0.301	&	0.346	&	0.072	&	\textbf{0.805}	&	0.307	&	0.010	&	0.013         \\
& LTI (Ours) & 0.845	&	\textbf{0.599}	&	\textbf{0.378}	&	\textbf{0.636}	&	0.583	&	\textbf{0.432}	&	\textbf{0.330}	&	\textbf{0.425}\\
    \midrule
    \multirow{3}{*}{ResNet20} & IG         & 0.071	&	0.037	&	0.023	&	0.067		&0.046	&	0.009	&	0.018	&	0.053         \\
& GI-GIP     & 0.167		&0.049	&	0.004	&	0.008	&	0.100	&	0.034	&	0.012	&	0.012 \\
& LTI (Ours) & \textbf{0.417	}	&\textbf{0.556}	&	\textbf{0.349}	&	\textbf{0.376}	&	\textbf{0.194}	&	\textbf{0.256}	&	\textbf{0.210}	&	\textbf{0.201}\\
     \bottomrule
    \end{tabular}%
    }
    \caption{SSIM (higher is better) for the baselines (IG and GI-GIP) and our method LTI on CIFAR10. Bold marks the best result in each column for each FL model.}
    \label{tab:res_cv_ssim}
    \vspace{-2ex}
\end{table*}

\section{Auxiliary Dataset Ablation Studies}
\begin{figure}[h]
    \centering
    \includegraphics[width=\linewidth]{sections/figs/size_vision_appendix.pdf}
    \caption{Plot of reconstruction PSNR / LPIPS / SSIM vs. auxiliary dataset size on CIFAR10.}
    \label{fig:ablation_appendix_size}
\end{figure}
\begin{figure}[h]
    \centering
    \includegraphics[width=\linewidth]{sections/figs/shift_vision_appendix.pdf}
    \caption{Plot of reconstruction PSNR / LPIPS / SSIM vs. auxiliary dataset distribution on CIFAR10.}
    \label{fig:ablation_appendix_shift}
\end{figure}

In the experiment section, we showed reconstruction MSE for LTI as a function of the auxiliary dataset size and the shift factor $\beta$. 
For completeness, we show the corresponding PSNR, LPIPS and SSIM curves in \autoref{fig:ablation_appendix_size} and \autoref{fig:ablation_appendix_shift}.
Similar to Figure 2 in the main text, when reducing the auxiliary dataset size (\emph{e.g.}, from $50{,}000$ to $5{,}000$) or reducing the proportion of in-distribution data (\emph{e.g.}, from $\beta=1$ to $\beta=0.1$), the performance of LTI does not worsen significantly.

\section{Additional Examples}

\begin{figure}[b]
    \centering
    \includegraphics[width=0.85\linewidth]{sections/figs/image_examples_more.pdf}
    \caption{Additional samples from CIFAR10 and their reconstructions from the gradient of LeNet.}
    \label{fig:img_example_more}
\end{figure}
\begin{figure}[b]
    \centering
    \includegraphics[width=0.85\linewidth]{sections/figs/image_examples_resnet.pdf}
    \caption{Additional samples from CIFAR10 and their reconstructions from the gradient of ResNet20.}
    \label{fig:img_example_resnet}
\end{figure}

\begin{figure}[h]
    \centering
    \includegraphics[width=\linewidth]{sections/figs/text_examples_cola.pdf}
    \caption{Samples from CoLA and their reconstructions.}
    \label{fig:text_example_cola}
\end{figure}

\begin{figure}[h]
    \centering
    \includegraphics[width=\linewidth]{sections/figs/text_examples_wikitext.pdf}
    \caption{Samples from WikiText and their reconstructions.}
    \label{fig:text_example_wikitext}
\end{figure}

\subsection{Examples on Vision Data}
\autoref{fig:img_example_more} shows additional samples from the CIFAR10 dataset and their reconstructions by each attack under various defense mechanisms when the gradients are computed from LeNet. As in the corresponding figure in the main text, all attacks can mostly reconstruct the data when no defense mechanism is applied, while LTI is the only successful method when the defense mechanisms are applied. 

\autoref{fig:img_example_resnet} shows examples when the FL model is ResNet20. We observe that LTI is the only method that reveals partial object information from the original images across all gradient settings, including the setting where no defense mechanism is applied.

\subsection{Examples on Language Data}
\autoref{fig:text_example_cola} shows three samples from the CoLA dataset, two good examples and one bad example (w.r.t. LTI), together with their reconstructions when different defense mechanisms are applied.
The first observation is that LTI performs significantly better than TAG, especially when the defense mechanisms are applied. Moreover, we find that the two methods make different types of reconstruction errors.
The errors of TAG come from both wrong token predictions and wrong token position predictions, and many random tokens appear in its reconstructions.
In contrast, the errors of LTI are mostly wrong token predictions, and the wrong tokens tend to be high-frequency tokens such as ``the''.

We also show three samples from the WikiText dataset and the gradient inversion results of TAG and LTI in \autoref{fig:text_example_wikitext}. 
The comparison between TAG and LTI matches the results of quantitative evaluation in the main text: TAG has perfect performance when sign compression is applied, while LTI outperforms TAG in the other three settings.
