\section{Experimental Evaluation}
We evaluate the models along two dimensions: their ability to recover the input sequence from a given continuation, and their robustness to GCG attacks. Both aspects are essential for developing robust and grounded LLMs.

\subsection{Inversion Procedure}
The evaluation relies on three complementary measures. Validation loss and validation accuracy are computed under the same conditions as training, serving as standard indicators of model fit. In addition, we report Inverse LM accuracy, which evaluates the model's ability to reconstruct a masked token $x_i$ from its gradients, given the remaining context. This provides a direct assessment of the backward prediction mechanism.

Following \autoref{sec:method}, inversion replaces a target token \( x_i \) according to the training strategy and computes \( \nabla_{e_i} \loss_\text{CE} \) on the modified sequence. The gradients are mapped via \( \phi(\cdot) \), normalized, and projected through the LM head to obtain \( \hat{y}_i \) (\autoref{eq:inverse_lm_backward_prediction_general}).

\subsection{Inversion Evaluation}
To assess inversion capabilities, we extend the task beyond single-token recovery and instead invert multiple tokens in an autoregressive manner.  
The procedure follows a beam-search strategy, as detailed in Algorithm~\ref{alg:inversion_evaluation}, where candidate prefixes are iteratively expanded and filtered by perplexity until a coherent reconstruction emerges.
\begin{algorithm}
\caption{Autoregressive Inversion Evaluation with Beam Search}
    \label{alg:inversion_evaluation}
    \begin{algorithmic}[1]
        \Require Input sample $\bx$ of length $n$, beam size $b$, split position $k$
        \Ensure Inverted prefix $\bx_{\mathrm{inv}}$
        \State $\bx_p \gets \bx_{0:k}$ \Comment{Original prefix (hidden)}
        \State $\bx_s \gets \bx_{k:n}$ \Comment{Visible suffix}
        \State $\bX \gets \{\bx_s\}$ \Comment{Initialize beam set with suffix only}
        \While{inverted prefix not sufficiently long}
            \For{each sequence $\bx' \in \bX$}
                \State Compute top-$b$ tokens for the position preceding $\bx'$
                \State Prepend each candidate token to $\bx'$, yielding $b$ extended sequences
            \EndFor
            \State $\bX \gets$ top-$b$ sequences with lowest perplexity
        \EndWhile
        \State \Return $\bx_{\mathrm{inv}} \gets \argmin_{\bx' \in \bX} \text{Perplexity}(\bx')$
    \end{algorithmic}
\end{algorithm}



In the evaluation process, we consider only the combination of initialization strategy and model variant used during training. In particular, the \texttt{Identity} variant initializes the unknown token using a simple bigram model. In contrast, all other variants use a fixed placeholder token (e.g., \texttt{<|pad|>}), consistent with their training setup. Since the true token is not available at inference time, the bigram initialization provides a reasonable approximation, yielding better inversion performance than random initialization or a fixed placeholder that was not observed during training in this context.

We evaluate inversion quality using both token-level and sequence-level metrics. 
Token-level metrics assess reconstruction fidelity and include Recall, Precision, F1 score, and Accuracy. 
Sequence-level metrics evaluate the plausibility and semantic alignment of the reconstructed prefix, including Full PPL (perplexity of the reconstructed prefix concatenated with the continuation), Prefix PPL (perplexity of the reconstructed prefix alone), and semantic similarity with respect to the ground-truth prefix.

Perplexity-based metrics are computed using a third-party language model (Llama-3.2-1B)\footnote{\url{https://huggingface.co/meta-llama/Llama-3.2-1B}} to decouple evaluation from the trained models. 
Semantic similarity is computed as the cosine similarity between sentence embeddings obtained from an external encoder\footnote{\url{https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2}}.
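
As a reference for how these two sequence-level quantities are typically computed, the following sketch states the standard definitions. The symbols $p_{\mathrm{ref}}$ (the third-party language model), $g$ (the sentence encoder), and $\mathbf{z}$, $\mathbf{u}$, $\mathbf{v}$ (generic token sequences, with $\mathbf{z}$ of length $m$) are notation introduced here for illustration only.
\begin{align*}
\mathrm{PPL}(\mathbf{z}) &= \exp\Big(-\frac{1}{m}\sum_{t=1}^{m} \log p_{\mathrm{ref}}(z_t \mid z_{<t})\Big), \\
\mathrm{Sim}(\mathbf{u}, \mathbf{v}) &= \frac{g(\mathbf{u})^\top g(\mathbf{v})}{\|g(\mathbf{u})\|_2 \, \|g(\mathbf{v})\|_2}.
\end{align*}
Under this reading, Prefix PPL evaluates $\mathrm{PPL}$ on the reconstructed prefix alone, Full PPL evaluates it on the reconstructed prefix followed by the continuation, and semantic similarity applies $\mathrm{Sim}$ to the reconstructed and ground-truth prefixes.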



\begin{table}[t]
\centering
\caption{
Inversion performance combining token-level and sequence-level metrics.
\textbf{Left:} token-level reconstruction metrics (Rec, Prec, F1, Acc).
\textbf{Right:} sequence-level metrics (Full PPL, Prefix PPL, Prefix Sim) evaluating plausibility and semantic alignment of the reconstructed prefix.
}
\label{tab:ilm-unified}
\resizebox{\linewidth}{!}{%
\setlength{\tabcolsep}{4pt}
\begin{tabular}{l cccc ccc}
\toprule
& \multicolumn{4}{c}{\textbf{Token-level}} & \multicolumn{3}{c}{\textbf{Sequence-level}} \\
\cmidrule(lr){2-5} \cmidrule(lr){6-8}
\textbf{Method} 
& \textbf{Rec $\uparrow$} 
& \textbf{Prec $\uparrow$} 
& \textbf{F1 $\uparrow$} 
& \textbf{Acc $\uparrow$} 
& \textbf{Full PPL $\downarrow$} 
& \textbf{Pref. PPL $\downarrow$} 
& \textbf{Pref. Sim $\uparrow$} \\
\midrule
Baseline 
& 20.9\%          & 18.8\%          & 19.7\%          & 2.4\% 
& \textbf{8.34} & 112.82 & 0.28 \\
\midrule
\multicolumn{8}{c}{\textit{Gradient as Value}} \\
Inv-First 
& 11.3\%          & 10.1\%          & 10.7\%          & 1.7\%
& 10.21 & 1576.23 & 0.25 \\
Bert-like 
& 2.9\%           & 2.7\%           & 2.8\%           & 0.3\%
& 11.54 & 5501.86 & 0.17 \\
Identity  
& 0.7\%           & 0.7\%           & 0.7\%           & 0.1\%  
& 13.88 & 14658.58 & 0.12 \\
\midrule
\multicolumn{8}{c}{\textit{Gradient as Direction}} \\
Inv-First 
& 13.3\%          & 12.0\%          & 12.6\%          & 2.4\%   
& 9.77 & 1012.80 & \textbf{0.30} \\
Bert-like 
& 0.1\%           & 0.1\%           & 0.1\%           & 0.1\%  
& 11.05 & 563.26 & 0.11 \\
Identity  
& \textbf{22.5\%} & \textbf{20.2\%} & \textbf{21.2\%} & \textbf{2.5\%}
& \textbf{8.34} & \textbf{106.31} & \textbf{0.30} \\
\bottomrule
\end{tabular}
}
\end{table}

Among the evaluated metrics, \emph{Full PPL} exhibits relatively low variance across models, as it is computed over the entire sequence, where the continuation dominates the total length. Consequently, differences between models are attenuated and should be interpreted comparatively.
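
This attenuation can be made explicit with a short decomposition. Assume, for illustration, that the reference model scores a sequence of $n$ tokens whose first $k$ tokens form the reconstructed prefix $\bx_{\mathrm{inv}}$ and whose remaining $n-k$ tokens form the suffix $\bx_s$, and that perplexity is the exponential of the mean token-level negative log-likelihood (special tokens and tokenizer mismatches are ignored in this sketch). Then
\[
\log \mathrm{PPL}_{\mathrm{full}} = \frac{k}{n}\,\log \mathrm{PPL}(\bx_{\mathrm{inv}}) + \frac{n-k}{n}\,\log \mathrm{PPL}(\bx_s \mid \bx_{\mathrm{inv}}),
\]
where $\mathrm{PPL}(\bx_s \mid \bx_{\mathrm{inv}})$ denotes the perplexity of the suffix tokens conditioned on the prefix. When $k \ll n$, the first term carries little weight, so even large changes in Prefix PPL move Full PPL only slightly.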

For reference, the perplexity of the ground-truth prefix is $37.83$, whereas reconstructed prefixes exhibit substantially higher values across all models. This highlights the intrinsic difficulty of the inversion task and the reduced naturalness of generated prefixes.

\emph{Prefix PPL} shows the largest variation across training strategies. Models that use gradients as raw values produce extremely high prefix perplexity (from 1576.23 to 14658.58), indicating poor fluency and unstable reconstructions. In contrast, using gradients as directions reduces Prefix PPL for every variant, down to 106.31 for Identity, suggesting that directional gradients provide a more informative signal for inversion and yield more coherent prefixes.

Overall, \emph{Identity (Gradient as Direction)} achieves the best trade-off between reconstruction fidelity and linguistic plausibility: it combines the highest token-level scores with the lowest Prefix PPL and the highest semantic similarity (0.30, tied with \emph{Inv-First (Gradient as Direction)}). None of the models, however, matches the naturalness of the ground-truth prefix.

\subsection{Robustness Against GCG}
We evaluate robustness using the success rate of Greedy Coordinate Gradient attacks~\cite{zou2023universal}.  
This benchmark aligns naturally with the objectives of our training procedure: our ultimate goal is to produce LLMs that are more grounded, ensuring that their responses are faithful to and informed by the prompts they receive.
The evaluation follows Algorithm~\ref{alg:single_sentence_gcg} and is applied to a randomly selected 30\% of the test-set samples, consistently across all model variants.

\begin{algorithm}
    \caption{Single-Sentence GCG Attack}
    \label{alg:single_sentence_gcg}
    \begin{algorithmic}[1]
        \Require Expected continuation string $\by$ to be attacked,
                    length of the attack prefix $n$,
                    number of iterations $T$
        \Ensure Best attack prefix $\bxa$ with loss $\loss_\text{GCG}$
        
        \State $\bxa \gets$ random one-hot token matrix of size $|V| \times n$
        \State $step \gets 0$                       \Comment{Iteration counter}
        \State $d \gets 0$                          \Comment{Loss non-decrease counter}
        \State $\loss_{\text{old}} \gets \infty$    \Comment{Last loss found}
        
        \While{$step < T$ \textbf{and} $d < 10$}
            \State Compute a batch of candidate prefixes $\bX$ running one step of \textbf{GCG}
            \State $\bxa \gets \argmin_{\bx \in \bX} \loss_\text{CE}(\bx, \by, \net)$
            \State $\loss_\text{GCG} \gets \loss_\text{CE}(\bxa, \by, \net)$    \Comment{Lowest loss in the current batch}
            \If{$\loss_\text{GCG} < \loss_{\text{old}}$}
                \State $\loss_{\text{old}} \gets \loss_\text{GCG}$
                \State $d \gets 0$
            \Else
                \State $d \gets d + 1$
            \EndIf
            \State $step \gets step + 1$
        \EndWhile
        
        \State \Return $\bxa$, $\loss_\text{GCG}$
    \end{algorithmic}
\end{algorithm}
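
For context, one step of GCG as described by Zou et al.~\cite{zou2023universal} can be summarized as follows. This is a recap of that paper rather than a description of our implementation: $K$ (number of candidate tokens per position) and $B$ (batch size) are hyperparameters of the original method whose values are not reported here, and $e_{x_i}$ denotes the one-hot vector of the $i$-th prefix token.
\begin{align*}
\mathcal{X}_i &= \text{Top-}K\big(-\nabla_{e_{x_i}} \loss_\text{CE}(\bxa, \by, \net)\big), \qquad i = 1, \dots, n, \\
\bX &= \big\{\, \bxa \text{ with token } i_b \text{ replaced by } v_b \;:\; i_b \sim \mathrm{Unif}\{1, \dots, n\},\; v_b \sim \mathrm{Unif}(\mathcal{X}_{i_b}),\; b = 1, \dots, B \,\big\}.
\end{align*}
The candidate in $\bX$ with the lowest loss then becomes the new prefix, which corresponds to the $\arg\min$ line of Algorithm~\ref{alg:single_sentence_gcg}.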

\begin{figure}[t]
\centering
\begin{minipage}{0.56\linewidth}
    \centering
    \includegraphics[alt={Line plot showing the success rate of Greedy Coordinate Gradient attacks as a function of the number of optimization iterations. Multiple curves represent different model variants, including baseline and ILM configurations using gradients as values or directions. The x-axis shows iteration count up to 500, and the y-axis shows attack success rate from 0 to 1. Lower curves indicate stronger robustness. Some variants achieve consistently lower success rates, while others converge to higher values as iterations increase.}, width=\linewidth]{assets/gcg_success_rate_varying_steps.pdf}
    \caption{
    GCG attack success rate (SR) as a function of the number of optimization iterations. Lower values indicate stronger robustness.
    }
    \label{fig:gcg_success_rate_varying_steps}
\end{minipage}
\hfill
\begin{minipage}{0.42\linewidth}
    \centering
    \small
    \captionof{table}{GCG attack success rate (SR) for ILM variants. Lower is better.}
    \label{tab:small_tinystories_gcg_results}
    \begin{tabular}{lcc}
    \toprule
    \textbf{Model} & \textbf{SR $\downarrow$} & \textbf{Steps ($\mu \pm \sigma$)} \\
    \midrule
    Baseline & 95.9\% & 277 $\pm$ 148 \\
    \midrule
    \multicolumn{3}{c}{\textit{Gradient as Value}} \\
    Inv-First  & 85.0\% & 320 $\pm$ 134 \\
    Bert-like  & \textbf{0.8\%} & 249 $\pm$ 148 \\
    Identity   & 88.1\% & 274 $\pm$ 145 \\
    \midrule
    \multicolumn{3}{c}{\textit{Gradient as Direction}} \\
    Inv-First  & 89.3\% & 313 $\pm$ 134 \\
    Bert-like  & 85.5\% & 287 $\pm$ 143 \\
    Identity   & \underline{82.8\%} & 284 $\pm$ 141 \\
    \bottomrule
    \end{tabular}
\end{minipage}
\end{figure}

From the results reported in Table~\ref{tab:small_tinystories_gcg_results}, all variants show a lower GCG success rate than the baseline. In particular, the variant identified as best in the inversion task, \emph{Identity (Gradient as Direction)}, reduces the attack success rate by more than 13 percentage points. We attribute this improvement to the model's stronger conditioning of the continuation on the input prompt (i.e., improved \emph{grounding}), which leads to increased robustness.
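
Concretely, using the success rates in Table~\ref{tab:small_tinystories_gcg_results}, the reduction for \emph{Identity (Gradient as Direction)} relative to the baseline is
\[
95.9\% - 82.8\% = 13.1 \text{ percentage points}, \qquad \frac{95.9 - 82.8}{95.9} \approx 13.7\%,
\]
so the figure exceeds 13 under both the absolute and the relative reading.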

However, the \emph{Bert-like (Gradient as Value)} variant exhibits an unusually low GCG success rate compared to all other models, suggesting markedly higher robustness to gradient-based white-box attacks. Given the magnitude of this effect, we repeated the experiments from the initial training phase to rule out potential artifacts. The results were consistent across runs, confirming the stability of this observation. Nevertheless, the underlying cause remains unclear and warrants further investigation.

We further analyze the relationship between the maximum number of GCG iterations and the attack success rate, defined as the fraction of tokens shared between the model outputs for the original input $\bx$ and the adversarial input $\bx'$. All other hyperparameters (e.g., search window width) are kept fixed. As shown in Figure~\ref{fig:gcg_success_rate_varying_steps}, some curves intersect as the number of iterations increases, suggesting that different variants may be more effective under different optimization budgets. For example, \emph{Inv-First (Gradient as Direction)} outperforms \emph{Bert-like (Gradient as Direction)} at low iteration counts, while the opposite holds at the maximum number of iterations. Overall, this effect is limited and does not substantially alter the conclusions drawn from Table~\ref{tab:small_tinystories_gcg_results}.


To further analyze GCG behavior, we report additional metrics computed on successful attacks. Specifically, we measure the cross-entropy loss on the original input, $\loss_{\text{CE}}(f_\theta(\bx), \by)$, and on the adversarial input, $\loss_{\text{CE}}(f_\theta(\bx'), \by)$, capturing how well the target continuation aligns with $\bx$ and $\bx'$, respectively. We also compute the KL divergence $\mathrm{KL}\big(f_\theta(\bx) \,\|\, f_\theta(\bx')\big)$ to quantify the shift in the model's output distribution induced by the perturbation, where $f_\theta(\bx)$ denotes the output distribution over $\by$.


From the metrics in Table~\ref{tab:gcg_merged}, we observe that robust variants, such as \emph{Identity (Gradient as Direction)}, not only exhibit a substantially lower attack success rate than the baseline, but also show a smaller reduction in cross-entropy loss when the attack succeeds.
This reduction (the Delta column), defined as the loss on the original input minus the loss on the adversarial sequence, quantifies the extent to which the perturbation misleads the model. Larger values indicate greater susceptibility, as the model assigns a higher likelihood to the target continuation $\by$ under the adversarial input than under the original one.
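
The Delta column of Table~\ref{tab:gcg_merged} is consistent with the following definition, written with the notation of this section:
\[
\Delta_{\text{CE}} = \loss_{\text{CE}}(f_\theta(\bx), \by) - \loss_{\text{CE}}(f_\theta(\bx'), \by).
\]
For the baseline, $13.28 - 10.97 = 2.31$, and for \emph{Bert-like (Gradient as Direction)}, $11.49 - 10.34 = 1.15$, matching the reported values. A positive $\Delta_{\text{CE}}$ means that the adversarial input yields a lower loss on $\by$ than the original input.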

Finally, the KL divergence measures the discrepancy between the output distributions $f_\theta(\bx)$ and $f_\theta(\bx')$. A larger divergence indicates that the model distinguishes more effectively between the original and adversarial inputs, reflecting stronger robustness to perturbations.
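
For reference, the KL divergence between two discrete distributions $p$ and $q$ over the vocabulary $V$ is
\[
\mathrm{KL}(p \,\|\, q) = \sum_{v \in V} p(v)\,\log\frac{p(v)}{q(v)},
\]
here read with $p = f_\theta(\bx)$ and $q = f_\theta(\bx')$ at a given position of $\by$. How the per-position values are aggregated over $\by$ (for example, by averaging) is an assumption of this sketch and is not specified above. The divergence is zero exactly when the two distributions coincide, and it is not symmetric in its arguments.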

\begin{table}[t]
\centering
\small
\caption{GCG attack evaluation. \textbf{Left:} metrics computed on the best adversarial prefix found by GCG (cross-entropy loss and KL divergence). \textbf{Right:} average statistics over successful adversarial prefixes evaluated with a third-party model (perplexity and semantic similarity).}
\label{tab:gcg_merged}
\begin{tabular}{rccccccc}
\toprule
 & \multicolumn{4}{c}{\textbf{Best Attack}} & \multicolumn{3}{c}{\textbf{Average Prefix}} \\
\cmidrule(lr){2-5} \cmidrule(lr){6-8}
 & \textbf{Orig. X} & \textbf{Attack X'} & \textbf{Delta} & \textbf{KL} 
 & \textbf{Orig. X} & \textbf{Attack X'} & \textbf{Semantic} \\
 & \textbf{CE $\downarrow$} & \textbf{CE $\downarrow$} & \textbf{CE $\downarrow$} & \textbf{Div. $\uparrow$}
 & \textbf{PPL} & \textbf{PPL $\downarrow$} & \textbf{Similarity $\uparrow$} \\
\midrule
Baseline & 13.28 & 10.97 & 2.31 & 2.19 & 44.14 & 17344.04 & 0.13 \\

\midrule
\multicolumn{8}{c}{\textit{Gradient as Value}} \\
Inv-First & \textbf{11.09} & \textbf{9.72} & \underline{1.37} & 2.44 & 44.81 & \underline{9431.09} & \underline{0.16} \\
Bert-like & 13.26 & 10.25 & 3.01 & \textbf{54.19} & 40.37 & 11817.21 & 0.11 \\
Identity  & 12.77 & 11.21 & 1.56 & 2.23 & 43.98 & \textbf{8322.25} & \textbf{0.18} \\

\midrule
\multicolumn{8}{c}{\textit{Gradient as Direction}} \\
Inv-First & \underline{11.21} & \underline{9.81} & 1.40 & 2.44 & 43.50 & 12344.85 & 0.13 \\
Bert-like & 11.49 & 10.34 & \textbf{1.15} & 2.23 & 44.74 & 10611.09 & 0.13 \\
Identity  & 12.58 & 11.12 & 1.46 & \underline{2.47} & 44.71 & 10929.21 & 0.15 \\

\bottomrule
\end{tabular}
\end{table}

To complete the evaluation, we adopt the same approach used for inversion and compute additional statistics with a third-party model, which abstracts from the biases of our own LLMs. Because the perplexity of the attack prefix is computed by an independent model, it reflects the actual naturalness of the generated prefix; the attacked model, being influenced by the attack itself, could wrongly rate the adversarial prefix as even more natural than the human-written one.

Recall that these metrics are computed only on successful attacks, since failed attacks are not informative about attack quality. As a consequence, the results for the \emph{Bert-like (Gradient as Value)} variant are based on far fewer samples, given its 0.8\% success rate.


\subsection{Forward Mode Evaluation}

\begin{table}[t]
\centering
\footnotesize
\caption{Forward-mode evaluation. Values report perplexity and cross-entropy (CE). $\Delta$ denotes change relative to the baseline.}
\label{tab:inverse_lm_forward_mode_quantitative}
\begin{tabular}{lcc|cc}
\toprule
 & \multicolumn{2}{c}{\textbf{Perplexity $\downarrow$}} & \multicolumn{2}{c}{\textbf{CE Loss $\downarrow$}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5}
\textbf{Method} & \textbf{Value} & \textbf{$\Delta$} & \textbf{Value} & \textbf{$\Delta$} \\
\midrule
Baseline & \textbf{4.83} & – & \textbf{1.58} & – \\
\midrule
\multicolumn{5}{c}{\textit{Gradient as Value}} \\
Inv-First  & 8.41 & +3.58 & 2.13 & +0.55 \\
Bert-like  & 5.79 & +0.96 & 1.76 & +0.18 \\
Identity   & 5.07 & +0.24 & 1.63 & +0.05 \\

\midrule
\multicolumn{5}{c}{\textit{Gradient as Direction}} \\
Inv-First  & 6.82 & +1.99 & 1.92 & +0.34 \\
Bert-like  & 5.42 & +0.59 & 1.69 & +0.11 \\
Identity   & 5.08 & +0.25 & 1.62 & +0.04 \\

\bottomrule
\end{tabular}
\end{table}

Finally, we verify that the proposed models retain proper forward-mode behavior, with limited performance degradation for most variants. By analyzing perplexity during both training and validation, we observe that introducing the gradient-based regularization term $\nabla_\be\loss_\text{CE}$ does not substantially impair the model's ability to generate fluent text during standard forward inference.
This result is particularly relevant, as adversarial training methods often require additional parameters or extended training to maintain comparable forward-mode performance, because part of the model's capacity is allocated to the adversarial objective.

As shown in Table~\ref{tab:inverse_lm_forward_mode_quantitative}, the worst-performing variant is \emph{Inv-First (Gradient as Value)}, with a perplexity of 8.41 against 4.83 for the baseline. Its higher perplexity is consistent with the qualitative examples in Table~\ref{tab:inverse_lm__forward_mode_execution}, in which its generated sentences appear less coherent and more repetitive than those from the other models.
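
The two halves of Table~\ref{tab:inverse_lm_forward_mode_quantitative} are linked by the usual identity between perplexity and cross-entropy, assuming the loss is measured in nats and averaged per token:
\[
\mathrm{PPL} = \exp(\loss_\text{CE}), \qquad \text{e.g., } \exp(2.13) \approx 8.41 \text{ for \emph{Inv-First (Gradient as Value)}}.
\]
The remaining rows agree with this identity to within a few hundredths; such gaps plausibly stem from rounding or from the order in which averaging and exponentiation are applied.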

\begin{table}
    \caption{Example completion for the given prompt, in forward mode.}
    \label{tab:inverse_lm__forward_mode_execution}
    \resizebox{\linewidth}{!}{
    \begin{tabular}{r c}
    \toprule
    \textbf{Method}           & \textbf{Completion for ``One day,''} \\
    \midrule
    Baseline   & a little boy named Tim wanted to travel to a far mountain. He asked his dad for a raft, \\
    
    \midrule
    \multicolumn{2}{c}{\textit{Gradient as Value}} \\
    Inv-First  & they pinch. They find gold. They take pictures of stars. \\
    Bert-like  & a little boy was walking in the park. He noticed a big, shiny object in the park. \\
    Identity   & a little girl named Lucy went to the park with her mom. Lucy liked to play on the swings \\

    \midrule
    \multicolumn{2}{c}{\textit{Gradient as Direction}} \\
    Inv-First  & hey went to the beach with his mom. He saw something shiny and strange inside. \\
    Bert-like  & a little girl named Lucy was playing in the garden. She saw a shiny ring on a branch. \\
    Identity   & a little girl named Amy was playing outside. She saw a big tree and thought it was a toy. \\
    \bottomrule
    \end{tabular}
    }
\end{table}