\subsection{Inverse Language Modeling}
ILM builds on PAG by extending it to the language modeling setting. The method is non-iterative and relies on double backpropagation~\cite{drucker1992improving}, which can be efficiently implemented with standard autograd tools.

Instead of training the LLM \emph{only} to maximize $p(\by|\bx)$, ILM also inverts the model: from the output $\by$, we aim to reconstruct the input $\bx$.
This procedure is not merely a double forward pass over the original text and its reverse. We first impose a loss for $p(\by|\bx)$; then, before updating the weights, we also compute the gradients with respect to the input tokens, $\nabla_\bx \Loss(\bx,\by;\net)$, and require them to predict some of the tokens in $\bx$. Which tokens are predicted depends on the model variant, as discussed later.
This focus on bidirectional understanding during pretraining is key to improving the model’s comprehension.

\begin{figure}
    \centering
    \includegraphics[alt={Diagram showing the relationship between forward hidden states and backward gradients in a language model. In the forward pass, token embeddings are processed to produce hidden states and output logits for next-token prediction. In the backward pass, gradients of the loss with respect to token embeddings are transformed and passed through the same language model head to reconstruct input tokens. The figure highlights the symmetry between hidden states and gradients for prediction.
},width=\linewidth]{assets/grad_lm_head_parallelism.drawio.pdf}
    \caption{Parallelism between last hidden states and embedding gradients: both can be mapped through the LM head to token predictions.}
    \label{fig:grad_lm_head_parallelism}
\end{figure}

In a causal Transformer architecture, the influence of a single input token demonstrates a fundamental asymmetry between the forward and backward passes. During the forward pass, causal masking ensures that the hidden state at step \( t \) depends solely on the current and preceding tokens (i.e., \( h_t = f(e_{\le t}) \)). Therefore, modifying a token embedding \( e_i \) affects the forward activations only for steps \( t \ge i \), while the preceding states (where \( t < i \)) remain completely unchanged.

In contrast, this causal isolation disappears during backpropagation. The total loss \( \mathcal{L} = \sum_t \mathcal{L}_t \) is computed by evaluating future queries against past keys and values. This means that the loss at step \( t \) becomes a highly non-linear function of all preceding tokens. Changing \( e_i \) affects the queries, the attention weights, and subsequently the states for all \( t \ge i \). When these modified future states are evaluated against the keys of past tokens (\( e_{<i} \)), the backward gradient signals sent to those earlier embeddings are fundamentally altered.

As a result, while the forward pass strictly prevents a token from influencing the past, the dense mixing in the backward pass ensures that modifying a single token will change the gradients propagated back to preceding tokens.
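
The asymmetry above can be written compactly. The following is a sketch under the assumption that each per-step loss \( \mathcal{L}_t \) depends on the input embeddings only through the final hidden state \( h_t \), with the embeddings treated as free input variables; the Jacobian notation \( \partial h_t / \partial e_j \) is introduced here for illustration only.
\begin{equation*}
\frac{\partial h_t}{\partial e_i} = 0 \quad \text{for } t < i,
\qquad
\nabla_{e_j} \mathcal{L} = \sum_{t \ge j} \left( \frac{\partial h_t}{\partial e_j} \right)^{\!\top} \nabla_{h_t} \mathcal{L}_t .
\end{equation*}
For \( j < i \), the sum for \( \nabla_{e_j} \mathcal{L} \) contains terms with \( t \ge i \), and both factors of those terms depend on \( e_i \) through \( h_t = f(e_{\le t}) \). Hence a change to \( e_i \) leaves \( h_j \) untouched but generally changes \( \nabla_{e_j} \mathcal{L} \).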

We use ILM at training time. At test time, we use the GCG algorithm: given an original text prompt $\bx$ and its completion $\by$, GCG searches for a new, nonsensical prompt $\bxa$ such that $\Loss(\bxa,\by;\net) \ll \Loss(\bx,\by;\net)$, where $\Loss$ is the next-token prediction loss of the LLM and $\net$ denotes the LLM's parameters. We show that such a nonsensical prompt $\bxa$ can achieve a lower loss on $\by$ than the natural prompt $\bx$.

The standard formulation of PAG in image classification (Eq.~\ref{eq:pag_classification_loss}) is not directly transferable to LLMs due to fundamental differences in the input structure. Images are continuous tensors that encode class-discriminative information locally, whereas LLMs operate on discrete tokens that depend on full sequence context and are represented as one-hot vectors with limited semantic content.

Moreover, computing gradients with respect to input tokens is challenging: token-level gradients lack global context, and the large vocabulary size -- often hundreds of thousands of tokens -- makes class-iterative PAG computationally infeasible due to the high dimensionality of the prediction space.

Consequently, our approach deviates from the standard PAG for classifiers. We focus solely on gradients with respect to the actual input tokens and use them to classify those tokens, as in Figure~\ref{fig:inverse_lm_idea_schema}, rather than constraining the gradient direction with a cosine distance. We also aim to leverage this formulation for LLM inversion.

For these experiments, our model architecture is a small decoder-only transformer with Weight Tying~\cite{press2017using,inan2016tying} enabled, 3 hidden layers, and a hidden-layer vector size of 640.

The dataset is TinyStories~\cite{eldan2023tinystories}. The tokenizer is trained from scratch with standard Byte-Pair Encoding~\cite{gage1994new}, which provides a flexible vocabulary size and supports experiments of varying complexity and entropy in next-token classification. Specifically, we use a vocabulary of 2048 tokens.
In addition, the dataset samples include an overlap of $25\%$ with the original sentences. This overlap increases variability in sentence starts, providing the model with more diverse context patterns and better approximating realistic sentence-completion scenarios.
The final dataset\footnote{\url{https://huggingface.co/datasets/DaveGabe/TinyStoriesV2_cleaned}} has been uploaded to HuggingFace for reproducibility.

The backward prediction strategy shares a common logic across all model variants, which can be formalized as follows. Given the cross-entropy loss between predicted tokens and their ground truth, we compute the gradient with respect to the embedding vectors. To handle different variants consistently, we define a mapping $\phi(\be_i, \nabla_{\be_i} \Loss_{CE})$ that specifies how the gradient is interpreted for classification.
The output of $\phi$ is then normalized and multiplied by the LM head weight matrix, and a softmax yields a probability distribution over the vocabulary.
The general backward prediction for any token can then be written as:

\begin{equation}
\begin{split}
    \Loss_{CE} &= \text{CE}(\by_\text{true}, \by_\text{pred}) \\
    \bg_i &= \text{LayerNorm}(\phi(\be_i, \nabla_{\be_i} \Loss_{CE})) \\
    \bz_i &= \bW_\text{LM\_head} \; \bg_i \\
    \mathbf{\hat{y}}_i &= \text{softmax}(\bz_i)
\end{split}
\label{eq:inverse_lm_backward_prediction_general}
\end{equation}

This formulation highlights a parallelism between the forward and backward modes (Figure~\ref{fig:grad_lm_head_parallelism}): the gradient vector with respect to a token embedding plays the same role as the last hidden state in next-token prediction. In particular, replacing $\nabla_{\be_i}\Loss_{CE}$ with the last hidden state recovers the standard forward pass of an LLM, since both share the same dimensionality.
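
Because $\bg_i$ is itself built from a gradient, training on the backward prediction requires differentiating through $\nabla_{\be_i}\Loss_{CE}$, which is where double backpropagation enters. As a sketch, write $\mathbf{u}_i = \nabla_{\be_i}\Loss_{CE}$ and let $\Loss_{\text{inv}}$ denote the cross-entropy between $\mathbf{\hat{y}}_i$ and the token to reconstruct; both symbols are shorthand introduced here, not notation used elsewhere. The chain rule then gives
\begin{equation*}
\frac{d \Loss_{\text{inv}}}{d \net}
= \sum_i \left( \frac{\partial \mathbf{u}_i}{\partial \net} \right)^{\!\top} \frac{\partial \Loss_{\text{inv}}}{\partial \mathbf{u}_i}
+ \left.\frac{\partial \Loss_{\text{inv}}}{\partial \net}\right|_{\mathbf{u}\ \text{fixed}},
\qquad
\frac{\partial \mathbf{u}_i}{\partial \net} = \frac{\partial^2 \Loss_{CE}}{\partial \net\, \partial \be_i}.
\end{equation*}
The first term contains a mixed second derivative of the forward loss; autograd frameworks typically evaluate it as a vector-Jacobian product without forming the full matrix. The second term collects the explicit dependence on the parameters, for example through $\bW_\text{LM\_head}$.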

The final training loss combines the forward and inverse LM objectives, with $\lambda = 2.0$ found to be a suitable hyperparameter:
\begin{equation}
    \Loss =
\underbrace{\Loss_{CE}(\by_\text{true}, \by_\text{pred})}_{\text{Forward: from the input $\bx$, encode $\by$}}
+
\underbrace{\lambda\ \Loss_{CE}(\bx, \mathbf{\hat{y}})}_{\text{Backward: from $\by$, decode back $\bx$}}
\end{equation}

\subsection{ILM Model Variants}
We evaluate four training strategies, each differing in how gradients are used for inversion:
\newline
\textbf{Baseline.} The model is trained only with the standard forward Cross-Entropy loss, without any gradient-based inversion objective.

\noindent\textbf{Inv-First.} Only the first token of the sentence is reconstructed from its gradient, by predicting  
\begin{equation}
p(\bx_0 \mid \nabla_{\be_0} \Loss_{\text{CE}}(f_\theta(\text{\texttt{[PAD]}} \,\|\, \bx_{1:N}), \by)).
\label{eq:invfirst}
\end{equation}

\noindent\textbf{Bert-like.} A subset of tokens $\mathcal{M} \subseteq [1, N]$ is masked, and the model reconstructs only those tokens from their corresponding gradients:
\begin{equation}
p(\bx_i \mid \nabla_{\be_i} \Loss_{\text{CE}}(f_\theta(\bx_{\setminus \mathcal{M}}), \by)), 
\quad \forall i \in \mathcal{M}.
\label{eq:bert_like}
\end{equation}
This formulation is analogous to the BERT~\cite{devlin2019bert} training scheme, but operates on gradients in the backward pass.

\noindent\textbf{Identity.} During training, the model is required to reconstruct every input token directly from its corresponding gradient:  
\begin{equation}
p(\bx_i \mid \nabla_{\be_i} \Loss_{\text{CE}}(f_\theta(\bx), \by)), \quad \forall i \in [1, N].
\label{eq:identity}
\end{equation}
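
The three inversion variants can be read as the same backward objective evaluated over different sets of positions. As a sketch, let $\mathcal{I}$ denote the set of reconstructed positions, with $\mathcal{I} = \{0\}$ for Inv-First, $\mathcal{I} = \mathcal{M}$ for Bert-like, and $\mathcal{I} = [1, N]$ for Identity. The symbol $\mathcal{I}$, the indexing $\mathbf{\hat{y}}_i[\bx_i]$ for the predicted probability of token $\bx_i$, and the averaging over $|\mathcal{I}|$ are assumptions made for this summary rather than details stated above:
\begin{equation*}
\Loss_{CE}(\bx, \mathbf{\hat{y}}) = -\frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \log \mathbf{\hat{y}}_i[\bx_i].
\end{equation*}
In each case, $\mathbf{\hat{y}}_i$ is computed from the gradient of that variant's own forward pass: with the first token padded, with the tokens in $\mathcal{M}$ masked, or with the full input.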

Each strategy comes in two versions, depending on how the function $\phi(\be_i, \nabla_{\be_i} \Loss_{CE})$ from Equation~\ref{eq:inverse_lm_backward_prediction_general} is implemented.
When we treat gradients as directions, we classify on $\be_i - \nabla_{\be_i} \Loss_{CE}$, that is, we set
$\phi(\be_i, \nabla_{\be_i} \Loss_{CE}) = \be_i - \nabla_{\be_i} \Loss_{CE}$.
When we instead use the gradients as pure values, the mapping is simpler: $\phi(\be_i, \nabla_{\be_i} \Loss_{CE}) = \nabla_{\be_i} \Loss_{CE}$, which discards the input embedding.
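
The two choices of $\phi$ can be summarized side by side; the labels in parentheses are shorthand for the two cases just described:
\begin{equation*}
\phi(\be_i, \nabla_{\be_i} \Loss_{CE}) =
\begin{cases}
\be_i - \nabla_{\be_i} \Loss_{CE} & \text{(gradient as direction)} \\
\nabla_{\be_i} \Loss_{CE} & \text{(gradient as value)}
\end{cases}
\end{equation*}
The first case coincides with a single gradient-descent step on the embedding with unit step size, that is, $\be_i - \eta \nabla_{\be_i} \Loss_{CE}$ at $\eta = 1$; the step size $\eta$ is introduced here only to make that reading explicit.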