\section{Introduction}
While Large Language Models (LLMs) handle a wide variety of natural language processing (NLP) tasks and demonstrate impressive reasoning abilities, they still have notable limitations. A single foundation model can address many NLP challenges, but LLMs are prone to producing inaccurate information and are sensitive to slight changes in input, such as adversarial prompts. Recent work indicates that even nonsensical perturbations~\cite{zou2023universal,melamed2024prompts} can trigger these issues, highlighting the potential for backdoors~\cite{carlini2024poisoning}. These risks become particularly salient when LLMs are deployed in culturally diverse communities, where it is essential that model behavior remains consistent with local values and ethical expectations.

\begin{figure}
    \centering
    \includegraphics[alt={Illustration of the Inverse Language Modeling (ILM) framework. The forward process predicts the next token in a sequence using an autoregressive model, while the backward process reconstructs previous tokens from gradients of the prediction loss. The diagram contrasts standard forward language modeling with the proposed backward gradient-based inversion mechanism.
}, width=0.6\columnwidth]{assets/forward_backward_llm.pdf}
    \caption{Illustration of the Inverse Language Modeling (ILM) setup. The forward pass predicts the next tokens, and the backward pass reconstructs the inputs from gradients.}
    \label{fig:inverse_lm_idea_schema}
\end{figure}

These problems emphasize the need for adversarial training (AT) tools for LLMs. However, the literature on this topic is far sparser than that on deep classifiers, and the security of LLMs against adversarial perturbations remains an open challenge. Moreover, LLM training is already very costly, and applying AT would only increase that cost. At the same time, beyond robustness, there is a growing demand for mechanisms that help interpret why a model produces certain responses, especially on ethically sensitive or culturally situated topics~\cite{xhonneux2024efficient}.

In this work, we define \textbf{robustness} as reduced sensitivity to adversarially perturbed prompts, and \textbf{grounding} as ensuring that LLMs ``know what they have been asked'', addressing evidence that they often fail to represent their own knowledge faithfully~\cite{melamed2024prompts,bender2021dangers}.

In light of this, our objectives are twofold. The first centers on \textbf{Robustness}: we present a new, fast, and efficient adversarial training approach for LLMs, called \emph{Inverse Language Modeling (ILM)}. ILM draws on advances in developing robust classifiers, emphasizing that Perceptually Aligned Gradients (PAG) serve as a foundation for robustness in these models~\cite{ganz2023perceptually}.

Standard LLMs are typically trained in a forward mode, where a transformer~\cite{vaswani2017attention} predicts the continuation $\by$ of a text prompt $\bx$ through self-supervision. For clarity, we define the text prompt as $\bx$, which corresponds to the input token sequence $\{\bx_0, \dots, \bx_{i-1}\}$, while the target sequence $\by$ is a one-step left-shifted version of $\bx$. In contrast, ILM extends this paradigm by introducing a backward perspective (see Figure~\ref{fig:inverse_lm_idea_schema}): given an output $\by$, the model attempts to approximate the conditioning prompt $\bx$.
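
To make the forward mode concrete, the standard next-token objective can be sketched as a cross-entropy over the shifted targets. The symbols $\theta$ (model parameters), $p_{\theta}$ (the model's predictive distribution), and $\mathcal{L}_{\mathrm{CE}}$ are notation introduced here purely for illustration:
\[
  \mathcal{L}_{\mathrm{CE}}(\theta) \;=\; -\sum_{t=0}^{i-1} \log p_{\theta}\!\left(\by_t \mid \bx_0, \dots, \bx_t\right),
  \qquad \by_t = \bx_{t+1}.
\]
Under this indexing, the final target $\by_{i-1} = \bx_i$ is the token that immediately follows the input sequence.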

The second objective relates to \textbf{Grounded LLMs} and follows naturally from the first. By enabling inversion from $\by$ back to $\bx$, ILM provides a diagnostic signal that can support auditing workflows. While it does not guarantee exact prompt recovery, it allows us to trace plausible prompt approximations that could have produced a given (potentially malicious or undesired) output. In this way, ILM helps ground model behavior in more transparent and inspectable evidence.

Importantly, ILM does not simply reverse the token sequence. Instead, it reconstructs the input prompt through a gradient-based alignment process informed by both output probabilities and intermediate representations accumulated across model layers during the forward pass.