\begin{abstract}
Interpretability and robustness remain major challenges for modern Large Language Models (LLMs), especially in settings where conventional evaluation or auditing tools are limited. To address this, we propose \emph{Inverse Language Modeling} (ILM), a unified training framework that jointly improves robustness to adversarial perturbations and enables a novel form of gradient-based interpretability. Rather than reconstructing exact input prompts, ILM encourages LLMs to develop gradient-aligned internal representations that let the model approximate \emph{plausible} input patterns underlying a given output. This approximate inversion offers a new mechanism for analyzing model behavior and identifying potential triggers for unsafe generations, and it yields a diagnostic signal that may support future auditing workflows. Our results show that ILM can simultaneously improve robustness and produce meaningful inversion signals, laying a foundation for LLMs that are not only more resilient but also more transparent and analyzable. Code is available at \href{https://github.com/davegabe/pag-llm}{https://github.com/davegabe/pag-llm}.
\end{abstract}

\keywords{LLM \and Invertibility \and Adversarial Training \and Gradients \and Robustness.}