\section{Prior Work}
Research on adversarial attacks against LLMs has advanced rapidly, particularly in the generation of adversarial suffixes designed to bypass alignment safeguards. Gradient-guided techniques such as HotFlip~\cite{ebrahimi2018hotflip} and Greedy Coordinate Gradient (GCG)~\cite{zou2023universal} use gradients with respect to the input tokens to guide edits to the input text that induce undesirable behavior from LLMs. GCG iteratively replaces individual tokens, using gradient information to shortlist candidate substitutions. Subsequent enhancements, including Probe Sampling~\cite{zhao2024accelerating} and token similarity-based heuristics~\cite{li2024faster}, improved the efficiency of this search process.
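
As a rough illustration of the gradient-guided search described above, one GCG-style iteration can be sketched as follows. The notation is introduced only for this sketch and is not taken from the cited works: $\mathcal{V}$ is the vocabulary, $e_i$ is the one-hot indicator of the token at position $i$ of the current prompt $\bx$, $\Loss(\bx)$ is the attack loss (for instance, the negative log-likelihood of a target output), and $k$ is the number of candidate tokens kept per position.
\[
\mathcal{C}_i = \mathrm{top}\mbox{-}k\bigl(-\nabla_{e_i}\Loss(\bx)\bigr) \subseteq \mathcal{V},
\qquad
\bx \leftarrow \arg\min_{\tilde{\bx} \in \mathcal{B}} \Loss(\tilde{\bx}),
\]
where $\mathcal{B}$ is a batch of candidate prompts, each obtained from $\bx$ by replacing one randomly chosen position $i$ with a token drawn from $\mathcal{C}_i$. In this sketch the gradient only shortlists candidates; the substitution that is kept is chosen by evaluating the loss exactly on each candidate.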

More recent methods include AutoDAN~\cite{liu2023autodan}, which leverages genetic algorithms to produce fluent and stealthy adversarial suffixes, and its successor AutoDAN-Turbo~\cite{liu2024autodan}, which coordinates multiple LLMs for strategy development and attack evaluation. AdvPrompter~\cite{paulus2024advprompter} takes a different approach: it fine-tunes a model specifically to generate coherent adversarial suffixes, which allows fast, automated jailbreaking.

On the defensive side, perplexity-based filtering~\cite{alon2023detecting} has proven effective at identifying adversarial suffixes by exploiting their typically high perplexity. However, newer attacks are designed to bypass such detection mechanisms by optimizing for fluency and semantic plausibility.
In addition, work on language model inversion~\cite{morris2023language} explores the recovery of original prompts from output probabilities, similar to reconstruction techniques in computer vision. These findings have informed strategies for generating adversarial prompts using only output distributions.
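
For concreteness, the perplexity score used by perplexity-based filters can be written as follows. This is the standard definition, with notation introduced only for this sketch: $x_1,\dots,x_n$ are the tokens of a prompt $\bx$, $p_{\net}$ is the next-token distribution of a reference language model, and $\tau$ is a hypothetical detection threshold. The cited detector may combine perplexity with further features.
\[
\mathrm{PPL}(\bx) = \exp\Bigl(-\frac{1}{n}\sum_{i=1}^{n}\log p_{\net}(x_i \mid x_{<i})\Bigr),
\qquad
\mbox{flag } \bx \mbox{ if } \mathrm{PPL}(\bx) > \tau .
\]
Nonsensical suffixes receive low token probabilities and therefore a large $\mathrm{PPL}$, which is why attacks that optimize for fluency can evade this kind of filter.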

\begin{table}
\centering
    \caption{Original inputs $\bx$ and adversarial examples $\bxa$ generated with GCG for the SmolLM-360M model. Each original input and its adversarial counterpart yield the same output $\by$; the reported loss is computed on the output tokens, and the lower value in each pair is shown in bold.
    These examples show that LLMs can be manipulated into assigning lower loss to nonsensical prompts than to the original, meaningful input, a vulnerability that ILM is designed to address.}
    \label{tab:evil_twins_examples}
    \footnotesize
    \resizebox{\textwidth}{!}{
    \begin{tabular}{llc}\toprule
    \textbf{Input} & \textbf{Output $\by$} & \textbf{Loss} \\
    \midrule
    %\cline{2-3}
    $\bx$~: Stevens recorded and produced the album at multiple & \multirow{2}{*}{locations in the United}  & 5.3642 \\
    $\bxa$: Zo Certified Cities (. broadcastquartered Fitness Academy thirteen   & &  \textbf{5.1302} \\
    \midrule
    $\bx$~: After the introduction of the Majors , The   & \multirow{2}{*}{British Army was divided}   & 11.2146 \\
    $\bxa$: REQU Apart British received reformsMilitaryestic Division The  & &  \textbf{7.1899} \\
    \midrule
    $\bx$~: The founding director , Peggy Loar , left   & \multirow{2}{*}{the University of California}   & 7.2669\\
    $\bxa$: tested UberERIC definitionCalifornia sustainability RutgersOL Jensen regarding  & &  \textbf{6.4402} \\
    \midrule
    $\bx$~: Ruiz notes that writing also has the power & \multirow{2}{*}{to change the world} & 5.9135 \\
    $\bxa$: Report Global feminism agenda Representatives tell Sacredixties Trying & & \textbf{4.6041} \\
    \bottomrule
    \end{tabular}
    }
\end{table}

Unlike prior work that uses suffix generation or language model inversion as offensive tools, our research seeks to understand and mitigate these vulnerabilities.
In particular, we study ``evil twin'' prompts as defined in~\cite{melamed2024prompts,rakotonirina2025evil}. Given a text prompt $\bx$ and its completion $\by$, we optimize for a new, nonsensical prompt $\bxa$, the ``evil twin'', such that $\Loss(\bxa,\by;\net) \ll \Loss(\bx,\by;\net)$, where $\Loss$ is the next-token prediction loss of the LLM and $\net$ denotes its parameters. These $\bxa$ are syntactically implausible, out-of-distribution inputs that nevertheless lead to the same output, as illustrated in Table~\ref{tab:evil_twins_examples}. Despite producing identical continuations, $\bx$ and $\bxa$ induce notably different entropy distributions.
These prompts are also fragile: small changes typically break the adversarial effect, which highlights a key vulnerability in LLM robustness and alignment.
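
The search for an evil twin can be summarized as the following optimization problem. This is a sketch under our own assumptions rather than the exact formulation of the cited works: $\mathcal{V}$ denotes the vocabulary, $n$ the prompt length (assumed equal for $\bx$ and $\bxa$), $y_1,\dots,y_m$ the tokens of $\by$, and the loss is assumed to be averaged over the output tokens.
\[
\Loss(\bx,\by;\net) = -\frac{1}{m}\sum_{j=1}^{m}\log p_{\net}(y_j \mid \bx, y_{<j}),
\qquad
\bxa \in \arg\min_{\bx' \in \mathcal{V}^{n}} \Loss(\bx',\by;\net).
\]
Because the minimization ranges over all token sequences of length $n$, including $\bx$ itself, an exact minimizer never has a higher loss than the original prompt. In practice the problem is solved only approximately, with a discrete search such as GCG.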


To address evil twin prompts, we propose Inverse Language Modeling (ILM), a novel training framework that improves LLM robustness. ILM enables both forward modeling and partial inversion, encouraging the model not only to generate fluent output but also to remain sensitive to input semantics.
