TL;DR: Sparse autoencoders trace out-of-distribution prompts, such as typos and jailbreaks, to an expansion of active internal concepts, enabling a mechanistically grounded alignment strategy to harden large language models against generalization failures.
Abstract: Pre-trained transformers have demonstrated remarkable generalization abilities, at times extending beyond the scope of their training data. Yet real-world deployments often face unexpected or adversarial data that diverges from the training distribution. Without explicit mechanisms for handling such shifts, model reliability and safety degrade, which calls for a more disciplined study of out-of-distribution (OOD) settings for transformers. Through systematic experiments, we present a mechanistic framework for delineating the contours of transformer robustness. We find that OOD inputs, including subtle typos and jailbreak prompts, drive language models to operate on an increased number of spurious concepts in their internals. We use this observation to quantify and understand the degree of distributional shift in prompts, enabling a mechanistically grounded fine-tuning strategy to robustify LLMs. Extending the notion of OOD from input data to a model's internal computation, which yields a new inference-time diagnostic for transformers, is a critical step toward making AI systems safe for deployment across science, business, and government.
Lay Summary: Large language models (LLMs) often fail or degrade unpredictably when they encounter prompts that differ from their training data, such as prompts containing minor typos or adversarial attacks. Identifying and correcting these internal failure modes is difficult because the inner computational processes of transformers are largely treated as a black box.
This research repurposes Sparse Autoencoders (SAEs) as a diagnostic tool to monitor the internal activation space of language models during inference. SAEs identify human-understandable concepts within the internals of LLMs. We found that atypical inputs force models to process data using an expanded number of unnecessary, spurious concepts. Leveraging this insight, we developed a metric to pinpoint these shifts and implemented a lightweight fine-tuning strategy to align internal representations back toward a normal reference baseline.
This framework provides an interpretable mechanism to audit LLMs in real time, before they generate a response. By suppressing distracting, spurious concepts, it hardens AI systems against adversarial attacks while preserving general reasoning capabilities.
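To make the idea of "an expanded number of active concepts" concrete, the following is a minimal sketch of how such a count could be computed and compared against a reference baseline. It is not the paper's metric, which is not specified here. It assumes a standard ReLU SAE encoder, and the function names (`sae_encode`, `active_concept_count`, `shift_score`), the activation threshold `eps`, and the z-score comparison are all hypothetical choices made for illustration.

```python
import numpy as np

def sae_encode(acts, W_enc, b_enc):
    """Assumed ReLU SAE encoder: (tokens, d_model) -> (tokens, n_features)."""
    return np.maximum(acts @ W_enc + b_enc, 0.0)

def active_concept_count(acts, W_enc, b_enc, eps=1e-6):
    """Mean number of SAE features firing per token (an L0-style count).

    `eps` is an illustrative threshold, not a value from the paper.
    """
    feats = sae_encode(acts, W_enc, b_enc)
    return float((feats > eps).sum(axis=1).mean())

def shift_score(count, ref_counts):
    """Z-score of a prompt's count against counts from reference prompts.

    This comparison is a hypothetical stand-in for the paper's metric.
    """
    ref = np.asarray(ref_counts, dtype=float)
    return float((count - ref.mean()) / (ref.std() + 1e-8))

# Toy example: 2 tokens, 3-dimensional activations, identity encoder.
W_enc = np.eye(3)
b_enc = np.zeros(3)
acts = np.array([[1.0, -1.0, 0.0],
                 [2.0, 3.0, -1.0]])

count = active_concept_count(acts, W_enc, b_enc)  # tokens fire 1 and 2 features
score = shift_score(3.0, [1.0, 2.0, 3.0])         # positive: above the reference mean
```

In this sketch a large positive score would flag a prompt whose internal representation activates more features than typical in-distribution prompts. In practice the activations would come from a chosen layer of the model and the encoder weights from a trained SAE.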
Originally Submitted Supplementary Material: zip
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: robustness, generalization, sparse autoencoders, transformers
Originally Submitted PDF: pdf
Submission Number: 25938