LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation

Published: 23 Sept 2025, Last Modified: 09 Oct 2025 · RegML 2025 Poster · CC BY 4.0
Keywords: Large Language Models, Safety Alignment, Variational Autoencoder, Latent Space Steering
TL;DR: LatentGuard uses structured VAEs to learn interpretable adversarial features in LLM representations, enabling controllable safety refusal while preserving utility.
Abstract: Achieving robust safety alignment in large language models (LLMs) while preserving their utility remains a fundamental challenge. Existing approaches often struggle to balance comprehensive safety with fine-grained controllability at the representation level. We introduce LatentGuard, a novel three-stage framework that combines reasoning-aware behavioral alignment with supervised latent space control for interpretable and precise safety steering. Our approach first fine-tunes an LLM on rationalized datasets containing both reasoning-enhanced refusals to adversarial prompts and compliant responses to benign queries, establishing robust behavioral priors for safety-critical and utility-preserving scenarios. We then train a structured variational autoencoder (VAE) on intermediate MLP activations, supervised by multi-label annotations including attack types, attack methods, and benign indicators. This structured supervision enables the VAE to learn disentangled and semantically interpretable latent dimensions that capture distinct safety-relevant factors. By selectively manipulating these latent dimensions, LatentGuard achieves controlled refusal behaviors—effectively mitigating harmful requests while maintaining appropriate responsiveness to legitimate ones. Comprehensive experiments on Qwen3-8B demonstrate statistically significant gains in both safety controllability and interpretability without degrading model utility. Cross-architecture evaluation on Mistral-7B further supports the robustness and transferability of our approach. While code and models are not publicly released due to potential misuse concerns, we provide detailed methodological descriptions to support reproducibility. Overall, our results highlight structured representation-level intervention as a practical and transparent pathway toward safer, more controllable LLMs.
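Code and models are withheld by the authors, but the representation-level intervention described above can be pictured with a minimal, hypothetical sketch: a structured VAE encodes an intermediate MLP activation, a few latent coordinates assumed to correspond to supervised attack-related factors are shifted toward refusal, and the decoded activation is written back into the forward pass. All names, dimensions, and the steering strength below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (not the authors' released code): steering an LLM's
# intermediate MLP activation through a structured VAE whose latent dimensions
# are assumed to be supervised by safety labels (attack type / method / benign).
import torch
import torch.nn as nn


class StructuredVAE(nn.Module):
    """Toy VAE over hidden activations; individual latent dimensions are
    assumed to align with labelled safety factors after supervised training."""

    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(hidden_dim, 512), nn.GELU())
        self.to_mu = nn.Linear(512, latent_dim)
        self.to_logvar = nn.Linear(512, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.GELU(), nn.Linear(512, hidden_dim)
        )

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        # Use the posterior mean at intervention time (no sampling needed).
        return self.to_mu(self.encoder(h))

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)


def steer_activation(
    vae: StructuredVAE,
    h: torch.Tensor,
    refusal_dims: list[int],
    strength: float = 3.0,
) -> torch.Tensor:
    """Shift the latent coordinates tied to harmful-intent factors toward the
    refusal direction, then decode back into activation space."""
    z = vae.encode(h)
    z[..., refusal_dims] += strength  # hypothetical sign/scale; tuned on held-out data
    return vae.decode(z)


if __name__ == "__main__":
    hidden_dim, latent_dim = 4096, 64       # assumed sizes for a Qwen3-8B-like MLP
    vae = StructuredVAE(hidden_dim, latent_dim)
    h = torch.randn(1, hidden_dim)          # stand-in for a captured MLP activation
    h_steered = steer_activation(vae, h, refusal_dims=[3, 17])
    print(h_steered.shape)                  # in practice the steered activation would be
                                            # written back into the forward pass via a hook
```

In a full pipeline one would presumably register a forward hook on the chosen MLP layer so that `steer_activation` replaces the activation at inference time, leaving benign-indicator dimensions untouched to preserve utility.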
Submission Number: 3