Keep CALM and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers

ICLR 2026 Conference Submission 22134 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: AI alignment, LLM safety, Inference-time interventions, Concept whitening, Latent space manipulation, Interpretability and controllability of LLMs
TL;DR: CALM is an inference-time method that improves LLM safety by modifying latent representations with concept whitening and projection, requiring no retraining. It complements guardrails with a lightweight, low-overhead approach.
Abstract: Large language models are susceptible to jailbreak attacks that bypass built-in safety guardrails (e.g., by tricking the model with adversarial prompts). We propose Concept Alignment and Latent Manipulation (CALM), an inference-time method that suppresses harmful concepts by modifying latent representations in the last layer of the model, without retraining. Leveraging the concept whitening technique from computer vision, combined with orthogonal projection, CALM removes unwanted latent directions associated with harmful content while preserving model performance. Experiments show that CALM reduces harmful outputs and outperforms baseline methods on most metrics, offering a lightweight approach to AI safety that requires no additional training data or model fine-tuning and incurs only a small computational overhead at inference.
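To make the projection step concrete, below is a minimal sketch of how unwanted latent directions could be removed from last-layer hidden states at inference time. This is an illustration of orthogonal projection under stated assumptions, not the authors' implementation: it omits the concept-whitening stage, and the names `project_out`, `make_hook`, and `concept_dirs` (a hypothetical matrix of learned harmful-concept directions) are illustrative.

```python
import torch

def project_out(hidden_states: torch.Tensor, concept_dirs: torch.Tensor) -> torch.Tensor:
    """Remove the components of `hidden_states` lying in the span of `concept_dirs`.

    hidden_states: (..., d) last-layer activations.
    concept_dirs:  (k, d) directions assumed to encode harmful concepts (hypothetical).
    """
    # Orthonormalize the concept directions so the projection is well defined.
    q, _ = torch.linalg.qr(concept_dirs.T)        # (d, k), columns orthonormal
    # Project onto the harmful subspace, then subtract (keep the orthogonal complement).
    coeffs = hidden_states @ q                    # (..., k)
    return hidden_states - coeffs @ q.T           # (..., d)

def make_hook(concept_dirs: torch.Tensor):
    """Forward hook that applies the projection to a decoder layer's output."""
    def hook(module, inputs, output):
        hs = output[0] if isinstance(output, tuple) else output
        hs = project_out(hs, concept_dirs.to(device=hs.device, dtype=hs.dtype))
        return (hs, *output[1:]) if isinstance(output, tuple) else hs
    return hook

# Hypothetical usage with a Hugging Face-style causal LM, hooking the final block
# so the intervention runs only at inference time, with no retraining:
# model.model.layers[-1].register_forward_hook(make_hook(concept_dirs))
```

In this sketch the projection is a single matrix operation per forward pass, which is consistent with the paper's claim of only a small computational overhead at inference.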
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22134