Concept Concentration for Faithful Representation Intervention

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: LLMs, Representation Intervention, Safety Alignment
Abstract: Representation intervention aims to locate and modify the representations that encode underlying concepts in Large Language Models (LLMs) in order to elicit aligned, expected behaviors. Despite its empirical success, it has never been examined whether one can locate the faithful concepts for intervention. In this work, we explore this question in safety alignment. If the interventions are faithful, the intervened LLMs should erase the harmful concepts and be robust to both in-distribution adversarial prompts and \textit{out-of-distribution} (OOD) jailbreaks. While it is feasible to erase harmful concepts without degrading the benign functionalities of LLMs in linear settings, we show that it is \textit{infeasible} in the general non-linear setting. To tackle this issue, we propose \texttt{Concept Concentration} (\texttt{COCA}). Instead of identifying the faithful locations to intervene, \texttt{COCA} refactors the training data with an explicit reasoning process, which first identifies the potentially unsafe concepts and then decides the response. Essentially, \texttt{COCA} simplifies the decision boundary between harmful and benign representations, enabling more effective linear erasure. Extensive experiments with multiple representation intervention methods and model architectures demonstrate that \texttt{COCA} significantly reduces both in-distribution and OOD jailbreak success rates while maintaining strong performance on regular tasks such as math and code generation.
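The abstract does not describe the intervention mechanics, but the "linear erasure" it references is commonly realized by projecting hidden representations onto the subspace orthogonal to an estimated concept direction. The sketch below illustrates that operation only; the function and variable names are hypothetical and not taken from the paper.

```python
import numpy as np

def erase_concept(reps: np.ndarray, concept_dir: np.ndarray) -> np.ndarray:
    """Remove the component of each representation along a concept direction.

    reps: (n, d) matrix of hidden representations.
    concept_dir: (d,) vector estimated to encode the (e.g., harmful) concept.
    """
    d = concept_dir / np.linalg.norm(concept_dir)  # unit concept direction
    # Subtract each row's projection onto d (rank-1 orthogonal projection).
    return reps - np.outer(reps @ d, d)

# Toy check: after erasure, representations carry no component along the direction.
rng = np.random.default_rng(0)
reps = rng.normal(size=(4, 8))
direction = rng.normal(size=8)
erased = erase_concept(reps, direction)
print(np.allclose(erased @ (direction / np.linalg.norm(direction)), 0.0))
```

Under the paper's framing, such a projection suffices only when harmful and benign representations are linearly separable, which is the condition \texttt{COCA}'s data refactoring is designed to encourage.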
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 11789