Stay Centered: Semantic Barycenter Alignment for LLM Jailbreak Defense

13 Sept 2025 (modified: 03 Dec 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0
Keywords: Jailbreak, LLM security, Jailbreak defense, LLMs
TL;DR: SBA is a theoretically grounded defense method that redefines intent extraction as an optimal transport (OT) problem over a semantic manifold.
Abstract: Jailbreak defenses in large language models (LLMs) are essential for ensuring model security, maintaining user trust, and supporting the sustainable development of AI applications. However, the most widely adopted defenses, which are based on intent extraction, currently rest on unstable foundations: they depend excessively on the security performance of the target or safety LLM, rely heavily on prompt engineering, and offer limited explainability. These limitations render the defense architecture inherently passive, so it struggles to counter evolving jailbreak attacks. In this paper, we propose Semantic Barycenter Alignment (SBA), a novel defense method grounded in optimal transport (OT) theory. Specifically, we reinterpret intent extraction as a semantic projection task on a latent embedding manifold, mapping each user prompt to its class barycenter, a region that models recognize more easily. In this process, we instruction-tune an LLM to extract input intents and use the Sinkhorn divergence to quantify semantic alignment with target intents, measuring minimal deformation in Wasserstein space. During defense, the intent extractor serves as the upstream stage of the defense pipeline, passing intents aligned with the manifold barycenter to lightweight safety LLMs (e.g., Llama-Guard3-1B) for jailbreak detection. Empirical results show that SBA exhibits zero-shot robustness and steers intent embeddings toward the manifold barycenter of their semantic classes, reducing intra-class variance and simplifying downstream jailbreak detection. Moreover, because it is built on principled OT theory, SBA offers greater interpretability, removes reliance on prompt-specific heuristics, and reduces dependency on downstream LLM performance, providing a more proactive foundation for jailbreak defense. Code: https://anonymous.4open.science/r/SBA-EE42. Warning: This paper may contain content that has the potential to be offensive and harmful.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4921
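
The abstract names the Sinkhorn divergence as the measure of semantic alignment between extracted and target intents. As an illustration only, the sketch below computes a debiased Sinkhorn divergence between two small point clouds using NumPy. It is not taken from the SBA code: the function names, the squared Euclidean cost, the uniform weights, the regularization strength eps, and the toy vectors standing in for intent embeddings are all assumptions made for this example.

```python
import numpy as np


def _logsumexp(x, axis):
    # Numerically stable log(sum(exp(x))) along one axis.
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def entropic_ot(x, y, eps=0.1, n_iters=200):
    """Entropy-regularized OT cost between two uniformly weighted point clouds.

    Assumptions (not from the paper): cost is 0.5 * squared Euclidean distance,
    both clouds carry uniform weights, and a fixed iteration count is used.
    """
    n, m = len(x), len(y)
    log_a = np.full(n, -np.log(n))  # log of uniform weights on x
    log_b = np.full(m, -np.log(m))  # log of uniform weights on y
    cost = 0.5 * np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)

    f = np.zeros(n)  # dual potential attached to x
    g = np.zeros(m)  # dual potential attached to y
    for _ in range(n_iters):
        # Log-domain Sinkhorn updates (alternating dual ascent).
        f = -eps * _logsumexp(log_b[None, :] + (g[None, :] - cost) / eps, axis=1)
        g = -eps * _logsumexp(log_a[:, None] + (f[:, None] - cost) / eps, axis=0)

    # Dual objective evaluated right after the g update.
    return float(np.sum(np.exp(log_a) * f) + np.sum(np.exp(log_b) * g))


def sinkhorn_divergence(x, y, eps=0.1, n_iters=200):
    # Debiased form: OT_eps(x, y) - 0.5 * OT_eps(x, x) - 0.5 * OT_eps(y, y).
    return (
        entropic_ot(x, y, eps, n_iters)
        - 0.5 * entropic_ot(x, x, eps, n_iters)
        - 0.5 * entropic_ot(y, y, eps, n_iters)
    )


if __name__ == "__main__":
    # Hypothetical toy "embeddings"; real intent embeddings would come from a model.
    extracted = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    target = extracted + 3.0
    print(sinkhorn_divergence(extracted, target))
```

In a pipeline like the one the abstract describes, a lower divergence would indicate that the extracted intents sit closer to the target intents in embedding space. How SBA actually builds the cost, weights, and training objective is not specified in this summary.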