Keywords: Safety Alignment
Abstract: Reasoning-based methods have emerged to overcome the limitations of 'shallow alignment' by exposing the model's Chain-of-Thought (CoT), enabling auditability through both training-phase supervision and post-generation verification. However, this transparency creates a critical vulnerability, a tension we define as the \textbf{Security-Auditability Dilemma}: the very mechanism that exposes the model's safety reasoning for auditing also leaks harmful information and opens a vulnerable attack surface for adaptive attacks. To address this, we propose \textbf{Auditable Latent CoT Alignment (ALCA)}, a framework that decouples internal reasoning from external output. ALCA shifts the safety deliberation process into a continuous latent space, rendering it opaque to adversaries. Yet this process is not a black box: we introduce a \textbf{Self-Decoding} mechanism that allows the model to reconstruct its latent reasoning into human-readable text for supervisory auditing. Extensive experiments show that ALCA achieves robust alignment, reducing the success rate of adaptive jailbreak attacks by over 54\% compared to strong baselines while preserving performance. Our framework presents a path toward building LLMs that are both robustly secure and auditable.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24035