Keywords: Safety Alignment
Abstract: Reasoning-based methods have emerged to overcome the limitations of 'shallow alignment' by exposing the model's Chain-of-Thought (CoT), enabling auditability through both training-phase supervision and post-generation verification. However, this transparency creates a critical vulnerability, a tension we define as the \textbf{Security-Auditability Dilemma}: the very mechanism that exposes the model's safety reasoning for auditing also leaks harmful information and opens a vulnerable attack surface for adaptive attacks. To address this, we propose \textbf{Auditable Latent CoT Alignment (ALCA)}, a framework that decouples internal reasoning from external output. ALCA shifts the safety deliberation process into a continuous latent space, rendering it opaque to adversaries. Yet this process is not a black box: we introduce a \textbf{Self-Decoding} mechanism that allows the model to reconstruct its latent reasoning into human-readable text for supervisory auditing. Extensive experiments show that ALCA achieves robust alignment, reducing the success rate of adaptive jailbreak attacks by over 54\% compared to strong baselines while preserving performance. Our framework presents a path toward building LLMs that are both robustly secure and auditable.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24035