A Safety Case for a Deployed LLM: Corrigibility as a Singular Target

24 Jun 2025 (modified: 01 Jul 2025) · ODYSSEY 2025 Conference Submission · CC BY 4.0
Keywords: AI Safety, Corrigibility, Alignment, Safety Case, AI Debate, Scalable Oversight, Prover-Estimator, LLM
TL;DR: A safety case for a corrigible large language model trained via Prover-Estimator debate, based on the Corrigibility-as-Singular-Target (CAST) strategy, that explicitly confronts the limitations of debate-based alignment.
Abstract: This document presents a detailed safety case for deploying a highly capable LLM for real-world action under the guidance of a trusted principal. The system is trained according to the Corrigibility-as-Singular-Target (CAST) strategy, using a Prover-Estimator debate framework. This process instills a singular behavioral objective: the agent is incentivized to view itself as a potentially flawed tool and to proactively empower its principal's oversight and correction. This safety case moves beyond prior work focused on sandboxed environments to confront the challenges of real-world deployment. It argues for a set of top-level claims: that the deployment specifications are adequate, that the agent's error rate is bounded and detectable, that the impact of errors is mitigated, and that these properties are stable over a defined lifetime. The case is presented not as a declaration of safety, but as a structured argument intended for rigorous critique. It explicitly confronts the limitations of AI Debate as a prosaic alignment technique, highlighting where the evidence required is promissory and where the deepest vulnerabilities lie, especially in the face of superhuman capabilities.
Serve As Reviewer: ~Ram_Potham1
Confirmation: I confirm that I and my co-authors have read the policies and are releasing our work under a CC BY 4.0 license.
Submission Number: 3