Interpretability as Alignment: Making Internal Understanding a Design Principle

Published: 03 Nov 2025, Last Modified: 05 Dec 2025
EurIPS 2025 Workshop PAIG Poster
License: CC BY 4.0
Keywords: Mechanistic Interpretability, Causal Abstraction, Private AI Governance, Transparency-by-Design, Accountability Infrastructure, Audit Hooks, Provenance Tracking, Assurance Evidence, Governance-Ready AI
Abstract: Frontier AI systems require governance mechanisms that can verify internal alignment, not just behavioral compliance. Private governance mechanisms (audits, certification, insurance, and procurement) are emerging to complement public regulation, but they require technical substrates that generate verifiable causal evidence about model behavior. This paper argues that mechanistic interpretability provides this substrate. We frame interpretability not as post-hoc explanation but as a design constraint that embeds auditability, provenance, and bounded transparency within model architectures. Integrating causal abstraction theory and empirical benchmarks such as MIB and LoBOX, we outline how interpretability-first models can underpin private assurance pipelines and role-calibrated transparency frameworks. This reframing situates interpretability as infrastructure for private AI governance, bridging the gap between technical reliability and institutional accountability.
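To make the notion of architecture-embedded audit hooks concrete, the following is a minimal illustrative sketch (not taken from the paper) of how activation-level provenance records might be collected for later assurance review, assuming a PyTorch-style model; the `AuditLogger` class and its record format are hypothetical.

```python
# Illustrative sketch only: an "audit hook" recording hashed, per-layer activation
# summaries with provenance metadata, assuming a PyTorch model. The AuditLogger
# name and record fields are hypothetical, not the paper's method.
import hashlib
import json
import time

import torch
import torch.nn as nn


class AuditLogger:
    """Collects tamper-evident activation summaries so an external auditor can
    verify which internal components were active for a given input."""

    def __init__(self):
        self.records = []

    def hook(self, name):
        def _record(module, inputs, output):
            # Hash a detached copy of the activation; the log stores digests and
            # coarse statistics rather than raw activations (bounded transparency).
            data = output.detach().cpu().numpy().tobytes()
            self.records.append({
                "layer": name,
                "sha256": hashlib.sha256(data).hexdigest(),
                "mean_activation": float(output.detach().mean()),
                "timestamp": time.time(),
            })
        return _record


# Attach audit hooks to every submodule of a toy model.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
logger = AuditLogger()
for name, module in model.named_modules():
    if name:  # skip the top-level container
        module.register_forward_hook(logger.hook(name))

model(torch.randn(1, 8))
print(json.dumps(logger.records, indent=2))
```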
Submission Number: 23