Keywords: Mechanistic Interpretability, Substrate, Agent foundations, Deep deceptiveness, Robust agent-agnostic processes
TL;DR: Causal--mechanistic paradigm in mechanistic interpretability. Why it might fail as intelligence scales. Supplementary framework, MoSSAIC, to address it.
Abstract: We identify a causal--mechanistic paradigm in AI safety, primarily through the example of mechanistic interpretability. Recent results suggest limits to this paradigm's utility in answering questions about the safety of neural networks, and we argue further that those results give a taste of what is to come: we propose a sequence of scenarios in which safety affordances based upon the causal--mechanistic paradigm break down. This analysis conceptually connects current obfuscation results with some of MIRI's more pessimistic threat models (e.g., deep deceptiveness, robust agent-agnostic processes) and suggests how we might unify them under a common framework. We then introduce a supplementary framework, MoSSAIC (Management of Substrate-Sensitive AI Capabilities), which addresses some of the core assumptions underlying the causal--mechanistic paradigm, and sketch the complementary research infrastructure we are currently designing to keep pace with evasive intelligence.
Serve As Reviewer: ~Aditya_Arpitha_Prasad1
Confirmation: I confirm that I and my co-authors have read the policies and are releasing our work under a CC-BY 4.0 license.
Submission Number: 8