Mechanistic Interpretability for AI Safety - A Review

Published: 09 Sept 2024, Last Modified: 17 Sept 2024. Accepted by TMLR. License: CC BY 4.0
Authors that are also TMLR Expert Reviewers: ~Stratis_Gavves1
Abstract: Understanding AI systems' inner workings is critical for ensuring value alignment and safety. This review explores mechanistic interpretability: reverse engineering the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts to provide a granular, causal understanding. We establish foundational concepts such as features encoding knowledge within neural activations and hypotheses about their representation and computation. We survey methodologies for causally dissecting model behaviors and assess the relevance of mechanistic interpretability to AI safety. We examine benefits in understanding, control, and alignment, as well as risks such as capability gains and dual-use concerns. We investigate challenges surrounding scalability, automation, and comprehensive interpretation. We advocate for clarifying concepts, setting standards, scaling techniques to handle complex models and behaviors, and expanding to domains such as vision and reinforcement learning. Mechanistic interpretability could help prevent catastrophic outcomes as AI systems become more powerful and inscrutable. For an HTML version of the paper, visit https://leonardbereska.github.io/blog/2024/mechinterpreview/.
Certifications: Survey Certification, Expert Certification
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~antonio_vergari2
Submission Number: 2555