Mechanistic Interpretability for AI Safety - A Review

TMLR Paper 2555 Authors

19 Apr 2024 (modified: 21 Jun 2024) · Under review for TMLR · CC BY-SA 4.0
Abstract: As artificial intelligence (AI) systems rapidly advance, understanding their inner workings is crucial for ensuring alignment with human values and safety. This review explores mechanistic interpretability, which aims to reverse-engineer the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts, focusing on a granular, causal understanding of how AI models operate. We establish foundational concepts, including features as units encoding knowledge within neural activations and hypotheses surrounding their representation and computation. We survey methodologies for causally dissecting model behaviors and assess the relevance of mechanistic interpretability to AI safety. We examine its benefits for understanding, control, and alignment, while discussing risks such as capability gains and dual-use concerns, and we address the challenges of scalability, automation, and comprehensive understanding. We advocate for future work clarifying core concepts, setting rigorous standards, scaling up techniques to handle complex models and behaviors, and expanding the scope to domains like vision and reinforcement learning.
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~antonio_vergari2
Submission Number: 2555