Keywords: interpretability, safety, automated interpretability, ai safety, explainability, extraction, tracr, rasp
Abstract: Previous work has demonstrated that in some settings, the mechanisms implemented by small neural networks can be reverse-engineered.
However, these efforts rely on human labor that does not easily scale.
To investigate a potential avenue towards scalable interpretability, we show it is possible to use \emph{meta-models}, neural networks that take another network's parameters as input, to learn a mapping from transformer weights to human-readable code.
We build on RASP and Tracr to synthetically generate transformer weights that implement known programs, then train a transformer to extract RASP programs from weights.
Our trained decompiler extracts algorithms from model weights, reconstructing a fully correct RASP program 60% of the time.
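To make the data-generation step concrete, the sketch below shows how a single (weights, program) training pair might be produced with the open-source tracr compiler. The particular program, vocabulary, and sequence length are illustrative assumptions, not the paper's actual dataset configuration.

```python
# A minimal, illustrative sketch of producing one (weights, program) pair with
# the open-source tracr library. Program, vocab, and max_seq_len are
# placeholder choices, not the paper's dataset settings.
from tracr.compiler import compiling
from tracr.rasp import rasp

# A known RASP program: fraction of tokens so far that equal "x".
is_x = rasp.numerical(rasp.Map(lambda t: t == "x", rasp.tokens))
prevs = rasp.Select(rasp.indices, rasp.indices, rasp.Comparison.LEQ)
frac_x = rasp.numerical(rasp.Aggregate(prevs, is_x, default=0))

# Compile the program into concrete transformer weights.
model = compiling.compile_rasp_to_model(
    frac_x,
    vocab={"x", "y"},
    max_seq_len=8,
    compiler_bos="BOS",
)

# model.params holds the compiled transformer weights; paired with the source
# RASP program, this is one supervised example (input: weights, target: code)
# of the kind a meta-model could be trained on.
weights = model.params
print(model.apply(["BOS", "x", "y", "x"]).decoded)
```

Repeating this over many sampled RASP programs yields the synthetic weight/program pairs on which an extraction transformer can then be trained.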
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7595