Adaptive Transformer Programs: Bridging the Gap Between Performance and Interpretability in Transformers

Quoc-Vinh Lai-Dang; Taemin Kang; Seungah Son

Adaptive Transformer Programs: Bridging the Gap Between Performance and Interpretability in Transformers

Quoc-Vinh Lai-Dang, Taemin Kang, Seungah Son

Published: 22 Jan 2025, Last Modified: 28 Feb 2025ICLR 2025 PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Mechanistic Interpretability, Transformers, Interpretable AI

Abstract: Balancing high performance with interpretability in increasingly powerful Transformer-based models remains a challenge. While mechanistic interpretability aims to specify neural network computations in explicit, pseudocode-like formats, existing methods often involve laborious manual analysis or struggle to fully elucidate learned internal algorithms. Recent efforts to build intrinsically interpretable models have introduced considerable expressivity and optimization challenges. This work introduces Adaptive Transformer Programs, an enhanced framework building upon RASP language and Transformer Programs to create more robust and interpretable models. The proposed method increases expressivity by redesigning two primary attention modules to improve categorical and numerical reasoning capabilities. To overcome optimization hurdles, we introduce a novel reparameterization scheme that enhances the exploration-exploitation trade-off during training. We validate our approach through extensive experiments on diverse tasks, including in-context learning, algorithmic problems (e.g., sorting and Dyck languages), and NLP benchmarks such as named entity recognition and text classification. Results demonstrate that Adaptive Transformer Programs substantially narrow the performance gap between black-box Transformers and interpretable models, enhancing transparency. This work advances the development of high-performing, transparent AI systems for critical applications, addressing crucial ethical concerns in AI development.

Primary Area: interpretability and explainable AI

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 6447

Loading