Forgive and Forget to Create Robust, Interpretable Models

20 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · License: CC BY 4.0
Keywords: Mechanistic Interpretability, Conceptual Interpretability, Transformer Circuits, Representation Geometry, Multilingual Models, Activation Clustering, Model Transparency, Concept Subspaces
TL;DR: We develop a trio of structural mechanisms that impose a controllable, geometric order on the weights and activations of language models and reduce friction in model interpretation.
Abstract: Achieving internal transparency is a key challenge in the development of machine learning models. Rather than trying to interpret a model's internal structure after the fact, our approach makes that structure more interpretable by construction. To this end, we introduce a trio of mechanisms that act on the FFNs of mT5 and the channel-mixing layers of RWKV to produce similar outcomes: Proximal Forgetfulness, which treats weights spatially and forces them into clusters of similar magnitude; Forgiveness, which rewards near-correct predictions to shape the model's internal structure and training progression; and Fuzzy Recall, which shifts activations into related bands. In combination, these mechanisms dramatically transform a model's internal topology in a controllable manner without compromising the performance of pretrained networks. The changes also make the model highly resilient to noise and spatial perturbations. We show that the modified internal topology depends more on the loss function than on the specific model architecture, and that it can be crystallized, if desired, when changing tasks. With this new structure in place, internal token pathways can be represented with encouraging accuracy by a series of spatial centers and magnitudes. This is achieved without a sparse autoencoder and could open the door to simplified control and interpretation in the future.
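The abstract only names the three mechanisms, so the following is a minimal, speculative sketch of how they might be realized in PyTorch. All function names, hyperparameters (lam, k, band_width, alpha), and design choices below are illustrative assumptions; the paper's actual formulations may differ substantially.

```python
import torch
import torch.nn.functional as F

def proximal_forgetfulness(weight: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    # Treat the FFN weight matrix as a 2-D "map" and penalize magnitude
    # differences between spatially adjacent entries, nudging neighboring
    # weights toward clusters of similar magnitude (one reading of the abstract).
    mag = weight.abs()
    d_rows = (mag[1:, :] - mag[:-1, :]).pow(2).mean()
    d_cols = (mag[:, 1:] - mag[:, :-1]).pow(2).mean()
    return lam * (d_rows + d_cols)

def forgiveness_loss(logits: torch.Tensor, target: torch.Tensor, k: int = 5) -> torch.Tensor:
    # A guess at "rewarding close predictions": halve the cross-entropy
    # penalty whenever the target token already appears in the model's top-k.
    ce = F.cross_entropy(logits, target, reduction="none")
    topk = logits.topk(k, dim=-1).indices                 # (batch, k)
    close = (topk == target.unsqueeze(-1)).any(dim=-1)    # target in top-k?
    return ((1.0 - 0.5 * close.float()) * ce).mean()

def fuzzy_recall(x: torch.Tensor, band_width: float = 0.25, alpha: float = 0.5) -> torch.Tensor:
    # Softly pull each activation toward the center of its magnitude band,
    # so related activations collapse into shared bands.
    centers = torch.round(x / band_width) * band_width
    return x + alpha * (centers - x)

# Toy usage: combine all three mechanisms in a single regularized loss.
w = torch.randn(64, 256, requires_grad=True)
h = fuzzy_recall(torch.randn(8, 64))                      # banded activations
logits = h @ w                                            # (8, 256)
target = torch.randint(0, 256, (8,))
loss = forgiveness_loss(logits, target) + proximal_forgetfulness(w)
loss.backward()
```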
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 23567