Beyond Analysis: Training Language Models with Internal Mechanistic Feedback

ACL ARR 2026 January Submission 10765 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Mechanistic Interpretability, Feature-Space Regularization, Circuit Attribution, Sparse Features, Training Optimization
Abstract: Recent advances in mechanistic interpretability have revealed how language models process information, yet these insights rarely improve model performance. We propose Interpretable Feature-Space Regularization (IFSR), a training-time framework that transforms mechanistic insights into optimization signals. IFSR identifies error-prone feature interactions through circuit attribution and penalizes them during training, encouraging the model to discover alternative computational pathways. Unlike inference-time interventions that operate on individual features, IFSR targets feature-to-feature edges representing internal computational patterns and permanently encodes improvements into model parameters. Experiments across ten classification tasks show consistent improvements. Cross-task evaluation demonstrates that IFSR training can transfer positively to unrelated tasks, suggesting benefits for general model capabilities. Our analysis reveals that most identified error patterns resist human interpretation, yet penalizing them still improves performance, suggesting that automatic error identification at the level of feature interactions is feasible and effective. This work demonstrates that mechanistic interpretability can directly enhance task performance through training-time optimization.
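To make the abstract's core mechanism concrete, below is a minimal PyTorch sketch of a training-time penalty on feature-to-feature edges. It assumes sparse feature activations are already available at two points in the model (e.g. via sparse autoencoders) and that circuit attribution has produced a set of error-prone edge indices. The names ifsr_penalty and error_edges, and the use of co-activation as the edge-strength proxy, are illustrative assumptions, not the authors' implementation.

import torch

def ifsr_penalty(src_feats: torch.Tensor,
                 dst_feats: torch.Tensor,
                 error_edges: list[tuple[int, int]],
                 weight: float = 0.1) -> torch.Tensor:
    """Regularizer over feature-to-feature edges (hypothetical sketch).

    src_feats, dst_feats: (batch, n_features) sparse feature activations
        read out at two points in the model.
    error_edges: (i, j) index pairs that circuit attribution has linked
        to task errors.
    """
    penalty = src_feats.new_zeros(())
    for i, j in error_edges:
        # Approximate edge strength as the co-activation of the two features;
        # penalizing it nudges the model toward alternative pathways.
        penalty = penalty + (src_feats[:, i] * dst_feats[:, j]).abs().mean()
    return weight * penalty

# Toy usage: the penalty is simply added to the ordinary task loss,
# so the improvement is baked into the parameters by gradient descent.
src = torch.relu(torch.randn(8, 512, requires_grad=True))
dst = torch.relu(torch.randn(8, 512, requires_grad=True))
task_loss = torch.tensor(0.0)  # stands in for e.g. cross-entropy
total_loss = task_loss + ifsr_penalty(src, dst, [(3, 17), (40, 201)])
total_loss.backward()

Because the penalty is part of the training objective rather than an inference-time edit, the discouraged edges stay suppressed after training, which matches the abstract's claim that improvements are permanently encoded into model parameters.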
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: Mechanistic Interpretability, Feature-Space Regularization, Circuit Attribution, Sparse Features, Training Optimization
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 10765