Keywords: Skill Adaptation, Chess, Sparse Autoencoders, Mechanistic Interpretability
Abstract: Understanding how skill shapes decision-making in complex environments is a challenging problem in AI interpretability. We investigate this question by applying Sparse Autoencoders (SAEs) to the internal representations of Maia-2, a human-like chess model that simulates human play across varying skill levels. Maia-2 incorporates a skill-aware transformer that integrates position features with categorical skill inputs, capturing nuanced relationships between player expertise and move selection. By training SAEs on these modulated representations, we identify latent features that reveal how the model's threat response policy adapts to different levels of play. We then use these features to intervene on the internal activations of Maia-2, eliciting both higher skill and lower skill play in specific contexts. We also apply mediated intervention with targeted SAE features to effectively enhance and sabotage the model's understanding and decision-making on context-specific chess tasks. Our findings suggest that SAE features can help shed light on how skill-specific information is encoded within a model to produce human-like behavior, and that these insights can be applied to steer the model's performance on specific sub-tasks. Our work is available at \url{https://anonymous.4open.science/r/chess-sae-3C06/}
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13473
Loading