Understanding how skill shapes decision-making in complex environments is a challenging problem in AI interpretability. We investigate this question by applying Sparse Autoencoders (SAEs) to the internal representations of Maia-2, a human-like chess model that simulates human play across varying skill levels. Maia-2 incorporates a skill-aware transformer that integrates position features with categorical skill inputs, capturing nuanced relationships between player expertise and move selection. By training SAEs on these skill-modulated representations, we identify latent features that reveal how the model's threat-response policy adapts to different levels of play. We then use these features to intervene on Maia-2's internal activations, eliciting both higher-skill and lower-skill play in specific contexts. We also apply mediated interventions with targeted SAE features to effectively enhance or sabotage the model's understanding and decision-making on context-specific chess tasks. Our findings suggest that SAE features can shed light on how skill-specific information is encoded within a model to produce human-like behavior, and that these insights can be used to steer the model's performance on specific sub-tasks. Our code is available at \url{https://anonymous.4open.science/r/chess-sae-3C06/}.
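To make the two steps the abstract describes concrete, here is a minimal sketch, not the authors' implementation, of (1) a sparse autoencoder trained on a model's hidden activations and (2) an intervention that rescales one learned SAE feature before decoding back into activation space. The dimensions, the L1 coefficient, and the choice of feature index are hypothetical placeholders; Maia-2's actual architecture and the paper's mediated-intervention procedure are not specified here.

```python
# Illustrative sketch only: a standard ReLU SAE with an L1 sparsity
# penalty, plus a feature-level intervention. All names and
# hyperparameters are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps feature activations non-negative and sparse.
        return F.relu(self.encoder(x))

    def forward(self, x: torch.Tensor):
        f = self.encode(x)
        return self.decoder(f), f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages
    # only a few features to fire per activation vector.
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(dim=-1).mean()

def intervene(sae: SparseAutoencoder, x: torch.Tensor,
              feature_idx: int, scale: float) -> torch.Tensor:
    # Amplify (scale > 1) or suppress (scale < 1) one SAE feature,
    # e.g. a hypothesized skill- or threat-related feature, then
    # decode back into the model's activation space. The patched
    # activations would replace the originals in the forward pass.
    f = sae.encode(x)
    f[..., feature_idx] = f[..., feature_idx] * scale
    return sae.decoder(f)
```

In a steering setup like the one summarized above, the decoded output of `intervene` would be substituted for the original hidden activations at the chosen layer, letting one test whether scaling a single feature shifts the model toward higher- or lower-skill move selection.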