Understanding how skill shapes decision-making in complex environments is a challenging problem in AI interpretability. We investigate this question by applying Sparse Autoencoders (SAEs) to the internal representations of Maia-2, a human-like chess model that simulates human play across varying skill levels. Maia-2 incorporates a skill-aware transformer that integrates position features with categorical skill inputs, capturing nuanced relationships between player expertise and move selection. By training SAEs on these skill-modulated representations, we identify latent features that reveal how the model's threat-response policy adapts to different levels of play. We then use these features to intervene on Maia-2's internal activations, eliciting both higher-skill and lower-skill play in specific contexts. We also apply mediated interventions with targeted SAE features to effectively enhance or sabotage the model's understanding and decision-making on context-specific chess tasks. Our findings suggest that SAE features can shed light on how skill-specific information is encoded within a model to produce human-like behavior, and that these insights can be used to steer the model's performance on specific sub-tasks. Our code is available at \url{https://anonymous.4open.science/r/chess-sae-3C06/}.
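To make the two steps the abstract describes concrete, here is a minimal sketch, not the authors' implementation, of (1) a sparse autoencoder trained on a model's hidden activations and (2) an intervention that rescales one learned SAE feature before decoding back into activation space. The dimensions, the L1 coefficient, and the choice of feature index are hypothetical placeholders; Maia-2's actual architecture and the paper's mediated-intervention procedure are not specified here.

```python
# Illustrative sketch only: a standard ReLU SAE with an L1 sparsity
# penalty, plus a feature-level intervention. All names and
# hyperparameters are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps feature activations non-negative and sparse.
        return F.relu(self.encoder(x))

    def forward(self, x: torch.Tensor):
        f = self.encode(x)
        return self.decoder(f), f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages
    # only a few features to fire per activation vector.
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(dim=-1).mean()

def intervene(sae: SparseAutoencoder, x: torch.Tensor,
              feature_idx: int, scale: float) -> torch.Tensor:
    # Amplify (scale > 1) or suppress (scale < 1) one SAE feature,
    # e.g. a hypothesized skill- or threat-related feature, then
    # decode back into the model's activation space. The patched
    # activations would replace the originals in the forward pass.
    f = sae.encode(x)
    f[..., feature_idx] = f[..., feature_idx] * scale
    return sae.decoder(f)
```

In a steering setup like the one summarized above, the decoded output of `intervene` would be substituted for the original hidden activations at the chosen layer, letting one test whether scaling a single feature shifts the model toward higher- or lower-skill move selection.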