Keywords: Mechanistic Interpretability, Language Models, Chess, World Models
TL;DR: This work investigates how a GPT-2-style transformer trained on chess computes linear board representations.
Abstract: The field of mechanistic interpretability seeks to understand the internal workings of neural networks, particularly language models. While previous research has demonstrated that language models trained on games can develop linear board representations, the mechanisms by which these representations arise remain unknown. This work investigates the internal workings of a GPT-2-style transformer trained on chess PGNs and proposes an algorithm for how the model computes the board state.
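For context on the abstract's mention of "linear board representations": the claim is that the board state can be read out of the model's activations with a single linear map. The sketch below is a minimal, hypothetical illustration of such a linear probe in PyTorch; it is not the submission's code, and the dimensions, class count, and names (D_MODEL, LinearBoardProbe) are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions (assumptions, not taken from the paper):
# D_MODEL   - width of the transformer's residual stream
# N_SQUARES - 64 squares on the chess board
# N_CLASSES - 13 piece classes per square (6 white, 6 black, empty)
D_MODEL, N_SQUARES, N_CLASSES = 512, 64, 13

class LinearBoardProbe(nn.Module):
    """A single linear map from residual-stream activations to per-square piece logits."""
    def __init__(self, d_model: int = D_MODEL):
        super().__init__()
        self.probe = nn.Linear(d_model, N_SQUARES * N_CLASSES)

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        # resid: (batch, d_model) activations taken at a move token
        logits = self.probe(resid)                    # (batch, 64 * 13)
        return logits.view(-1, N_SQUARES, N_CLASSES)  # (batch, 64, 13)

# Usage with random activations standing in for a real layer's residual stream.
probe = LinearBoardProbe()
resid = torch.randn(8, D_MODEL)                # 8 positions from a PGN move sequence
board_logits = probe(resid)                    # (8, 64, 13)
predicted_board = board_logits.argmax(dim=-1)  # predicted piece class per square
print(predicted_board.shape)                   # torch.Size([8, 64])
```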
Code: zip
Submission Number: 49