Interpreting learned search: finding a transition model and value function in an RNN that plays Sokoban
Keywords: Interpretability, Mechanistic Interpretability, Planning, Search, LSTM
TL;DR: We reverse-engineer an RNN playing Sokoban, finding an emergent planning algorithm analogous to bidirectional search. It has specialized plan-extension kernels that form a transition model and a value function that decides when to backtrack.
Abstract: We partially reverse-engineer a convolutional recurrent neural network (RNN) trained to play the puzzle game Sokoban with model-free reinforcement learning.
Prior work found that this network solves more levels when given more test-time compute.
Our analysis reveals several mechanisms analogous to components of classic bidirectional search.
For each square, the RNN represents its plan in the activations of channels associated with specific directions.
These state-action activations are analogous to a _value function_: their magnitudes determine when to backtrack and which plan branch survives pruning.
Specialized kernels extend these activations (containing plan and value) forward and backward to create paths, forming a _transition model_.
The algorithm is also _unlike_ classical search in some ways. State representation is not unified; instead, the network considers each box separately. Each layer has its own plan representation and value function, increasing search depth.
Far from being inscrutable, the mechanisms for leveraging test-time compute that this network learned through model-free training can be understood in familiar terms.
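To make the described mechanism concrete, here is a minimal, purely illustrative NumPy sketch of the plan-extension and value-based pruning steps. It is not the authors' code: the grid size, decay, threshold, and the hard-coded shift "kernels" are hypothetical stand-ins for the learned convolution kernels and activation-magnitude dynamics described in the abstract.

```python
import numpy as np

H, W = 5, 5
DIRS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# One channel per move direction; plan[d, i, j] > 0 means "the current
# plan moves in direction d at square (i, j)".
plan = np.zeros((len(DIRS), H, W))
plan[list(DIRS).index("right"), 2, 1] = 1.0  # seed: move right at (2, 1)

def extend_forward(plan, decay=0.9):
    """Shift each direction channel one square along its direction,
    mimicking a kernel that extends the plan toward the goal."""
    out = np.zeros_like(plan)
    for d, (di, dj) in enumerate(DIRS.values()):
        shifted = np.roll(plan[d], (di, dj), axis=(0, 1))
        # zero out the rows/columns that np.roll wrapped around
        if di == 1:
            shifted[0, :] = 0
        elif di == -1:
            shifted[-1, :] = 0
        if dj == 1:
            shifted[:, 0] = 0
        elif dj == -1:
            shifted[:, -1] = 0
        out[d] = np.maximum(plan[d], decay * shifted)
    return out

def prune(plan, threshold=0.5):
    """Value-function-like pruning: drop plan branches whose activation
    magnitude falls below the threshold (the 'backtrack' signal)."""
    return np.where(np.abs(plan) >= threshold, plan, 0.0)

for _ in range(3):  # each tick of extra test-time compute deepens the plan
    plan = prune(extend_forward(plan))
```

In the trained network, the analogous operations are learned convolutions acting on LSTM channel activations rather than hand-written shifts, and extension runs both forward and backward, per box and per layer.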
Supplementary Material: zip
Primary Area: Social and economic aspects of machine learning (e.g., fairness, interpretability, human-AI interaction, privacy, safety, strategic behavior)
Submission Number: 9446