Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN

Published: 30 Sept 2025 · Last Modified: 10 Nov 2025 · Mech Interp Workshop (NeurIPS 2025) Spotlight · CC BY 4.0
Open Source Links: https://github.com/AlignmentResearch/learned-planner
Keywords: Circuit analysis, Reinforcement learning, AI Safety
TL;DR: We reverse-engineer the planning algorithm learned by an RNN that plays Sokoban.
Abstract: We partially reverse-engineer a convolutional recurrent neural network (RNN) trained with model-free reinforcement learning to play the box-pushing game Sokoban. We find that the RNN stores future moves (plans) as activations in particular channels of the hidden state, which we call _path channels_. A high activation at a particular location means that, when a box is in that location, it will get pushed in the channel's assigned direction. We examine the convolutional kernels between path channels and find that they encode the change in position resulting from each possible action, thus representing part of a learned _transition model_. The RNN constructs plans starting from the boxes and the goals: these kernels _extend_ activations in path channels forwards from boxes and backwards from goals. Negative values are placed in path channels at obstacles. This causes the extension kernels to propagate the negative value in reverse, pruning the last few steps and letting an alternative plan emerge; a form of backtracking. Our work shows that a precise understanding of the plan representation allows us to directly understand the bidirectional planning-like algorithm learned by model-free training in more familiar terms.
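The extension-and-pruning dynamic described in the abstract can be illustrated with a toy sketch. The update rule below is hypothetical (it is not the trained network's actual kernel weights): on a single 1-D "push right" path channel, a positive value at a cell means "a box here gets pushed right", a negative value marks an obstacle, and each recurrent step extends positive activations one cell forward while propagating negative values one cell backward, pruning the plan steps that lead into the obstacle.

```python
import numpy as np

def extension_step(channel: np.ndarray) -> np.ndarray:
    """One hypothetical recurrent update of a 1-D 'right' path channel."""
    out = channel.copy()
    # forward extension: a planned push activates the next cell to the right
    out[1:] = np.where(channel[:-1] > 0,
                       np.maximum(out[1:], channel[:-1]), out[1:])
    # backward pruning: a negative (obstacle) cell suppresses the plan step
    # that feeds into it, propagating the negative value in reverse
    out[:-1] = np.where(channel[1:] < 0, channel[1:], out[:-1])
    return out

corridor = np.zeros(5)
corridor[0] = 1.0    # plan seed at a box
corridor[3] = -1.0   # obstacle
for _ in range(2):
    corridor = extension_step(corridor)
# the plan has extended rightward from the box, but the obstacle's negative
# value has begun pruning it back; in the full network an alternative
# direction channel would take over (backtracking)
```

This is only a sketch of the mechanism in one direction; the paper's channels are 2-D and one convolutional kernel per action-pair couples the four directional channels.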
Submission Number: 131