Keywords: Steering, Causal interventions, Reinforcement learning, Understanding high-level properties of models
TL;DR: Our work demonstrates that signals for targeting multiple entities can be multiplexed, by activation amplitude, within single channels of a neural network.
Abstract: In this work we provide an extensive analysis of the operations of a maze-solving reinforcement learning agent trained in the Procgen Heist environment. We target this model because it exhibits a high degree of polysemanticity, which arises because it must pursue multiple different entities to succeed. By focusing on an agent that has to target several similar entities, we aim to answer questions about how each of these entities is processed by the network. Our main finding is that the signals related to targeting different entities are encoded at different activation strengths within a single channel of the network. These "steering channels" are often highly redundant: large numbers of channels enable precise agent steering, but often only within narrow ranges of activation values. We also discover a paradoxical ablation effect in which removing both the steering channels and the navigation circuits improves entity collection rates compared to partial ablation, suggesting unexpected interference between these systems. These findings demonstrate that amplitude-based multiplexing is a fundamental strategy for encoding multiple goals in RL agents, while our counterintuitive ablation studies point to surprising specialization and informational dependencies within the network.
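The channel-level steering and ablation interventions described in the abstract can be realized with standard activation hooks. The sketch below is a minimal illustration of that idea, assuming a PyTorch convolutional policy; the toy network, layer choice, channel index, and steering value are placeholders for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the authors' code): steering one convolutional channel by
# pinning it to a chosen activation amplitude, or zero-ablating it, via a hook.

net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # stand-in for a Procgen policy block
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # layer containing the hypothetical channel
    nn.ReLU(),
)

CHANNEL = 5        # hypothetical "steering channel" index
STEER_VALUE = 2.7  # assumed amplitude within the channel's narrow working range


def make_channel_hook(channel, value=None):
    """Forward hook that overwrites one channel of a conv layer's output.

    value=None zero-ablates the channel; otherwise the channel is pinned to
    `value`, mimicking steering by activation amplitude.
    """
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, channel] = 0.0 if value is None else value
        return patched
    return hook


obs = torch.randn(1, 3, 64, 64)  # stand-in for an observation frame

# Steering: pin the channel to a chosen activation strength during a rollout.
handle = net[2].register_forward_hook(make_channel_hook(CHANNEL, STEER_VALUE))
steered_features = net(obs)
handle.remove()

# Ablation: zero the same channel and compare downstream behaviour.
handle = net[2].register_forward_hook(make_channel_hook(CHANNEL, value=None))
ablated_features = net(obs)
handle.remove()
```

In this setup, sweeping `STEER_VALUE` over a range and observing which entity the agent pursues would correspond to probing the amplitude-multiplexed encoding described above.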
Submission Number: 207