Mining the Diamond Miner: Mechanistic Interpretability on the Video PreTraining Agent

NeurIPS 2023 Workshop ATTRIB Submission 4 Authors

Published: 27 Oct 2023, Last Modified: 08 Dec 2023, ATTRIB Poster
Keywords: mechanistic interpretability, reinforcement learning, minecraft
TL;DR: We investigate a Video PreTraining (VPT) agent's behavior in Minecraft with Mechanistic Interpretability (MI) methods and find a significant attention head that encodes attacking actions.
Abstract: Although decision-making systems based on reinforcement learning (RL) can be used in a wide variety of applications, their lack of interpretability raises concerns, especially in high-stakes scenarios. Mechanistic Interpretability (MI), in contrast, has shown potential for breaking down complex deep neural networks into understandable components in language and vision tasks. In this study, we apply MI to understand the behavior of a Video PreTraining (VPT) agent that exhibits human-level proficiency in numerous Minecraft tasks. Our exploration centers on the task of diamond mining and its associated subtasks, such as crafting wooden logs and iron pickaxes. By employing circuit analysis, we aim to decode the network's representation of these tasks and subtasks. We find a significant attention head in the VPT model that encodes an attacking action, although ablating it does not markedly affect the agent's performance. Our findings indicate that this approach can provide useful insights into the agent's behavior.
Submission Number: 4