Inducing, Detecting and Characterising Neural Modules: A Pipeline for Functional Interpretability in Reinforcement Learning
TL;DR: We show how functionally relevant modules can be induced in RL policy networks, and develop methods to detect them and empirically interpret their functions.
Abstract: Interpretability is crucial for ensuring RL systems align with human values. However, it remains challenging to achieve in complex decision-making domains. Existing methods frequently attempt interpretability at the level of fundamental model units, such as neurons or decision nodes: an approach that scales poorly to large models. Here, we instead propose an approach to interpretability at the level of functional modularity. We show how encouraging sparsity and locality in network weights leads to the emergence of functional modules in RL policy networks. To detect these modules, we develop an extended Louvain algorithm that uses a novel 'correlation alignment' metric to overcome the limitations of standard network analysis techniques when applied to neural network architectures. Applying these methods to 2D and 3D MiniGrid environments reveals the consistent emergence of distinct navigational modules for different axes, and we further demonstrate how these functions can be validated through direct interventions on network weights prior to inference.
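For concreteness, the sketch below illustrates one common way to encourage sparsity and locality in a linear layer's weights (an L1 penalty scaled by the distance between connected neurons in an assumed 2D embedding) and a baseline module detection that runs standard Louvain community detection on a graph of absolute weights. The regulariser form, the graph construction, and all names (`locality_l1_penalty`, `detect_modules`, `in_coords`, `out_coords`) are illustrative assumptions; the paper's extended Louvain algorithm and its 'correlation alignment' metric are not reproduced here.

```python
import torch
import torch.nn as nn
import networkx as nx
from networkx.algorithms.community import louvain_communities


def locality_l1_penalty(layer: nn.Linear, in_coords: torch.Tensor,
                        out_coords: torch.Tensor) -> torch.Tensor:
    """Sparsity + locality regulariser (assumed form): an L1 penalty on each
    weight, scaled by the Euclidean distance between the 2D coordinates of
    the neurons it connects, so long-range connections are penalised more."""
    dists = torch.cdist(out_coords, in_coords)          # (out_dim, in_dim)
    return (layer.weight.abs() * dists).sum()


def detect_modules(layer: nn.Linear, threshold: float = 1e-3):
    """Baseline detection: build a bipartite graph whose edge weights are
    absolute connection strengths, then run standard Louvain community
    detection. The paper's extended algorithm replaces these edge weights
    with a correlation-alignment metric, which is not shown here."""
    W = layer.weight.detach().abs()
    G = nx.Graph()
    out_dim, in_dim = W.shape
    for i in range(out_dim):
        for j in range(in_dim):
            if W[i, j].item() > threshold:
                G.add_edge(f"in_{j}", f"out_{i}", weight=W[i, j].item())
    return louvain_communities(G, weight="weight", seed=0)


# Hypothetical use during training: add the penalty to the policy loss.
# loss = policy_loss + lambda_reg * locality_l1_penalty(layer, in_xy, out_xy)
```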
Lay Summary: Artificial intelligence systems used for decision-making, particularly reinforcement learning agents, are often treated as "black boxes". This lack of transparency creates safety concerns, especially when these AI systems are deployed in critical areas like healthcare or autonomous vehicles.
We develop a new method to make AI decision-making more interpretable by encouraging neural networks to organize themselves into specialized "modules" that handle different aspects of decision-making. We train these modular networks to solve simple game environments, then identify what each module does by observing how the agent behaves when we modify it, for example by effectively "turning off" a module's behaviour. We also demonstrate the usefulness of this modular approach by using it to discover that one of our agents has learnt an incorrect proxy rather than the task it is meant to perform. This information allows us to intervene and slightly change the agent's input to ensure it learns a robust policy that solves the correct task.
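As an illustration of what "turning off" a module might look like in practice, the sketch below zeroes the incoming and outgoing weights of a set of hidden neurons in an MLP policy before running inference. The function name `ablate_module`, the sequential MLP layout, and the neuron indices are hypothetical assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def ablate_module(policy: nn.Sequential, layer_idx: int, neuron_idx: list[int]):
    """'Turn off' a detected module by zeroing the incoming and outgoing
    weights (and biases) of its neurons in a given hidden layer, prior to
    inference. Assumes `policy` alternates nn.Linear and activation layers."""
    linear_layers = [m for m in policy if isinstance(m, nn.Linear)]
    hidden, nxt = linear_layers[layer_idx], linear_layers[layer_idx + 1]
    hidden.weight[neuron_idx, :] = 0.0   # incoming weights to module neurons
    hidden.bias[neuron_idx] = 0.0        # their biases
    nxt.weight[:, neuron_idx] = 0.0      # outgoing weights from module neurons


# Example: silence hidden neurons 3, 7 and 12 in the first hidden layer of a
# hypothetical two-hidden-layer policy, then compare the agent's behaviour.
policy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
                       nn.Linear(32, 32), nn.ReLU(),
                       nn.Linear(32, 4))
ablate_module(policy, layer_idx=0, neuron_idx=[3, 7, 12])
```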
This modular approach offers a promising way to understand and verify AI decision-making by breaking down complex behaviours into a tractable number of interpretable components, rather than trying to analyse individual neurons.
Link To Code: https://github.com/annasoligo/BIXRL
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: Reinforcement Learning, RL, Interpretability, Modularity
Submission Number: 3637