Bilinear Convolution Decomposition for Causal RL Interpretability

Published: 30 Sept 2025, Last Modified: 30 Sept 2025 · Mech Interp Workshop (NeurIPS 2025) Poster · CC BY 4.0
Keywords: Reinforcement learning
Other Keywords: RL Interp, bilinear convolutions, weight interpretability, bilinear gating
TL;DR: We replace ReLU activations in convolutional networks with a bilinear gating function, allowing us to decompose the weights and interpret weight components.
Abstract: Efforts to interpret reinforcement learning (RL) models tend to target the activation space, while fewer recent studies target the weight space. Here we use a dual framework spanning both the weight and activation spaces to interpret and intervene in an RL network. To enhance RL interpretability, we enable linear decomposition by linearizing an IMPALA network: we replace the nonlinear activation functions in both the convolutional and fully connected layers with bilinear variants (we term the result BIMPALA). Previous work on MLPs has shown that bilinearity enables quantifying functional importance through weight-based eigendecomposition, identifying interpretable low-rank structure \citep{pearce_bilinear_2024}. By extending existing MLP decomposition techniques to convolutional layers, we can analyze channel and spatial dimensions separately through singular value decomposition. We find BIMPALA networks to be feasible and competitive: they perform comparably to their ReLU counterparts when trained on various ProcGen games. Importantly, we find that the bilinear approach, in combination with activation-based probing, provides advantages for interpretability and agent control. In a maze-solving agent, we find a set of orthonormal eigenvectors (we term them \textit{eigenfilters}), the top two of which act as cheese (solution-target) detectors, and another pair of eigenfilters that we can manipulate to control the policy.
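
For concreteness, here is a minimal PyTorch sketch of the bilinear gating the abstract describes: the usual conv-then-ReLU is replaced by the elementwise product of two parallel convolutions, so each output is a quadratic form in its input patch and the layer's computation lives entirely in its weights. This is an illustration under our own assumptions, not the authors' code; the class name `BilinearConv2d` is hypothetical.

```python
import torch
import torch.nn as nn

class BilinearConv2d(nn.Module):
    """Bilinear-gated convolution: out = conv_w(x) * conv_v(x), no ReLU."""

    def __init__(self, in_channels, out_channels, kernel_size, **kwargs):
        super().__init__()
        # Two parallel convolutions with identical shapes; their elementwise
        # product replaces the nonlinear activation.
        self.conv_w = nn.Conv2d(in_channels, out_channels, kernel_size, **kwargs)
        self.conv_v = nn.Conv2d(in_channels, out_channels, kernel_size, **kwargs)

    def forward(self, x):
        # Each output value is (w_c . p)(v_c . p) for input patch p: a
        # quadratic form, so the layer is fully described by its weights.
        return self.conv_w(x) * self.conv_v(x)
```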
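The weight-based decomposition could then proceed along the lines of \citep{pearce_bilinear_2024}: a readout direction over output channels induces a symmetric interaction matrix from the two weight tensors, whose eigenvectors reshape into orthonormal "eigenfilters", and a per-eigenfilter SVD separates channel from spatial structure. The sketch below is our reconstruction of that idea, not the paper's exact method; the helper name `eigenfilters` and the construction of `B` are assumptions.

```python
import torch

def eigenfilters(w, v, u):
    """Eigendecompose the bilinear interaction induced by readout u.

    w, v: parallel conv weights, shape [C_out, C_in, kh, kw]
    u:    readout direction over output channels, shape [C_out]
    """
    W = w.flatten(1)  # [C_out, d] with d = C_in * kh * kw
    V = v.flatten(1)
    # Symmetrized quadratic-form matrix: B = 0.5 (W^T diag(u) V + V^T diag(u) W)
    B = 0.5 * (W.T @ (u[:, None] * V) + V.T @ (u[:, None] * W))
    evals, evecs = torch.linalg.eigh(B)  # eigenvalues in ascending order
    # Each eigenvector reshapes into an [C_in, kh, kw] eigenfilter.
    filters = evecs.T.reshape(-1, *w.shape[1:])
    return evals, filters

# Hypothetical usage: rank eigenfilters by |eigenvalue|, then split one into
# channel and spatial factors with an SVD, as the abstract suggests.
# evals, filters = eigenfilters(layer.conv_w.weight, layer.conv_v.weight, u)
# top2 = filters[evals.abs().argsort(descending=True)[:2]]
# U, S, Vh = torch.linalg.svd(top2[0].flatten(1))  # [C_in, kh*kw]: channels vs. space
```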
Submission Number: 234