Keywords: Reinforcement learning
Other Keywords: RL Interp, bilinear convolutions, weight interpretability, bilinear gating
TL;DR: We replace ReLU in convolutional networks with a bilinear gating function, allowing us to decompose the weights and interpret the resulting weight components.
Abstract: Efforts to interpret reinforcement learning (RL) models tend to target the activation space; fewer recent studies target the weight space. Here we use a dual framework spanning both the weight and activation spaces to interpret and intervene in an RL network. To enhance RL interpretability, we enable linear decomposition via linearization of an IMPALA network: we replace the nonlinear activation functions in both convolutional and fully connected layers with bilinear variants (a network we term BIMPALA). Previous work on MLPs has shown that bilinearity enables quantifying functional importance through weight-based eigendecomposition, identifying interpretable low-rank structure [Pearce et al., 2024b]. By extending existing MLP decomposition techniques to convolutional layers, we can analyze the channel and spatial dimensions separately through singular value decomposition. We find BIMPALA networks to be feasible and competitive: they perform comparably to their ReLU counterparts when trained on various ProcGen games. Importantly, we find that the bilinear approach, combined with activation-based probing, provides advantages for interpretability and agent control. In a maze-solving agent, we find a set of orthonormal eigenvectors (which we term eigenfilters): the top two act as cheese (solution-target) detectors, and another pair of eigenfilters can be manipulated to control the policy.
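To make the core idea concrete, here is a minimal NumPy sketch of the bilinear construction the abstract describes: a layer's nonlinearity is replaced by the elementwise product of two linear maps, so each output unit becomes a quadratic form in the input, and eigendecomposition of the symmetrized weight interaction matrix yields orthonormal directions analogous to the paper's eigenfilters. All variable names (`bilinear_layer`, `W`, `V`, `B_k`) are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def bilinear_layer(x, W, V):
    # Bilinear gating: elementwise product of two linear maps of x,
    # in place of a nonlinearity such as relu(W @ x).
    return (W @ x) * (V @ x)

d_in, d_out = 4, 3
W = rng.standard_normal((d_out, d_in))
V = rng.standard_normal((d_out, d_in))
x = rng.standard_normal(d_in)

y = bilinear_layer(x, W, V)

# Each output unit k is a quadratic form x^T B_k x, where B_k is the
# symmetrized outer product of the two weight rows for that unit.
k = 0
B_k = 0.5 * (np.outer(W[k], V[k]) + np.outer(V[k], W[k]))
assert np.allclose(y[k], x @ B_k @ x)

# Eigendecomposition of B_k gives orthonormal directions ("eigenfilters");
# the unit's output is a signed sum of squared projections onto them.
eigvals, eigvecs = np.linalg.eigh(B_k)
assert np.allclose(
    y[k],
    sum(lam * (eigvecs[:, i] @ x) ** 2 for i, lam in enumerate(eigvals)),
)
```

Because the computation is linear in the weights (rather than gated by an activation pattern), this decomposition depends only on the trained weights, which is what allows the weight-space analysis described above.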
Submission Number: 234