Keywords: deep learning, normalization, gradient interference, isometry, reinforcement learning, off-policy RL, offline RL, deep RL
Abstract: Layer normalization ($\textsf{LN}$) is among the most effective normalization schemes for deep $Q$-learning. However, its benefits are not yet fully understood. We find *gradient interference* to be a promising lens through which to study these benefits. A gradient interference metric used in prior work is the inner product between the semi-gradients of the temporal difference error on two random samples. We argue that, from the perspective of minimizing the loss, a more principled metric is the inner product between a semi-gradient and a full gradient. We test this argument with offline deep $Q$-learning, without a target network, on four classic control tasks. Counterintuitively, we find empirically that first-order gradient interference metrics correlate *positively* with the training loss. We then find that a second-order gradient interference metric avoids this counterintuitive result, i.e., it correlates *negatively*. Theoretically, we provide supporting arguments from the linear regression setting.
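As a concrete illustration (not code from the submission), the sketch below computes the two first-order interference metrics described in the abstract in PyTorch: the inner product between per-sample semi-gradients of the TD error, and the inner product between a per-sample semi-gradient and a full gradient. All names (`q_net`, `td_semi_gradient`, `gamma`) are hypothetical, and "full gradient" is read here as the gradient of the TD loss over the whole offline batch; if the paper instead means the true residual gradient through the bootstrap target, the `torch.no_grad()` block would simply be dropped.

```python
# Minimal sketch: first-order gradient interference metrics for offline deep
# Q-learning without a target network (assumptions noted in the lead-in).
import torch
import torch.nn as nn

def td_semi_gradient(q_net, s, a, r, s_next, done, gamma=0.99):
    """Semi-gradient of the TD loss: the bootstrap target is detached."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # stop-gradient through the bootstrap target
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    loss = 0.5 * ((q_sa - target) ** 2).mean()
    grads = torch.autograd.grad(loss, list(q_net.parameters()))
    return torch.cat([g.flatten() for g in grads])

def interference(g1, g2):
    """First-order interference: inner product of two flattened gradients."""
    return torch.dot(g1, g2).item()

# Toy Q-network and random transitions standing in for an offline batch.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
s, s_next = torch.randn(32, 4), torch.randn(32, 4)
a = torch.randint(0, 2, (32,))
r, done = torch.randn(32), torch.zeros(32)

# (1) Prior-work metric: semi-gradients on two random samples.
g_i = td_semi_gradient(q_net, s[:1], a[:1], r[:1], s_next[:1], done[:1])
g_j = td_semi_gradient(q_net, s[1:2], a[1:2], r[1:2], s_next[1:2], done[1:2])
print("sample-sample interference:", interference(g_i, g_j))

# (2) Metric argued for in the abstract: a per-sample semi-gradient against
#     the gradient of the loss over the full batch (proxy for the dataset).
g_full = td_semi_gradient(q_net, s, a, r, s_next, done)
print("sample-full interference:", interference(g_i, g_full))
```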
Submission Number: 169