Keywords: deep learning, normalization, gradient interference, isometry, reinforcement learning, off-policy RL, offline RL, deep RL
Abstract: Layer normalization ($\textsf{LN}$) is among the most effective normalization schemes for deep $Q$-learning. However, its benefits are not yet fully understood. We find *gradient interference* to be a promising lens through which to study these benefits. A gradient interference metric used in prior work is the inner product between the semi-gradients of the temporal difference error on two random samples. We argue that, from the perspective of minimizing the loss, a more principled metric is the inner product between a semi-gradient and a full gradient. We test this argument with offline deep $Q$-learning, without a target network, on four classic control tasks. Counterintuitively, we find empirically that first-order gradient interference metrics correlate *positively* with the training loss. We then find that a second-order gradient interference metric avoids this counterintuitive result, i.e., it correlates *negatively*. Theoretically, we provide supporting arguments from the linear regression setting.
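As a concrete illustration (not code from the submission), the sketch below computes the two first-order interference metrics described in the abstract in PyTorch: the inner product between per-sample semi-gradients of the TD error, and the inner product between a per-sample semi-gradient and a full gradient. All names (`q_net`, `td_semi_gradient`, `gamma`) are hypothetical, and "full gradient" is read here as the gradient of the TD loss over the whole offline batch; if the paper instead means the true residual gradient through the bootstrap target, the `torch.no_grad()` block would simply be dropped.

```python
# Minimal sketch: first-order gradient interference metrics for offline deep
# Q-learning without a target network (assumptions noted in the lead-in).
import torch
import torch.nn as nn

def td_semi_gradient(q_net, s, a, r, s_next, done, gamma=0.99):
    """Semi-gradient of the TD loss: the bootstrap target is detached."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # stop-gradient through the bootstrap target
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    loss = 0.5 * ((q_sa - target) ** 2).mean()
    grads = torch.autograd.grad(loss, list(q_net.parameters()))
    return torch.cat([g.flatten() for g in grads])

def interference(g1, g2):
    """First-order interference: inner product of two flattened gradients."""
    return torch.dot(g1, g2).item()

# Toy Q-network and random transitions standing in for an offline batch.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
s, s_next = torch.randn(32, 4), torch.randn(32, 4)
a = torch.randint(0, 2, (32,))
r, done = torch.randn(32), torch.zeros(32)

# (1) Prior-work metric: semi-gradients on two random samples.
g_i = td_semi_gradient(q_net, s[:1], a[:1], r[:1], s_next[:1], done[:1])
g_j = td_semi_gradient(q_net, s[1:2], a[1:2], r[1:2], s_next[1:2], done[1:2])
print("sample-sample interference:", interference(g_i, g_j))

# (2) Metric argued for in the abstract: a per-sample semi-gradient against
#     the gradient of the loss over the full batch (proxy for the dataset).
g_full = td_semi_gradient(q_net, s, a, r, s_next, done)
print("sample-full interference:", interference(g_i, g_full))
```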
Submission Number: 169