Optimal learning rate scaling depends on data in deep scalar linear networks

Published: 02 Mar 2026 · Last Modified: 02 Mar 2026 · Sci4DL 2026 · CC BY 4.0
Keywords: scaling law, hyperparameter transfer, gradient descent, learning dynamics
TL;DR: We show that in a simple deep scalar linear network, the optimal depth-wise learning rate scaling depends on the data, while data-agnostic scaling rules fail to transfer across depths.
Abstract: We study the gradient descent dynamics of deep scalar linear networks, which admit exact time-course solutions for any integer depth. We show that even in this minimal model, the optimal depth-wise learning rate scaling depends on the data, whereas data-agnostic scaling rules fail to transfer across depths. Under the data-dependent optimal scaling, the learning dynamics are independent of the data and only weakly dependent on depth, yielding a constant linear convergence rate across all depths, including the infinite-depth limit. We further show similar data-dependent effects in deep scalar linear networks with residual connections.
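The following is a minimal, hypothetical sketch of the phenomenon the abstract describes, not code from the paper: gradient descent on a depth-L scalar linear network whose prediction is the product of L scalar weights, with a grid search for the empirically best learning rate at each depth. The scalar `target` stands in for the data statistic, and the function names, balanced unit initialization, step budget, and learning-rate grid are all illustrative assumptions.

```python
import numpy as np

def final_loss(depth, lr, target, steps=500, init=1.0):
    """Run gradient descent on a depth-`depth` scalar linear network.

    The prediction is the product w_1 * ... * w_L (the input is absorbed
    into `target`), trained on the squared loss 0.5 * (prod(w) - target)**2.
    Returns the final loss, or inf if the iterates diverge.
    """
    w = np.full(depth, init)
    with np.errstate(all="ignore"):  # silence overflow warnings on divergent runs
        for _ in range(steps):
            prod = np.prod(w)
            # dL/dw_i = (prod - target) * prod_{j != i} w_j,
            # computed as prod / w_i while no weight is exactly zero
            grad = (prod - target) * prod / w
            w = w - lr * grad
            if not np.isfinite(w).all():
                return np.inf
    return 0.5 * (np.prod(w) - target) ** 2

def best_lr(depth, target, lrs=np.logspace(-4, 0, 60)):
    """Grid-search the empirically best learning rate at this depth."""
    losses = [final_loss(depth, lr, target) for lr in lrs]
    return lrs[int(np.argmin(losses))]

# How the best learning rate trends with depth shifts with `target`,
# i.e. with the data: no single data-agnostic rule fits every column.
for target in (0.5, 2.0, 4.0):
    trend = {L: float(np.round(best_lr(L, target), 4)) for L in (2, 4, 8)}
    print(f"target={target}: best lr by depth -> {trend}")
```

Comparing the printed rows shows the depth-wise trend of the empirically optimal learning rate changing with the target, which is the data-dependence the paper analyzes exactly; the paper's closed-form scaling is not reproduced here.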
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 50