Optimal learning rate scaling depends on data in deep scalar linear networks

Published: 02 Mar 2026 · Last Modified: 02 Mar 2026 · Sci4DL 2026 · CC BY 4.0
Keywords: scaling law, hyperparameter transfer, gradient descent, learning dynamics
TL;DR: We show that in a simple deep scalar linear network, the optimal depth-wise learning rate scaling depends on the data, while data-agnostic scaling rules fail to transfer across depths.
Abstract: We study the gradient descent dynamics of deep scalar linear networks, which admit exact time-course solutions for any integer depth. We show that even in this minimal model, the optimal depth-wise learning rate scaling depends on the data, whereas data-agnostic scaling rules fail to transfer across depths. Under the data-dependent optimal scaling, the learning dynamics are independent of the data and only weakly dependent on depth, yielding a constant linear convergence rate across all depths, including the infinite-depth limit. We further show similar data-dependent effects in deep scalar linear networks with residual connections.
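The following is a minimal, hypothetical sketch of the phenomenon the abstract describes, not code from the paper: gradient descent on a depth-L scalar linear network whose prediction is the product of L scalar weights, with a grid search for the empirically best learning rate at each depth. The scalar `target` stands in for the data statistic, and the function names, balanced unit initialization, step budget, and learning-rate grid are all illustrative assumptions.

```python
import numpy as np

def final_loss(depth, lr, target, steps=500, init=1.0):
    """Run gradient descent on a depth-`depth` scalar linear network.

    The prediction is the product w_1 * ... * w_L (the input is absorbed
    into `target`), trained on the squared loss 0.5 * (prod(w) - target)**2.
    Returns the final loss, or inf if the iterates diverge.
    """
    w = np.full(depth, init)
    with np.errstate(all="ignore"):  # silence overflow warnings on divergent runs
        for _ in range(steps):
            prod = np.prod(w)
            # dL/dw_i = (prod - target) * prod_{j != i} w_j,
            # computed as prod / w_i while no weight is exactly zero
            grad = (prod - target) * prod / w
            w = w - lr * grad
            if not np.isfinite(w).all():
                return np.inf
    return 0.5 * (np.prod(w) - target) ** 2

def best_lr(depth, target, lrs=np.logspace(-4, 0, 60)):
    """Grid-search the empirically best learning rate at this depth."""
    losses = [final_loss(depth, lr, target) for lr in lrs]
    return lrs[int(np.argmin(losses))]

# How the best learning rate trends with depth shifts with `target`,
# i.e. with the data: no single data-agnostic rule fits every column.
for target in (0.5, 2.0, 4.0):
    trend = {L: float(np.round(best_lr(L, target), 4)) for L in (2, 4, 8)}
    print(f"target={target}: best lr by depth -> {trend}")
```

Comparing the printed rows shows the depth-wise trend of the empirically optimal learning rate changing with the target, which is the data-dependence the paper analyzes exactly; the paper's closed-form scaling is not reproduced here.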
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 50