Optimal learning rate scaling depends on data in deep scalar linear networks

Published: 29 May 2026, Last Modified: 29 May 2026
Venue: HiLD at ICML 2026 Poster
Readers: Everyone
License: CC BY 4.0
Keywords: scaling law, hyperparameter transfer, learning dynamics, gradient descent
TL;DR: We show that in a simple deep scalar linear network, the optimal depth-wise learning rate scaling depends on data, while data-agnostic scaling rules fail to transfer across depths.
Abstract: We study the gradient descent dynamics of deep scalar linear networks, which admit exact time-course solutions for any integer depth. We show that even in this minimal model, the optimal depth-wise learning rate scaling depends on the data, whereas data-agnostic scaling rules fail to transfer across depths. Under the data-dependent optimal scaling, the learning dynamics are independent of the data and only weakly dependent on depth, resulting in a constant linear convergence rate across all depths, including the infinite-depth limit. We further show similar data-dependent effects in deep scalar linear networks with residual connections.
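As a rough illustration of the setting in the abstract, the sketch below runs plain gradient descent on a deep scalar linear network. It is a toy under stated assumptions, not the paper's construction: the data are collapsed into a single scalar `target` (as if the input were whitened), the loss is 0.5 * (w_1 * ... * w_L - target)^2, all weights start at the same value `w0`, and the function names are hypothetical. For this toy loss, the curvature at a balanced solution is L * target^(2 - 2/L), which depends on both depth and target, so a learning rate that is stable for one (depth, target) pair need not be stable for another.

```python
import numpy as np

def train_scalar_linear(depth, target, lr, w0=0.5, steps=200):
    """Gradient descent on 0.5 * (prod(w) - target)**2 with `depth` scalar weights.

    Assumption: the data are summarised by the single scalar `target`.
    """
    w = np.full(depth, w0, dtype=float)
    losses = []
    for _ in range(steps):
        err = np.prod(w) - target
        losses.append(0.5 * err**2)
        # d prod(w) / d w_i is the product of all the other weights.
        grad = np.array([err * np.prod(np.delete(w, i)) for i in range(depth)])
        w = w - lr * grad
    return w, np.array(losses)

def sharpness_at_balanced_solution(depth, target):
    """Top Hessian eigenvalue of the toy loss where all weights equal target**(1/depth)."""
    return depth * target ** (2.0 - 2.0 / depth)

if __name__ == "__main__":
    w, losses = train_scalar_linear(depth=3, target=1.0, lr=0.1)
    print("final product:", np.prod(w), "final loss:", losses[-1])
    # Gradient descent near a minimum is stable only if lr * sharpness < 2.
    for depth, target in [(3, 1.0), (6, 4.0)]:
        print(depth, target, 0.1 * sharpness_at_balanced_solution(depth, target))
```

With `lr=0.1`, the depth-3, target-1 run sits well inside the stability condition lr * sharpness < 2, while the same learning rate at depth 6 and target 4 violates it. This is only a property of the toy loss above; the scaling rule the paper derives is not reproduced here.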
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 113