Bias-mediated early linearization drives simplicity bias in ReLU networks

ICLR 2026 Conference Submission 20978 Authors

19 Sept 2025 (modified: 08 Oct 2025), CC BY 4.0
Keywords: Learning dynamics, Simplicity bias, ReLU networks
TL;DR: When we equip ReLU networks with bias terms, they learn very similarly to linear networks.
Abstract: ReLU networks and their variants are a key building block of modern deep learning architectures. Despite their ubiquity, our understanding of learning dynamics in these models is still limited. Previous work has relied on strong simplifying assumptions, such as the removal of bias terms or predefined gating structures. Here, we explore how the inclusion of bias terms influences learning dynamics in ReLU networks in the rich learning regime. Surprisingly, we observe that the inclusion of bias terms can simplify learning dynamics, i.e., ReLU networks with bias terms have learning dynamics that are strongly aligned with those of well-understood linear models. We explain this similarity by showing that bias terms push ReLU units into a transient linear regime that drives subsequent linear learning. We validate the practical scope of our results by showing that bias terms, as well as input constants, induce linear-like learning on image data and non-linear problems. Importantly, bias-induced early linear learning has substantial implications for simplicity biases: networks with an early linear phase have a low-rank bias, prefer linear solutions on the data, and generalize better. Our results make explicit that bias terms are an important factor contributing to simplicity biases in neural networks by enhancing preferences for simple solutions. Overall, our findings suggest that observed similarities between linear and nonlinear systems, as well as simplicity biases, are connected to the linearizing effect of bias terms.
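To make the claimed comparison concrete, the following is a minimal sketch (not the authors' code): it trains a two-layer linear network, a ReLU network without bias terms, and a ReLU network with bias terms on the same regression task and prints their loss curves. All choices (width, learning rate, initialization scale, task, seeds) are illustrative assumptions, and the comparison is only qualitative.

import numpy as np

n, d, h, steps, lr = 256, 10, 64, 2000, 0.05

# Shared regression task: mostly linear targets with a mild nonlinearity.
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(n, d))
w_true = data_rng.normal(size=(d, 1))
y = X @ w_true + 0.1 * np.maximum(X[:, :1], 0.0)

def train(use_relu, use_bias, seed=1):
    # Identical small initialization for all three networks.
    r = np.random.default_rng(seed)
    W1 = 0.1 * r.normal(size=(d, h))
    W2 = 0.1 * r.normal(size=(h, 1))
    b1 = np.zeros(h)
    losses = []
    for _ in range(steps):
        pre = X @ W1 + (b1 if use_bias else 0.0)
        hid = np.maximum(pre, 0.0) if use_relu else pre
        err = hid @ W2 - y
        losses.append(0.5 * float(np.mean(err ** 2)))
        # Backpropagation through the two-layer network (full-batch GD).
        g_hid = err @ W2.T / n
        g_pre = g_hid * (pre > 0) if use_relu else g_hid
        W2 -= lr * hid.T @ err / n
        W1 -= lr * X.T @ g_pre
        if use_bias:
            b1 -= lr * g_pre.sum(axis=0)
    return losses

for name, (relu, bias) in {"linear": (False, False),
                           "relu, no bias": (True, False),
                           "relu + bias": (True, True)}.items():
    losses = train(relu, bias)
    print(f"{name:14s} loss@step100={losses[100]:.4f} final={losses[-1]:.4f}")

Plotting the three loss curves (rather than printing two snapshots) is the more informative way to inspect whether the early phase of the biased ReLU network tracks the linear network, but the printed values already allow a rough comparison.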
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 20978