Limitation of Characterizing Implicit Regularization by Data-independent Functions
Abstract: In recent years, understanding the implicit regularization of neural networks (NNs) has become a central task in deep learning theory. However, implicit regularization itself is neither rigorously defined nor well understood. In this work, we attempt to define and study implicit regularization mathematically. Importantly, we explore the limitations of a common approach that characterizes implicit regularization by data-independent functions. We propose two dynamical mechanisms, namely the Two-point and One-point Overlapping mechanisms, and based on them provide two recipes for producing classes of one-hidden-neuron NNs whose implicit regularization provably cannot be fully characterized by a particular type of, or by any, data-independent function. Together with previous works, our results underscore the profound data dependency of implicit regularization in general, motivating a detailed future study of the data dependency of NN implicit regularization.
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=140kSqm0uy&nesting=2&sort=date-desc
Changes Since Last Submission: We have made the following changes according to the Action Editor's suggestions:
1. In the penultimate paragraph of the related works, we noted that other works assume convergence to global minima: "...by providing examples of gradient flows for one-neuron ReLU NNs which converge to global minima..."
2. In our definition of $M_S$ (below Definition 3.3), we stressed the motivation for defining it and noted that this assumption is also made in previous work: "...In our discussion about implicit regularization below, we will focus on the scenarios (see 4.1) in which training dynamics converge to global minima. Such focus is common in the study of implicit regularization, for example, Vardi & Shamir (2021)..."
3. Before defining implicit and explicit regularization, we emphasize this point again: "...For implicit regularizations, we focus on methods that find the global minima of a loss function $L_S$..." and "...To motivate the study of implicit regularization, we then give the following definition of explicit regularization. We will also focus on methods that find global minima (but may not be those for $L$)..."
In addition, to make the definition of explicit regularization clearer, we emphasized that $J_S$ is "related to $L$". We hope these changes make this version more readable. Thank you for your suggestions!
Assigned Action Editor: ~Nadav_Cohen1
Submission Number: 1256