['6,16c6', '< Despite the nonconvex nature of neural networks, training them with local gradient methods finds nearly optimal parameters. Understanding the properties of the loss landscape is theoretically important, as it enables us to depict the learning dynamics of neural networks. For instance, many existing works prove that the loss landscape is "benign" in some sense -i.e. they don\'t have spurious local minima, bad valleys, or decreasing path to infinity Kawaguchi (2016), Venturi et al. (2019), Haeffele & Vidal (2017), Sun et al. (2020), Wang et al. (2021b), Liang et al. (2022). Such characterization enlightens our intuition on why these networks are trained so well.', '< As part of understanding the loss landscape, understanding the structure of global optimum has gained much interest. An example is mode connectivity Garipov et al. (2018), where a simple curve connects two global optima in the set of optimal parameters. Another example is analyzing the permutation symmetry that a global optimum has Simsek et al. (2021). Mathematically understanding the global optimum is important as it sheds light on the structure of the loss landscape. They can also motivate practical algorithms that search over neural networks with the same optimal cost Ainsworth et al. (2022), Mishkin & Pilanci (2023), having practical motivations to study.', '< We shape the loss landscape of regularized neural networks with ReLU activation, mainly analyzing mathematical properties of the global optimum, by considering its convex counterpart and leveraging the dual problem. Our work is inspired by the work of Mishkin & Pilanci (2023), where they characterize the optimal set and stationary points of a two-layer neural network with weight decay using the convex counterpart. They also introduce several important concepts such as the polytope characterization of the optimal solution set, minimal solutions, pruning a solution, and the optimal model fit. 
Expanding the idea of Mishkin & Pilanci (2023), we show a clear connection between the polytope characterization and the dual optimum. We further derive novel characterizations of the optimal set of neural networks, the loss landscape, and generalize the result to different architectures.', '< Finally, it is worth pointing out that regularization plays a central role in modern machine learning, including the training of large language models Andriushchenko et al. (2023). Therefore, including regularization better reflects the training procedure in practice. Figure 1: A schematic that illustrates the staircase of connectivity. This conceptual figure describes the topological change in solution sets as the number of neurons m changes in a high-level manner. Connected components that are not singletons are shown as blue sets, whereas singletons are depicted as red dots. When m = m * , there are only finitely many red dots. When m ≥ m * + 1, there exists a connected component that is not a singleton, i.e. a blue set. When m = M * , there exists a connected component which is a singleton, i.e. a red dot. When m ≥ M * + 1, there is no red dot. At last, when m ≥ min{m * + M * , n + 1}, there is a single blue set.', '< More importantly, adding regularization can change the qualitative behavior of the loss landscape and the global optimum Wang et al. (2021b): for example, there always exist infinitely many optimal solutions for the unregularized problem with ReLU activation due to positive homogeneity. However, regularizing the parameter weights breaks this tie and we may not have infinitely many optimal solutions. It is also possible to design the regularization for the loss landscape to satisfy certain properties such as no spurious local minima Liang et al. (2022), Ge et al. (2017) or unique global optimum Mishkin & Pilanci (2023), Boursier & Flammarion (2023). 
Understanding the loss landscape of regularized neural networks is not only a more realistic setup but can also give novel theoretical properties that the unregularized problem does not have.', '< The specific findings we have for regularized neural networks are:', "< • The optimal polytope: We revisit the fact that the regularized neural network's convex reformulation has a polytope as an optimal set Mishkin & Pilanci (2023). We give a connection between the dual optimum and the polytope.", '< • The staircase of connectivity: For two-layer neural networks with scalar output, we give critical widths and phase transitional behavior of the optimal set as the width of the network m changes. See Figure 1 for an abstract depiction of this phenomenon.', '< • Nonunique minimum-norm interpolators: We examine the problem in Boursier & Flammarion (2023) and show that free skip connections (i.e., an unregularized linear neuron), bias in the training problem, and unidimensional data are all necessary to guarantee the uniqueness of the minimum-norm interpolator. We construct explicit examples where the solution is not unique in each case, inspired by the dual problem. In contrast to the previous perspectives Boursier & Flammarion (2023), Joshi et al. (2023), our results imply that free skip connections may change the qualitative behavior of optimal solutions. Moreover, uniqueness does not hold in dimensions greater than one.', '< • Generalizations: We extend our results by providing a general description of solution sets of the cone-constrained group LASSO. The extensions include the existence of fixed first-layer weight directions for parallel deep neural networks, and connectivity of optimal sets for vector-valued neural networks with regularization.', '< The paper is organized as follows: after discussing related work (Section 1.1) and notations (Section 1.2), we discuss the convex reformulation of neural networks as a preliminary in Section 2. 
Then we discuss the case of two-layer neural networks with scalar output in Section 3, starting from the optimal polytope characterization (Section 3.1), the staircase of connectivity (Section 3.2), and construction of non-unique minimum-norm interpolators (Section 3.3). The possible generalizations are introduced in Section 4. Finally, we conclude the paper in Section 5. Detailed explanations of the experiments and proofs are deferred to the appendix.', '---', '> The remarkable success of neural networks, despite their inherently non-convex optimization landscapes, has spurred significant research into understanding the properties of these landscapes. A thorough understanding of the loss landscape is paramount for deciphering the learning dynamics and generalization capabilities of neural networks. Prior works have extensively demonstrated that, under various conditions, the loss landscape can be "benign," exhibiting properties such as the absence of spurious local minima, bad valleys, or decreasing paths to infinity (Kawaguchi (2016), Venturi et al. (2019), Haeffele & Vidal (2017), Sun et al. (2020), Wang et al. (2021b), Liang et al. (2022)). Such characterizations provide crucial insights into the empirical success of deep learning.', '17a8,21', '> A particularly active area of research focuses on the structure of global optima. Concepts like mode connectivity (Garipov et al. (2018)), where simple curves connect distinct global optima, and the analysis of permutation symmetries (Simsek et al. (2021)), have illuminated the complex geometry of optimal parameter spaces. A deeper mathematical understanding of global optima is not only theoretically enriching but also holds practical implications, inspiring algorithms that explore networks with equivalent optimal costs (Ainsworth et al. 
(2022), Mishkin & Pilanci (2023)).', '> ', '> This paper systematically investigates the loss landscape of regularized neural networks with ReLU activation, primarily focusing on the mathematical properties of global optima. Our methodology hinges on transforming the non-convex problem into an equivalent convex counterpart and leveraging insights from its dual problem. Building upon the foundational work of Mishkin & Pilanci (2023), which characterized optimal sets and stationary points of two-layer networks with weight decay, we further elucidate the connection between polytope characterizations and the dual optimum. We extend these analyses to derive novel insights into the optimal set structures, the overall loss landscape, and generalize our findings to a broader range of architectures.', '> ', '> Regularization, a cornerstone of modern machine learning, including the training of large language models (Andriushchenko et al. (2023)), plays a critical role in shaping the loss landscape. Its inclusion in our analysis provides a more realistic model of practical training procedures. Moreover, regularization can qualitatively alter the landscape, for instance, by breaking the infinite degeneracy of solutions present in unregularized ReLU networks due to positive homogeneity (Wang et al. (2021b)). Carefully designed regularization can even enforce desirable properties such as the absence of spurious local minima (Liang et al. (2022), Ge et al. (2017)) or a unique global optimum (Mishkin & Pilanci (2023), Boursier & Flammarion (2023)). Thus, studying regularized neural networks offers both a more practical setting and reveals novel theoretical properties absent in their unregularized counterparts.', '> ', '> Our key contributions in this work are:', '> • The Optimal Polytope and Dual Connection: We provide a refined understanding of the optimal set of regularized neural networks, demonstrating its polyhedral structure. 
Crucially, we establish a direct link between this polytope characterization and the dual optimum, offering a novel perspective on how optimal parameter directions are determined.', '> • The Staircase of Connectivity: For two-layer neural networks with scalar output, we precisely identify critical widths at which the topology of the optimal set undergoes phase transitions as the number of neurons m changes. This phenomenon, abstractly depicted in Figure 1, reveals a complex interplay between network capacity and solution connectivity.', '> • Non-Unique Minimum-Norm Interpolators: We critically re-examine the uniqueness of minimum-norm interpolators, particularly in the context of free skip connections, bias, and data dimensionality. Contrary to some prior beliefs (Boursier & Flammarion (2023), Joshi et al. (2023)), we construct explicit examples demonstrating non-uniqueness when free skip connections, bias, or unidimensional data assumptions are relaxed. Our dual-problem-inspired constructions highlight the significant role these architectural choices and data properties play.', '> • Generalizations to Diverse Architectures: We extend our theoretical framework to provide a general description of solution sets for cone-constrained group LASSO problems. This includes novel results on fixed first-layer weight directions for parallel deep neural networks and the connectivity of optimal sets for vector-valued neural networks under regularization, showcasing the broad applicability of our approach.', '> ', '> The remainder of the paper is structured as follows: Section 1.1 reviews related work, and Section 1.2 introduces notations. Section 2 provides a preliminary discussion on convex reformulations of neural networks. Section 3 delves into two-layer scalar output neural networks, covering the optimal polytope characterization (Section 3.1), the staircase of connectivity (Section 3.2), and the construction of non-unique minimum-norm interpolators (Section 3.3). 
Section 4 presents generalizations to other architectures. Finally, Section 5 concludes the paper, with detailed experiments and proofs relegated to the appendix.', '> ', '1491d1494', '< ']
