Title: EXPLORING THE LOSS LANDSCAPE OF REGULARIZED NEURAL NETWORKS VIA CONVEX DUALITY

Abstract: We discuss several aspects of the loss landscape of regularized neural networks: the structure of stationary points, connectivity of optimal solutions, path with nonincreasing loss to arbitrary global optimum, and the nonuniqueness of optimal solutions, by casting the problem into an equivalent convex problem and considering its dual. Starting from two-layer neural networks with scalar output, we first characterize the solution set of the convex problem using its dual and further characterize all stationary points. With the characterization, we show that the topology of the global optima goes through a phase transition as the width of the network changes, and construct examples where the problem may have a continuum of optimal solutions. Finally, we show that the solution set characterization and connectivity results can be extended to different architectures, including two-layer vector-valued neural networks and parallel three-layer neural networks.

Section: INTRODUCTION
Despite the nonconvex nature of neural networks, training them with local gradient methods finds nearly optimal parameters. Understanding the properties of the loss landscape is theoretically important, as it enables us to describe the learning dynamics of neural networks. For instance, many existing works prove that the loss landscape is "benign" in some sense, i.e. it has no spurious local minima, bad valleys, or decreasing paths to infinity Kawaguchi (2016), Venturi et al. (2019), Haeffele & Vidal (2017), Sun et al. (2020), Wang et al. (2021b), Liang et al. (2022). Such characterizations sharpen our intuition for why these networks train so well.
As part of understanding the loss landscape, understanding the structure of global optima has gained much interest. An example is mode connectivity Garipov et al. (2018), where a simple curve connects two global optima in the set of optimal parameters. Another example is analyzing the permutation symmetry that a global optimum has Simsek et al. (2021). Mathematically understanding the global optima is important as it sheds light on the structure of the loss landscape. Such results can also motivate practical algorithms that search over neural networks with the same optimal cost Ainsworth et al. (2022), Mishkin & Pilanci (2023), which gives the study practical relevance as well.
We study the loss landscape of regularized neural networks with ReLU activation, mainly analyzing mathematical properties of the global optimum, by considering its convex counterpart and leveraging the dual problem. Our work is inspired by the work of Mishkin & Pilanci (2023), where they characterize the optimal set and stationary points of a two-layer neural network with weight decay using the convex counterpart. They also introduce several important concepts such as the polytope characterization of the optimal solution set, minimal solutions, pruning a solution, and the optimal model fit. Expanding the idea of Mishkin & Pilanci (2023), we show a clear connection between the polytope characterization and the dual optimum. We further derive novel characterizations of the optimal set of neural networks and of the loss landscape, and generalize the results to different architectures.
Figure 1: A schematic that illustrates the staircase of connectivity. This conceptual figure describes, at a high level, the topological change in solution sets as the number of neurons m changes. Connected components that are not singletons are shown as blue sets, whereas singletons are depicted as red dots. When m = m*, there are only finitely many red dots. When m ≥ m* + 1, there exists a connected component that is not a singleton, i.e. a blue set. When m = M*, there exists a connected component which is a singleton, i.e. a red dot. When m ≥ M* + 1, there is no red dot. Finally, when m ≥ min{m* + M*, n + 1}, there is a single blue set.

Finally, it is worth pointing out that regularization plays a central role in modern machine learning, including the training of large language models Andriushchenko et al. (2023). Therefore, including regularization better reflects the training procedure in practice.
More importantly, adding regularization can change the qualitative behavior of the loss landscape and the global optimum Wang et al. (2021b): for example, there always exist infinitely many optimal solutions for the unregularized problem with ReLU activation due to positive homogeneity. However, regularizing the parameter weights breaks this tie and we may not have infinitely many optimal solutions. It is also possible to design the regularization for the loss landscape to satisfy certain properties such as no spurious local minima Liang et al. (2022), Ge et al. (2017) or unique global optimum Mishkin & Pilanci (2023), Boursier & Flammarion (2023). Understanding the loss landscape of regularized neural networks is not only a more realistic setup but can also give novel theoretical properties that the unregularized problem does not have.
The specific findings we have for regularized neural networks are:
• The optimal polytope: We revisit the fact that the regularized neural network's convex reformulation has a polytope as an optimal set Mishkin & Pilanci (2023). We give a connection between the dual optimum and the polytope.
• The staircase of connectivity: For two-layer neural networks with scalar output, we give critical widths and phase transitional behavior of the optimal set as the width of the network m changes. See Figure 1 for an abstract depiction of this phenomenon.
• Nonunique minimum-norm interpolators: We examine the problem in Boursier & Flammarion (2023) and show that free skip connections (i.e., an unregularized linear neuron), bias in the training problem, and unidimensional data are all necessary to guarantee the uniqueness of the minimum-norm interpolator. We construct explicit examples where the solution is not unique in each case, inspired by the dual problem. In contrast to the previous perspectives Boursier & Flammarion (2023), Joshi et al. (2023), our results imply that free skip connections may change the qualitative behavior of optimal solutions. Moreover, uniqueness does not hold in dimensions greater than one.
• Generalizations: We extend our results by providing a general description of solution sets of the cone-constrained group LASSO. The extensions include the existence of fixed first-layer weight directions for parallel deep neural networks, and connectivity of optimal sets for vector-valued neural networks with regularization.
The paper is organized as follows: after discussing related work (Section 1.1) and notations (Section 1.2), we discuss the convex reformulation of neural networks as a preliminary in Section 2. Then we discuss the case of two-layer neural networks with scalar output in Section 3, starting from the optimal polytope characterization (Section 3.1), the staircase of connectivity (Section 3.2), and construction of non-unique minimum-norm interpolators (Section 3.3). The possible generalizations are introduced in Section 4. Finally, we conclude the paper in Section 5. Detailed explanations of the experiments and proofs are deferred to the appendix.

Section: RELATED WORK
Convex Reformulations of Neural Networks Starting from Pilanci & Ergen (2020), a series of works have concentrated on reformulating a neural network optimization problem as an equivalent convex problem and training neural networks to global optimality. It has been shown that many different existing neural network architectures with weight decay have such convex formulations, including vector-valued neural networks Sahiner et al. (2020), CNNs Ergen & Pilanci (2020), and parallel three-layer networks Ergen & Pilanci (2021). Furthermore, properties of the original nonconvex problem such as the characterization of all Clarke stationary points Wang et al. (2021b), and the polyhedral characterization of the optimal set Mishkin & Pilanci (2023) have also been discussed.
Connectivity of optimal sets of neural networks Mode connectivity is an empirical phenomenon where the optimal parameters of neural networks are connected by simple curves of almost similar training/test accuracy Garipov et al. (2018). An intriguing phenomenon itself, it has given rise to theoretical analysis of the connectivity of optimal solutions: to name a few, Kuditipudi et al. (2019) introduces the concept of dropout stability to explain such phenomena, Zhao et al. (2023) uses group theory to understand the connected components of deep linear neural networks, and Akhtiamov & Thomson (2023) introduces theory from differential topology to understand mode connectivity. Permutation symmetry in the parameter space also plays an important role in understanding connectivity. Simsek et al. (2021) shows that assuming a unique global minimizer modulo permutations of a certain size, increasing the size of each layer by one connects all global optima. Unfortunately, their assumption does not hold in our case (Appendix G). A similar characterization is also done in Brea et al. (2019), where saddle points with permutation symmetry are connected. Sharma et al. (2024) further discusses different notions of linear connectivity modulo permutations. A different line of work concentrates on the connection between overparametrization and connectivity of solutions: the main insight here is that when the model is as large as the number of data, the solution set becomes connected Nguyen (2021), Nguyen et al. (2021), Nguyen (2019). Cooper (2021) has a similar connection for overparametrized networks, where they characterize the dimension of the manifold of the optimal parameter space.
Phase transitional behavior of the loss landscape Here we introduce existing work in the literature that gives a characterization saying "adding one more neuron can change the qualitative behavior of the loss landscape", hence having the notion of critical model sizes. We reiterate Nguyen (2021) and Simsek et al. (2021), where adding one neuron changes the connectivity behavior of the optimal set. Liang et al. (2018) adds an exponential neuron, which is a specifically designed neuron, along with a specific regularization to eliminate all spurious local minima. Venturi et al. (2019) defines upper / lower intrinsic dimensions of the training problem in the unregularized case, and shows that the quantity is related to whether the training problem has no spurious valleys. Li et al. (2022) discusses a critical width m* such that, when m ≥ m*, all suboptimal basins are eliminated for certain activation functions. They also discuss how m* is related to n, the number of data points.
Loss landscapes and optimal sets of regularized networks Freeman & Bruna (2016) discusses the loss landscape of the population loss along with a certain regularization, and proves the asymptotic connectivity of all sublevel sets as m increases. Bietti et al. (2022) also introduces an asymptotic landscape result for regularized networks. Haeffele & Vidal (2017) studies the loss landscape of parallel neural networks through the lens of an equivalent convex problem, and shows that when the width m is larger than a certain threshold, there are no spurious local minima. Kunin et al. (2019) analyzes regularized linear autoencoders and points out the discrete structure of critical points under some symmetries. Bucarelli et al. (2024) bounds the Betti number of the sublevel set of the loss landscape for Pfaffian activations, discussing topological complexity of sublevel sets for both the unregularized and the regularized case. On the empirical side, Yang et al. (2021) uses certain metrics to measure the mode connectivity and sharpness of the landscape of regularized neural networks, and indeed shows that larger models tend to have more connected solutions. A few works design specific regularization to make the loss landscape benign, removing spurious local minima and decreasing paths to infinity Ge et al. (2017), Liang et al. (2022).

Section: Properties of unidimensional minimum-norm interpolators
Training minimum-norm interpolators for unidimensional data can lead to sparse interpolators Parhi & Nowak (2023). When we do not penalize the bias, Savarese et al. (2019) has an exact characterization of the interpolation problem in function space, and Hanin (2021) completely characterizes the set of optimal interpolators. From the construction of optimal interpolators, it is natural that there exist problems with a continuum of infinitely many optimal interpolators. A recent work by Nakhleh et al. (2024) extends this setup to vector-valued networks and shows almost-sure uniqueness of a minimum norm interpolator. On the other hand, Boursier & Flammarion (2023) recently showed that when we penalize the bias with free skip connections, we have a unique optimal interpolator. Furthermore, under certain assumptions on the training data, the optimal interpolator is the sparsest. Empirically, it has been believed that having a free skip connection does not affect the behavior of the solution Boursier & Flammarion (2023), Joshi et al. (2023).

Section: PROBLEM SETTING AND NOTATIONS
We are interested in training a neural network with regularization and ReLU activation, namely the optimization problem
$$\min_{\theta \in \mathbb{R}^p} \; L(f_\theta(X), y) + \beta R(\theta). \tag{1}$$
Here, X ∈ R n×d is the data matrix, y ∈ R n is the label vector, θ ∈ R p the concatenation of all parameters of the neural network, f θ the parametrization, β > 0 strength of the regularization, L : R n × R n → R the convex loss function, and R : R p → R the regularization.
We have two different objects of interest in the notion of optimal sets: the optimal solution set in parameter space and the set of optimal functions
$$\Theta^* := \arg\min_{\theta \in \mathbb{R}^p} \; L(f_\theta(X), y) + \beta R(\theta) \subseteq \mathbb{R}^p, \qquad F^* := \{ f_\theta \mid \theta \in \Theta^* \} \subseteq F, \tag{2}$$
where F is the set of functions f : R d → R. The notion of optimal functions will mostly be discussed in Section 3.3, where we discuss minimum-norm interpolators. Note that Θ * regards parameters with permutation symmetry as different parameters.
Next, we clarify the notion of connectivity in this paper. We say two points x, y ∈ S are connected in S if there exists a continuous function f : [0, 1] → S that satisfies f(0) = x and f(1) = y. We say S is connected if any two points x, y ∈ S are connected in S. Also, an isolated point x in S means a point for which there is no continuous path in S from x to any point of S \ {x}.
Finally, we clarify the notation. The notation 1(condition(A)) is defined for a scalar, vector, or matrix A and is evaluated entrywise: an entry is 1 if the condition is met, and 0 otherwise. We write [m] = {1, 2, ..., m}, $\|\cdot\|_2$ for the $\ell_2$ norm, $\|\cdot\|_F$ for the Frobenius norm, $(\cdot)_+$ for the ReLU function, and diag for the diagonal matrix built from a vector. By a hyperplane arrangement, we mean a diagonal matrix diag(1(Xh ≥ 0)) for a vector $h \in \mathbb{R}^d$. When we write $D_i$ for $i \in [P]$, we mean all possible hyperplane arrangements generated from the data matrix $X \in \mathbb{R}^{n \times d}$, hence P is the number of all possible arrangement patterns. We also use the notation
$$K_i = \{ u \mid (2D_i - I)Xu \ge 0 \} \quad \text{for } i \in [P]$$
unless specified differently (as we will for vector-valued networks). By $a \oplus b$, we mean the concatenation of two vectors (or matrices) a and b: if $a \in \mathbb{R}^m$ and $b \in \mathbb{R}^n$, then $a \oplus b \in \mathbb{R}^{m+n}$, and $(a_i)_{i=1}^p$ denotes $a_1 \oplus a_2 \oplus \cdots \oplus a_p$.
For matrices, the notation $A_{i\bullet}$ means the i-th row of A, $A_{\bullet i}$ means the i-th column of A, and for a vector v, $v_{,k}$ denotes the k-th entry of v. We write the matrix inner product as $\langle A, B \rangle_M = \mathrm{tr}(A^T B)$.
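The arrangement matrices $D_i$ and cones $K_i$ defined above can be explored numerically. The following is a minimal sketch, not the procedure used in this paper: it collects patterns 1(Xh ≥ 0) from randomly sampled directions h, so it may miss some of the P patterns. The function names `sample_arrangements` and `in_cone` and the small data matrix are illustrative assumptions.

```python
import numpy as np

def sample_arrangements(X, num_samples=1000, seed=0):
    """Collect distinct patterns 1(Xh >= 0) from random directions h.

    Random sampling may miss patterns, so this returns a subset of the
    arrangements D_1, ..., D_P, each stored as the 0/1 diagonal of D_i,
    together with one witness direction h per pattern.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    patterns = {}
    for _ in range(num_samples):
        h = rng.standard_normal(d)
        pattern = tuple((X @ h >= 0).astype(int))
        patterns.setdefault(pattern, h)
    return patterns

def in_cone(X, pattern, u, tol=1e-12):
    """Check u in K_i = {u : (2 D_i - I) X u >= 0} for D_i = diag(pattern)."""
    signs = 2 * np.asarray(pattern) - 1
    return bool(np.all(signs * (X @ u) >= -tol))

# Hypothetical toy data: three points in the plane.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
patterns = sample_arrangements(X)
```

Each sampled direction h lies in the cone of the pattern it generates, which is a quick sanity check on the sign convention $(2D_i - I)$.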

Section: CONVEX REFORMULATIONS
Our main proof strategy will be introducing an equivalent convex reformulation of the training problem first introduced in Pilanci & Ergen (2020). In this section, we demonstrate the concept by giving an example for two-layer scalar output networks with weight decay.
Consider the optimization problem in equation 3,
$$p^* := \min_{\{w_j, \alpha_j\}_{j=1}^m} \; L\Big( \sum_{j=1}^m (Xw_j)_+ \alpha_j, \; y \Big) + \frac{\beta}{2} \sum_{j=1}^m \big( \|w_j\|_2^2 + \alpha_j^2 \big). \tag{3}$$
The variables are $w_j \in \mathbb{R}^d$ and $\alpha_j \in \mathbb{R}$ for $j \in [m]$.
When the width m of the problem in equation 3 satisfies m ≥ m* for a critical threshold m* ≤ n, we have an equivalent convex problem given as a cone-constrained group LASSO,
$$p^*_{\mathrm{cvx}} := \min_{\{u_i, v_i\}_{i=1}^P, \; u_i, v_i \in K_i} \; L\Big( \sum_{i=1}^P D_i X (u_i - v_i), \; y \Big) + \beta \sum_{i=1}^P \big( \|u_i\|_2 + \|v_i\|_2 \big). \tag{4}$$
The intuition behind the convexification is to constrain each variable to a certain convex cone so that the model is linear in that region, and to apply an appropriate scaling to deal with the regularization.
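As an equivalent convex problem, the optimal values $p^*$ and $p^*_{\mathrm{cvx}}$ are equal. Moreover, from a solution $(u_i, v_i)_{i=1}^P$ of equation 4 satisfying $m = \sum_{i=1}^P 1(u_i \neq 0) + 1(v_i \neq 0)$, we can recover a solution of equation 3 with m neurons by the solution mapping
$$(w_i, \alpha_i) = \Big( \frac{u_i}{\sqrt{\|u_i\|_2}}, \; \sqrt{\|u_i\|_2} \Big) \text{ for } i \in [a], \qquad (w_{i+a}, \alpha_{i+a}) = \Big( \frac{v_i}{\sqrt{\|v_i\|_2}}, \; -\sqrt{\|v_i\|_2} \Big) \text{ for } i \in [m - a],$$
where, without loss of generality, we assume $u_i \neq 0$ for $i \in [a]$ and $v_i \neq 0$ for $i \in [m - a]$. The square-root scaling balances the two layers, so that the weight-decay penalty $\frac{1}{2}(\|w_i\|_2^2 + \alpha_i^2)$ of each neuron equals the group-LASSO penalty $\|u_i\|_2$ of the corresponding block.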
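As a numerical sanity check on the solution mapping, the sketch below assumes the square-root rescaling $w = u / \sqrt{\|u\|_2}$, $\alpha = \pm\sqrt{\|u\|_2}$ that is standard in the convex-reformulation literature, and verifies on random data that a single block and its mapped neuron produce the same output and the same penalty. The helper name `to_neuron` and the random data are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def to_neuron(u, sign=1.0):
    """Map a nonzero convex block u to a neuron (w, alpha).

    Assumed scaling: w = u / sqrt(||u||_2), alpha = sign * sqrt(||u||_2).
    Use sign=+1 for a u_i block and sign=-1 for a v_i block.
    """
    r = np.sqrt(np.linalg.norm(u))
    return u / r, sign * r

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
u = rng.standard_normal(3)
D = np.diag((X @ u >= 0).astype(float))    # arrangement whose cone contains u

w, alpha = to_neuron(u)
neuron_output = relu(X @ w) * alpha         # contribution (X w)_+ alpha in equation 3
convex_output = D @ X @ u                   # contribution D_i X u_i in equation 4
neuron_penalty = 0.5 * (w @ w + alpha**2)   # weight-decay term of this neuron
convex_penalty = np.linalg.norm(u)          # group-LASSO term of this block
```

Both pairs agree up to floating-point error, which is why the optimal values of the nonconvex and convex problems can coincide.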
The problem in equation 3 has a convex dual given as
$$d^* := \max_{\nu \,:\, |\nu^T (Xu)_+| \le \beta \;\; \forall \|u\|_2 \le 1} \; -L^*(\nu), \tag{5}$$
where L * is the convex conjugate of L(•, y) and ν denotes the dual variable. Note that strong duality holds and p * = p * cvx = d * is satisfied when m ≥ m * . Furthermore, we will see that the dual optimum ν * determines the optimal set of both the convex problem in equation 4 and the original problem in equation 3.
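The dual constraint in equation 5 involves a maximization over the unit ball, which can be probed numerically. The sketch below is only a heuristic under stated assumptions: it samples unit vectors u at random and returns the largest observed value of $|\nu^T (Xu)_+|$, which is a lower bound on the true maximum. A value above β certifies that ν is infeasible, whereas a value below β does not certify feasibility. The function name `dual_constraint_lower_bound` is hypothetical.

```python
import numpy as np

def dual_constraint_lower_bound(X, nu, num_samples=20000, seed=0):
    """Lower bound on max_{||u||_2 <= 1} |nu^T (X u)_+| via random unit vectors.

    The objective is positively homogeneous in u, so it suffices to
    sample on the unit sphere.
    """
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((num_samples, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    activations = np.maximum(X @ U.T, 0.0)      # shape (n, num_samples)
    values = np.abs(activations.T @ nu)         # one value per sampled u
    return float(values.max())

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 2))
nu = rng.standard_normal(8)
lower = dual_constraint_lower_bound(X, nu)
# Cauchy-Schwarz gives the crude upper bound ||nu||_2 * sigma_max(X).
upper = np.linalg.norm(nu) * np.linalg.norm(X, 2)
```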

Section: TWO-LAYER SCALAR OUTPUT NEURAL NETWORKS


Section: THE OPTIMAL POLYTOPE
We first describe the optimal set of the problem in equation 4 where L is strictly convex. Note that the polytope characterization was first given in Mishkin & Pilanci (2023). Here, we emphasize the role of the dual optimum in choosing the unique directions. To illustrate the solution set of equation 4, we introduce the notion of an optimal model fit and further characterize it as a singleton.
Proposition 1. Mishkin & Pilanci (2023) Denote the optimal solution set of equation 4 by $\Theta^*$. If the loss function L is strictly convex, the optimal model fit is unique, i.e. the set of optimal model fits satisfies
$$C_y = \Big\{ \sum_{i=1}^P D_i X (u_i^* - v_i^*) \;\Big|\; (u_i^*, v_i^*)_{i=1}^P \in \Theta^* \Big\} = \{ y^* \} \quad \text{for some } y^* \in \mathbb{R}^n.$$
The solution set of equation 4 is given in Theorem 1. For a formal statement see Theorem C.1.
Theorem 1. (The Optimal Polytope, informal) Suppose L is a strictly convex loss function. The directions of optimal parameters of the problem in equation 4, denoted $\bar{u}_i, \bar{v}_i$, are uniquely determined from the dual optimum $\nu^*$. Moreover, the solution set of equation 4 is the polytope
$$P^*_{\nu^*} := \Big\{ (c_i \bar{u}_i, d_i \bar{v}_i)_{i=1}^P \;\Big|\; c_i, d_i \ge 0 \;\; \forall i \in [P], \;\; \sum_{i=1}^P \big( D_i X \bar{u}_i c_i - D_i X \bar{v}_i d_i \big) = y^* \Big\} \subseteq \mathbb{R}^{2dP}, \tag{6}$$
for the unique optimal model fit $y^*$ defined in Proposition 1.

Note that $P^*_{\nu^*}$ is invariant under different choices of $\nu^*$, because they all correspond to the solution set of equation 4. Hence, we write $P^*$ for simplicity. For a geometric intuition of $\nu^*$, see Appendix G.
Theorem 1 implies that equation 4 has a unique direction for each $u_i, v_i$ where $i \in [P]$, which is determined by solving the dual problem. The intuition for this fact is quite clear: suppose there exist two different solutions $(u_i, v_i)_{i=1}^P$ and $(u'_i, v'_i)_{i=1}^P$ where $u_i$ and $u'_i$ are not collinear for some $i \in [P]$. Then the midpoint $((u_i + u'_i)/2, (v_i + v'_i)/2)_{i=1}^P$ has a strictly smaller objective, because L is convex and $\|a\|_2 + \|b\|_2 \ge \|a + b\|_2$ with equality only when a and b are collinear. Theorem 1 says more, however: any conic combination of the vectors $D_i X \bar{u}_i$ and $-D_i X \bar{v}_i$ that sums to $y^*$ yields an optimal solution of equation 4.
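The averaging argument above can be written out as a short chain of inequalities. In this sketch, F denotes the objective of equation 4, $\theta = (u_i, v_i)_{i=1}^P$ and $\theta' = (u'_i, v'_i)_{i=1}^P$ are two assumed optimal solutions, and $\hat{y}, \hat{y}'$ are their model fits; these symbols are introduced here only for illustration. The midpoint is feasible because each cone $K_i$ is convex.

```latex
\begin{aligned}
F\Big(\tfrac{\theta + \theta'}{2}\Big)
 &= L\Big(\tfrac{\hat{y} + \hat{y}'}{2},\, y\Big)
   + \beta \sum_{i=1}^P \Big( \big\|\tfrac{u_i + u'_i}{2}\big\|_2
   + \big\|\tfrac{v_i + v'_i}{2}\big\|_2 \Big) \\
 &\le \tfrac{1}{2} L(\hat{y}, y) + \tfrac{1}{2} L(\hat{y}', y)
   + \beta \sum_{i=1}^P \Big( \big\|\tfrac{u_i + u'_i}{2}\big\|_2
   + \big\|\tfrac{v_i + v'_i}{2}\big\|_2 \Big)
   && \text{(convexity of } L\text{)} \\
 &< \tfrac{1}{2} L(\hat{y}, y) + \tfrac{1}{2} L(\hat{y}', y)
   + \tfrac{\beta}{2} \sum_{i=1}^P \big( \|u_i\|_2 + \|u'_i\|_2
   + \|v_i\|_2 + \|v'_i\|_2 \big)
   && \text{(strict for the non-collinear pair)} \\
 &= \tfrac{1}{2} F(\theta) + \tfrac{1}{2} F(\theta') = p^*_{\mathrm{cvx}}.
\end{aligned}
```

The strict inequality contradicts the optimality of θ and θ', so optimal blocks with the same index must be collinear.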
Another implication of Theorem 1 is that for all stationary points of equation 3, there exists a finite set of possible first-layer weight directions. For a formal statement see Corollary C.1.
Corollary 1. Denote the set of Clarke stationary points of equation 3 as $\Theta_C$. The set of directions of the stationary points
$$\bigcup_{j=1}^m \Big\{ \frac{w_j}{\|w_j\|_2} \;\Big|\; (w_i, \alpha_i)_{i=1}^m \in \Theta_C, \; w_j \neq 0 \Big\}$$
is finite, and is determined by the dual optimum of subsampled convex problems.
The result follows from the fact proven in Ergen & Pilanci (2023) that all stationary points of equation 3 are characterized by the global minimizers of subsampled convex programs that have the same structure as equation 4. This shows that not only the global minima but also the stationary points of equation 3 have a structure that is related to the convex problem.

Section: THE STAIRCASE OF CONNECTIVITY
One significance of this characterization is that when m ≥ m * , we can relate the optimal solution set of the nonconvex problem in equation 3 with the subsets of equation 6 with certain cardinality constraints. Specifically, the cardinality-constrained set
$$P^*(m) := \Big\{ (u_i, v_i)_{i=1}^P \;\Big|\; (u_i, v_i)_{i=1}^P \in P^*, \;\; \sum_{i=1}^P 1(u_i \neq 0) + 1(v_i \neq 0) \le m \Big\} \subseteq \mathbb{R}^{2dP}, \tag{7}$$
will determine the solution set of equation 3, namely Θ*(m), when m ≥ m*. We write Θ*(m) to emphasize the dependence on m, since we illustrate a phase-transitional behavior as m changes. For a formal definition of Θ*(m) see Appendix D. The cardinality constraint is the main reason behind the staircase of connectivity: if m were unbounded, the optimal set would be a single connected polytope. However, as m becomes smaller, certain regions in the polytope are no longer reachable, and the set may become disconnected.
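The cardinality constraint in equation 7 simply counts nonzero blocks. The following sketch shows one way to evaluate it numerically; the tolerance `tol` and the function names are assumptions made for illustration, and membership in the polytope P* itself is taken as given rather than checked.

```python
import numpy as np

def cardinality(blocks_u, blocks_v, tol=1e-10):
    """Count nonzero blocks: sum_i 1(u_i != 0) + 1(v_i != 0)."""
    count_u = sum(np.linalg.norm(u) > tol for u in blocks_u)
    count_v = sum(np.linalg.norm(v) > tol for v in blocks_v)
    return int(count_u + count_v)

def satisfies_width(blocks_u, blocks_v, m, tol=1e-10):
    """Check only the cardinality constraint of P*(m); membership in P* is assumed."""
    return cardinality(blocks_u, blocks_v, tol) <= m

# Hypothetical solution with P = 3 arrangement patterns in dimension d = 2.
blocks_u = [np.array([1.0, 0.5]), np.zeros(2), np.array([0.0, 2.0])]
blocks_v = [np.zeros(2), np.array([0.3, 0.0]), np.zeros(2)]
used_neurons = cardinality(blocks_u, blocks_v)
```

In this toy instance three blocks are nonzero, so the point can be realized by a network of width three or more but not by a narrower one.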
Our proof strategy is first observing phase transitional behaviors in the cardinality-constrained set P * (m), and linking the connectivity behavior of P * (m) and Θ * (m) with appropriate solution mappings (Definition D.7, Definition D.8). Aside from the proof of Theorem 2, the machinery we develop can potentially be applied to extend other topological properties of P * (m) to Θ * (m).
Theorem 2 states the staircase of connectivity informally. For a formal statement and a precise definition of critical widths see Theorem D.2. Note that we exclude the trivial case where $P^* = \{(0, 0)_{i=1}^P\}$ by assuming that $(w_i, \alpha_i)_{i=1}^m \neq (0, 0)_{i=1}^m \in \Theta^*(m)$ exists for some m (Proposition D.1).
Theorem 2. (The staircase of connectivity, informal) Denote the optimal solution set of equation 3 in parameter space as $\Theta^*(m) \subseteq \mathbb{R}^{(d+1)m}$. Suppose L is a strictly convex loss function and there exists $(w_i, \alpha_i)_{i=1}^m \neq (0, 0)_{i=1}^m \in \Theta^*(m)$ for some m. We have critical widths m*, M* that determine the phase transitional behavior of the solution set. Specifically, as m changes, we have that when
(i) m = m*, Θ*(m) is a finite set. Hence, all solutions are disconnected from each other.
(ii) m ≥ m* + 1, there exist A ≠ A′ ∈ Θ*(m) and a path in Θ*(m) connecting them.
(iii) m = M*, Θ*(m) is not a connected set. Moreover, there exists an isolated point in Θ*(m).
(iv) m ≥ M* + 1, permutations of the solution are connected with no isolated points in Θ*(m).
(v) m ≥ min{m* + M*, n + 1}, the set Θ*(m) is connected.
Figure 1 demonstrates Theorem 2 at a conceptual level. When m = m*, the solution set has a discrete structure. One way to see this is that when m = m*, the solutions are vertices of the polytope P*, hence they have a discrete and isolated structure. When m ≥ m* + 1, we have a trivial "splitting" operation that connects two solutions with m* nonzero first-layer weights, which leads to the existence of a "blue set" (a connected component with infinitely many solutions) in Figure 1. When m = M*, the solution having linearly independent first-layer weights with maximum cardinality corresponds to the isolated point in Θ*(M*). When m ≥ M* + 1, on the other hand, any solution is connected with permutations of the same solution. The proof follows from first creating a zero slot in the first-layer weights and using the zero slot to permute. The idea of the proof is identical to that of Simsek et al. (2021), though the details differ. Finally, when m ≥ min{m* + M*, n + 1}, the whole set is connected: m* + M* is obtained by first transforming the solution to have linearly independent first-layer weights and then interpolating to the solution with minimum cardinality. The bound n + 1 follows from the fact that P*(n + 1) is connected, which needs a more sophisticated argument. For details see the proof in Appendix D. Note that there exist algorithms that can exactly compute these critical widths (Remark D.1).
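The "splitting" operation mentioned above can be illustrated numerically. The sketch below shows one natural way to split a neuron, which is an assumption and may differ in detail from the construction in Appendix D: a neuron (w, α) is replaced by $(\sqrt{t}\,w, \sqrt{t}\,\alpha)$ and $(\sqrt{1-t}\,w, \sqrt{1-t}\,\alpha)$ for t ∈ [0, 1]. Both the network output and the weight-decay penalty stay constant along this path, so the loss in equation 3 does not change.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def split_neuron(w, alpha, t):
    """Split (w, alpha) into two neurons scaled by sqrt(t) and sqrt(1 - t)."""
    s, c = np.sqrt(t), np.sqrt(1.0 - t)
    return [(s * w, s * alpha), (c * w, c * alpha)]

def output_and_penalty(X, neurons):
    """Network output sum_j (X w_j)_+ alpha_j and penalty (1/2) sum_j (||w_j||^2 + alpha_j^2)."""
    out = sum(relu(X @ w) * a for w, a in neurons)
    pen = 0.5 * sum(w @ w + a**2 for w, a in neurons)
    return out, pen

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
w, alpha = rng.standard_normal(2), 0.8

base_out, base_pen = output_and_penalty(X, [(w, alpha)])
path = [output_and_penalty(X, split_neuron(w, alpha, t))
        for t in np.linspace(0.0, 1.0, 11)]
```

At t = 0 and t = 1 one of the two neurons is zero, so the path joins the same solution placed in two different slots of a width-(m* + 1) network.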
From Haeffele & Vidal (2017), we know that when m ≥ n + 1 all local minima are global (Theorem 2, Haeffele & Vidal (2017)), and moreover there is a path with non-increasing objective to a global optimum starting from any point Vidal et al. (2022).
Figure 2: Panels for m = 1, m = 2, and m = 3.
$$\cdots \, (a_i x_j + b_i)_+ \theta_i - y_j \bigr)^2 + \frac{\beta}{2} \sum_{i=1}^m \big( \theta_i^2 + a_i^2 + b_i^2 \big).$$
In Figure 2, we plot the loss landscape and the corresponding optimal functions when β = 0.1 for m = 1, 2, 3.
The upper half of Figure 2 illustrates how the loss landscape looks near the global minima, and visualizes the optimal solution set for m = 1, 2, 3. The lower half of Figure 2 shows the optimal learned function for m = 1, 2, 3. The black dots are the datapoints, the red dots correspond to the optimal model fit y * , and the red/blue functions correspond to the functions parametrized by the red/blue sets in the loss landscape, respectively.
When m = 2, two different functions are shown. This is because the connected component with infinitely many solutions emerges from the split of a single neuron corresponding to the same optimal function in F * . When m = 3, we have a sequence of functions that continuously deform from one to another with the same cost. For details on the solution set of the training problem, parameterization of optimal functions, and how the loss landscape is visualized, see Appendix A.
Published as a conference paper at ICLR 2025

Section: NON-UNIQUE OPTIMAL INTERPOLATORS
In this section, we will see how the dual problem can be used to construct specific problem instances that have non-unique interpolators. There are three different setups of interest. First, the minimum-norm interpolation problem with free skip connection and regularized bias (denoted SB: Skip connection; Bias) refers to the problem in equation 8, namely
$$\min_{m, \{a_i, b_i, \theta_i\}_{i=0}^m} \; \sum_{i=1}^m \big( \|a_i\|_2^2 + b_i^2 + \theta_i^2 \big) \quad \text{subject to} \quad X a_0 + b_0 \mathbf{1} + \sum_{i=1}^m (X a_i + b_i \mathbf{1})_+ \theta_i = y. \tag{8}$$
The parameters satisfy a i ∈ R d and b i , θ i ∈ R for i ∈ [m] ∪ {0}, and 1 ∈ R n is a vector of ones.
The term "free skip connection" arises because we have a skip connection, i.e. the linear neuron a_0, that is not regularized. Next, we discuss the minimum-norm interpolation problem without free skip connections but with regularized bias (NSB: No-Skip; Bias), which is the training problem in equation 8 with the additional constraint a_0 = b_0 = 0. Finally, we study the minimum-norm interpolation problem with free skip connections but without bias (SNB: Skip; No-Bias), which is the problem in equation 8 with b_i = 0 for all i ∈ [m] ∪ {0}. Note that the width m is also optimized in all three problems.
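To make the difference between the SB and NSB setups concrete, the sketch below evaluates the interpolation residual and the cost of equation 8 for given parameters. The toy dataset and the chosen parameters are hypothetical, and the NSB point shown is only a feasible interpolator, not a claimed minimizer.

```python
import numpy as np

def residual_and_cost(X, y, a0, b0, neurons):
    """Evaluate equation 8 for a skip connection (a0, b0) and neurons (a_i, b_i, theta_i).

    Returns the norm of the interpolation residual and the regularization
    cost; the skip connection (a0, b0) is free, i.e. not penalized.
    """
    pred = X @ a0 + b0
    cost = 0.0
    for a, b, theta in neurons:
        pred = pred + np.maximum(X @ a + b, 0.0) * theta
        cost += a @ a + b**2 + theta**2
    return float(np.linalg.norm(pred - y)), float(cost)

# Hypothetical unidimensional dataset lying on the line y = x.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 2.0])

# SB: the free skip connection alone interpolates, at zero cost.
res_sb, cost_sb = residual_and_cost(X, y, np.array([1.0]), 0.0, [])

# NSB: a0 = b0 = 0, so a regularized ReLU neuron must do the work.
res_nsb, cost_nsb = residual_and_cost(
    X, y, np.zeros(1), 0.0, [(np.array([1.0]), 0.0, 1.0)]
)
```

On this dataset the skip connection absorbs the linear trend for free, whereas without it the same fit incurs a strictly positive penalty.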
In Boursier & Flammarion (2023), it was proven that for unidimensional data, i.e. when d = 1, the set of all optimal functions F* of equation 8 is a singleton. When d > 1, this is not the case. This fact implies that to extend Boursier & Flammarion (2023) to higher dimensions, we may need additional structure besides free skip connections.
Proposition 2. When X ∈ R^{n×2} and y ∈ R^n, there exists a dataset (X, y) that has a non-unique minimum-norm interpolator for both the SB and SNB problems in equation 8.
When we have no free skip connections, i.e. the case of NSB, for d = 1 there is a class of data that has infinitely many optimal interpolators. The construction follows from making the dual problem $\max_{\|u\| \le 1} |\nu^T (Xu)_+|$ have linearly dependent solutions by forcing n + 1 optimal solutions. For a rigorous construction of (X, y), see Proposition E.4.
Proposition 3. (A class of training problems with infinitely many optimal interpolators, informal) Consider the NSB problem in equation 8 with d = 1. For all n ≥ 2, we can construct infinitely many different datasets (X, y) having infinitely many minimum-norm interpolators.
In the following example, we give a geometric description of finding the dataset (X, y) and the continuum of optimal interpolators for n = 5. Figure 3b shows the continuum of optimal interpolators. We can see that there are infinitely many interpolators with the same cost. A magnification of the range x ∈ [-8, 0] is given to emphasize that the interpolators are indeed different. We can also see from Figure 3c that gradient descent learns the continuum of optimal interpolators. Here we set m = 10. For details on the formula of optimal interpolators, see Appendix B.

Figure 3: Figure 3a shows the geometric construction behind finding v s proposed in Proposition 3. Figure 3b shows the continuum of optimal interpolators, and Figure 3c shows the learned interpolators trained by gradient descent.
Proposition 3 demonstrates that understanding the optimal solution set via the dual optimum enables us to enforce non-uniqueness of the solution set. Moreover, these examples are not constructed case-by-case, but from a geometric structure that is motivated by the object $Q_X = \{ (Xu)_+ \mid \|u\|_2 \le 1 \}$, the convex set $\mathrm{Conv}(Q_X \cup -Q_X)$, and its supporting hyperplane.
Experimentally, the existence of free skip connections does not seem important for the behavior of the solution Boursier & Flammarion (2023), Joshi et al. (2023). However, note that when there is no skip connection, there exist training problems where the minimum-norm interpolation problem has infinitely many solutions. Furthermore, the interpolators in Example 1 and Example 2 may have n breakpoints even under Assumption 1 in Boursier & Flammarion (2023), and such an interpolator can never be the sparsest one. Hence, at least theoretically, the free skip connection plays a significant role in guaranteeing the uniqueness and sparsity of the interpolator, along with penalizing the bias. Note that these different interpolators may have drastically different behavior at points not in the training set. For example, as x → ±∞, the difference between any two different interpolators diverges.

Section: GENERALIZATIONS
In this section, we will extend our results from Section 3 to a more general training setup. We use the fact that for networks of sufficiently large width, training a neural network can be cast as a cone-constrained group LASSO problem Mishkin & Pilanci (2023). Analogous to Theorem 1, we first derive the optimal set of a general cone-constrained group LASSO in equation 9:
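$$\min_{\theta_i \in C_i \cap V_i, \; s_i \in D_i} \; L\Big( \sum_{i=1}^P A_i \theta_i + \sum_{i=1}^Q B_i s_i, \; y \Big) + \beta \sum_{i=1}^P R_i(\theta_i). \tag{9}$$
Here, $A_i, B_i \in \mathbb{R}^{n \times d}$, $\theta_i, s_i \in \mathbb{R}^d$, $y \in \mathbb{R}^n$, $C_i, D_i$ are proper cones, $R_i$ is the regularization (which we assume to be a norm defined on a subspace $V_i \subseteq \mathbb{R}^d$ and to satisfy $V_i \cap C_i \neq \emptyset$), β > 0 is the regularization strength, and $L : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is a convex but not necessarily strictly convex loss function.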
The assumption that R_i is a norm is natural because it will help the training problem find a simpler solution. Equation 9 enables the analysis of many different training setups, including two-layer networks with free skip connections, interpolation, vector-valued outputs, and parallel deep networks of depth 3, extending the results in Section 3.

Section: DESCRIPTION OF THE MINIMUM-NORM SOLUTION SET
The idea to derive the optimal set of equation 9 is essentially the same as deriving the optimal set of equation 4: we consider the dual problem and use strong duality to obtain the wanted result. The exact description of the optimal set is given as
$$P^*_{\mathrm{gen}} = \Big\{ (c_i \bar{\theta}_i)_{i=1}^P \oplus (s_i)_{i=1}^Q \;\Big|\; c_i \ge 0, \;\; \sum_{i=1}^P c_i A_i \bar{\theta}_i + \sum_{i=1}^Q B_i s_i \in C_y, \;\; \bar{\theta}_i \in \bar{\Theta}_i, \;\; \langle B_i^T \nu^*, s_i \rangle = 0, \;\; s_i \in D_i \Big\}. \tag{10}$$
First, $C_y$ is the set of optimal model fits defined in Proposition 1. Each $\bar{\theta}_i$ is contained in a certain set $\bar{\Theta}_i$, which is analogous to the optimal polytope, where the direction of each variable is fixed. Theorem 1 is the special case where the set of optimal directions is a singleton. Finally, we have a constraint on the variables without regularization, which is also derived from the dual formulation. For a detailed derivation see Theorem F.1.
Given the expression in equation 10, we can extend our results in Theorem 1 and Theorem 2 directly to the interpolation problem (Proposition F.1, Proposition F.2). This is because for the interpolation problem, $C_y$ is a singleton and the set $\bar{\Theta}_i$ is also a singleton. We can also find the optimal set characterization of the interpolation problem with free skip connections (Proposition F.3). Here, the dual variable has to satisfy $\langle X^T \nu^*, s \rangle = 0$ for all $s \in \mathbb{R}^d$, meaning that $X^T \nu^* = 0$ is an additional constraint for the dual problem. This additional constraint is the main reason why we have qualitatively different uniqueness behavior when we have free skip connections.

Section: VECTOR-VALUED NETWORKS
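Now we turn to two-layer neural networks with vector-valued outputs, namely the problem
$$\min_{(w_i, z_i)_{i=1}^m} \frac{1}{2} \Big\| \sum_{i=1}^m (X w_i)_+ z_i^T - Y \Big\|_F^2 + \frac{\beta}{2} \sum_{i=1}^m \big( \|w_i\|_2^2 + \|z_i\|_2^2 \big), \tag{11}$$
where $w_i \in \mathbb{R}^{d \times 1}$, $z_i \in \mathbb{R}^{c \times 1}$ and $Y \in \mathbb{R}^{n \times c}$. The vector-valued problem is known to have a convex reformulation Sahiner et al. (2020), which is given as
$$\min_{V_i} \frac{1}{2} \Big\| \sum_{i=1}^P D_i X V_i - Y \Big\|_F^2 + \beta \sum_{i=1}^P \|V_i\|_{K_i,*}.$$
The norm $\|V\|_{K_i,*}$ is defined as
$$\|V\|_{K_i,*} := \min\ t \ge 0 \quad \text{s.t.} \quad V \in t K_i,$$
for $K_i = \mathrm{conv}\{ u g^T \mid (2D_i - I) X u \ge 0,\ \|u g^T\|_* \le 1 \}$, and the subspace is $\mathcal{V}_i = \mathrm{span}\{ u g^T \mid (2D_i - I) X u \ge 0,\ g \in \mathbb{R}^c \}$.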
The problem falls into the category of equation 9, and we can describe its optimal set completely. With appropriate solution maps, we can describe the optimal set of the nonconvex vector-valued problem in equation 11 (Proposition F.5), and the same idea can be applied to describe a subset of the optimal solution set for deep networks (Theorem F.2). A direct implication extends the loss landscape result in Corollary 2 to vector-valued networks.

Corollary 4. Consider the problem in equation 11 with $m \ge nc + 1$. For any $\theta := (w_i, z_i)_{i=1}^m \in \mathbb{R}^{(d+c)m}$, there exists a continuous path from $\theta$ to any global optimum $\theta^*$ with nonincreasing loss.
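To make the objective in equation 11 concrete, the following is a minimal NumPy sketch of how it could be evaluated. The function names `vector_objective` and `rescale`, the stacked layout of the weights, and the random data are illustrative assumptions, not part of the paper's code.

```python
import numpy as np


def vector_objective(W, Z, X, Y, beta):
    """Objective of equation 11 for stacked weights (layout is an assumption).

    W has shape (d, m) with columns w_i; Z has shape (c, m) with columns z_i.
    """
    hidden = np.maximum(X @ W, 0.0)   # (n, m): ReLU features (X w_i)_+
    pred = hidden @ Z.T               # (n, c): sum_i (X w_i)_+ z_i^T
    fit = 0.5 * np.sum((pred - Y) ** 2)
    reg = 0.5 * beta * (np.sum(W ** 2) + np.sum(Z ** 2))
    return fit + reg


def rescale(W, Z, gamma):
    """Rescale neuron i by gamma_i > 0: (w_i, z_i) -> (gamma_i w_i, z_i / gamma_i)."""
    return W * gamma, Z / gamma


rng = np.random.default_rng(0)
n, d, c = 5, 3, 2
m = n * c + 1                         # smallest width covered by Corollary 4
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, c))
W = rng.standard_normal((d, m))
Z = rng.standard_normal((c, m))
print(vector_objective(W, Z, X, Y, beta=0.1))
```

The `rescale` helper illustrates the positive homogeneity of the ReLU: rescaling a neuron leaves the prediction unchanged and only affects the regularization term.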

Section: PARALLEL DEEP NEURAL NETWORKS
Finally, we extend the characterization to deeper networks. Convex reformulations of parallel three-layer neural networks have been discussed in Ergen & Pilanci (2021), Wang et al. (2021a). The specific training problem we are interested in is
$$\min_{m, \{W_{1i}, w_{2i}, \alpha_i\}_{i=1}^m} \frac{1}{3} \sum_{i=1}^m \big( \|W_{1i}\|_F^3 + \|w_{2i}\|_2^3 + |\alpha_i|^3 \big) \quad \text{s.t.} \quad \sum_{i=1}^m \big( (X W_{1i})_+ w_{2i} \big)_+ \alpha_i = y. \tag{12}$$
The sizes of the weights are $W_{1i} \in \mathbb{R}^{d \times m_1}$, $w_{2i} \in \mathbb{R}^{m_1}$, and $\alpha_i \in \mathbb{R}$. The dual problem of the convex reformulation can be understood as optimizing a linear function with cone and norm constraints, and we obtain results analogous to the optimal polytope. Specifically, the directions of the columns of the first-layer weights take values in a finite set of vectors (Theorem 3). This suggests that our results are fairly generic and could be generalized to other deep parallel architectures with appropriate parametrization. For a detailed proof, see Appendix F.

Theorem 3. Consider the training problem in equation 12. Then, there are only finitely many possible values of the directions of the columns of $W^*_{1i}$. Moreover, the directions are determined by solving the dual problem
$$\max_{\|W_1\|_F \le 1,\ \|w_2\|_2 \le 1} \big| (\nu^*)^T \big( (X W_1)_+ w_2 \big)_+ \big|$$
when $y \neq 0$.

Section: CONCLUSION
In this paper, we present an in-depth exploration of the loss landscape and the solution set of regularized neural networks. We start with a two-layer scalar neural network as the simplest case and demonstrate the properties of the solution set, including the existence of optimal directions, the phase transition in connectivity, and the non-uniqueness of minimum-norm interpolators. Then, we give a more general description of the optimal set of cone-constrained group LASSO and extend the previous results to a more general setup.
Our paper may be extended in multiple ways. One interesting open problem is to identify the right architecture that ensures uniqueness of the minimum-norm interpolator in high dimensions, since free skip connections alone do not help. Another interesting problem is showing "almost sure uniqueness" of the problem in equation 3, up to permutations: intuitively, from the examples we show, it can be speculated that the dual problem $\max |\nu^T (Xu)_+|$ subject to $\|u\| \le 1$ is unlikely to have "too many" optimal solutions. Hence it is likely that the set of datasets that make the minimum-norm interpolator non-unique is very small. We conjecture that this set has measure 0 in $\mathbb{R}^{2n}$, and leave it for future work. Finally, extending the optimal polytope and connectivity results to tree neural networks Zeger et al. (2024) with arbitrary depth could be a meaningful contribution.

Section: REPRODUCIBILITY STATEMENT
The only randomness in our experiments arises in Figure 5a, Figure 5b, Figure 6a, and Figure 6b, where different initializations may lead to different learned interpolators. We fix the random seeds to make all results reproducible. We ran the experiments on a laptop and provide the code to generate the figures. Code available at https://github.com/pilancilab/Loss-landscapeconvex-duality

Section: APPENDIX A DETAILS ON THE TOY EXAMPLE IN FIGURE 2
In this section, we give details of the toy example in Figure 2. Specifically, we illustrate how the loss landscape is plotted, how the set of optimal solutions is derived, and present the models that are found by gradient descent.
One important remark is that the figure does not directly imply the staircase of connectivity: the fact that two optimal solutions are disconnected in the visualization does not mean that they are disconnected in the optimal solution set, and vice versa. The figures illustrate the phenomenon; they are not a proof.
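The optimization problem that we consider is
$$\min_{\{(\theta_i, a_i, b_i)\}_{i=1}^m} \frac{1}{2} \sum_{j=1}^2 \Big( \sum_{i=1}^m (a_i x_j + b_i)_+ \theta_i - y_j \Big)^2 + \frac{\beta}{2} \sum_{i=1}^m (\theta_i^2 + a_i^2 + b_i^2),$$
where $\{(x_i, y_i)\}_{i=1}^2 = \{(-\sqrt{3}, 1), (\sqrt{3}, 1)\}$ and $\beta = 0.1$. When we write
$$X = \begin{bmatrix} -\sqrt{3} & 1 \\ \sqrt{3} & 1 \end{bmatrix} \in \mathbb{R}^{2 \times 2}, \qquad y = [1, 1]^T,$$
and denote the first-layer weights by $U \in \mathbb{R}^{2 \times m}$ and the second-layer weights by $v \in \mathbb{R}^m$, the optimization problem can also be written as
$$\min_{U \in \mathbb{R}^{2 \times m},\ v \in \mathbb{R}^m} \frac{1}{2} \|(XU)_+ v - y\|_2^2 + \frac{\beta}{2} \big( \|U\|_F^2 + \|v\|_2^2 \big).$$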
Let the objective be L(U, v). Note that even when m = 1, there are three parameters, so it is impossible to plot the loss landscape in a three-dimensional plot. What we do is plot a certain section of the loss landscape, as done in Li et al. (2018), to demonstrate our result.
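When $m = 1$, with $r = \sqrt{1 - 0.5\beta}$, we plot
$$F(t, s) = L\Big( \begin{bmatrix} t \\ s \end{bmatrix}, [r] \Big), \quad (t, s) \in [-1, 1] \times [-0.5, 2].$$
The point $t = 0$, $s = r$ is the only optimum.
When $m = 2$, with $r = \sqrt{1 - 0.5\beta}$, we define
$$U_0 = \begin{bmatrix} 0 & 0 \\ r & 0 \end{bmatrix}, \quad U_1 = \begin{bmatrix} \frac{\sqrt{3} r}{2\sqrt{2}} & -\frac{\sqrt{3} r}{2\sqrt{2}} \\ \frac{r}{2\sqrt{2}} & \frac{r}{2\sqrt{2}} \end{bmatrix}, \quad U_2 = \begin{bmatrix} 0 & 0 \\ 0 & r \end{bmatrix}, \quad v_0 = \begin{bmatrix} r \\ 0 \end{bmatrix}, \quad v_1 = \begin{bmatrix} r/\sqrt{2} \\ r/\sqrt{2} \end{bmatrix}, \quad v_2 = \begin{bmatrix} 0 \\ r \end{bmatrix}.$$
Then, we plot
$$F(t, s) = L\big( \cos(t) U_0 + 2s (U_1 - U_0) + \sin(t) U_2,\ \cos(t) v_0 + 2s (v_1 - v_0) + \sin(t) v_2 \big), \quad (t, s) \in [-0.25, 0.6] \times [-0.5, 0.3].$$
The optimal solutions here are $(t, s) = (0, 0.5)$ and the line $s = 0$, $t \ge 0$.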
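The $m = 2$ slice can be evaluated numerically. The sketch below is not the authors' plotting code: it only re-implements the objective $L(U, v)$ and the slice $F(t, s)$ from the definitions above, and the names `loss` and `F` are illustrative. It compares the slice against the optimal cost $\beta - \beta^2/4$ derived later in this appendix.

```python
import numpy as np

SQ3 = np.sqrt(3.0)
X = np.array([[-SQ3, 1.0], [SQ3, 1.0]])
y = np.array([1.0, 1.0])
beta = 0.1
r = np.sqrt(1.0 - 0.5 * beta)


def loss(U, v):
    """Regularized squared loss L(U, v) of the toy problem."""
    pred = np.maximum(X @ U, 0.0) @ v
    return 0.5 * np.sum((pred - y) ** 2) + 0.5 * beta * (np.sum(U ** 2) + np.sum(v ** 2))


a = r / (2.0 * np.sqrt(2.0))
U0 = np.array([[0.0, 0.0], [r, 0.0]])
U1 = np.array([[SQ3 * a, -SQ3 * a], [a, a]])
U2 = np.array([[0.0, 0.0], [0.0, r]])
v0 = np.array([r, 0.0])
v1 = np.array([r / np.sqrt(2.0), r / np.sqrt(2.0)])
v2 = np.array([0.0, r])


def F(t, s):
    """Two-dimensional slice of the m = 2 loss landscape."""
    U = np.cos(t) * U0 + 2.0 * s * (U1 - U0) + np.sin(t) * U2
    v = np.cos(t) * v0 + 2.0 * s * (v1 - v0) + np.sin(t) * v2
    return loss(U, v)


p_star = beta - beta ** 2 / 4.0        # optimal cost of the toy problem
print(F(0.0, 0.5) - p_star)            # the point (t, s) = (0, 0.5)
print(F(0.4, 0.0) - p_star)            # a point on the line s = 0, t >= 0
print(F(-0.2, 0.0) - p_star)           # off the optimal set (t < 0)
```

A contour plot over the stated range of $(t, s)$ can be produced by evaluating `F` on a grid.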
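When $m = 3$, with $r = \sqrt{1 - 0.5\beta}$, we define
$$U_0 = \begin{bmatrix} 0 & 0 & 0 \\ r & 0 & 0 \end{bmatrix}, \quad U_1 = \begin{bmatrix} 0 & \frac{\sqrt{3} r}{2\sqrt{2}} & -\frac{\sqrt{3} r}{2\sqrt{2}} \\ 0 & \frac{r}{2\sqrt{2}} & \frac{r}{2\sqrt{2}} \end{bmatrix}, \quad U_2 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & r & 0 \end{bmatrix},$$
$$v_0 = \begin{bmatrix} r \\ 0 \\ 0 \end{bmatrix}, \quad v_1 = \begin{bmatrix} 0 \\ r/\sqrt{2} \\ r/\sqrt{2} \end{bmatrix}, \quad v_2 = \begin{bmatrix} 0 \\ r \\ 0 \end{bmatrix}.$$
Then, we plot
$$F(t, s) = L\big( \cos(t)\cos(s) U_0 + \cos(t)\sin(s) U_1 + \sin(t) U_2,\ \cos(t)\cos(s) v_0 + \cos(t)\sin(s) v_1 + \sin(t) v_2 \big), \quad (t, s) \in [-0.5, 1] \times [-0.5, 1].$$
The optimal solutions are $s = 0$, $t \ge 0$ and $t = 0$, $s \ge 0$.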
The contour plots of the loss landscape can be found in Figure 4. They show that the connectivity behavior of the set of optimal solutions changes with the width. Figure 5 shows what gradient descent actually learns for the problem. We can see that gradient descent finds multiple optimal solutions, which supports our claim that there is a continuum of optimal solutions. We present both the case $m = 3$ and the case $m = 5$. When $m$ is increased, the model gets stuck at local minima less often. Finally, we show that all optimal functions can be written as
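$$f(x) = \sqrt{\kappa t} \Big( \frac{\sqrt{3 \kappa t}}{2} x + \frac{\sqrt{\kappa t}}{2} \Big)_+ + \sqrt{\kappa t} \Big( -\frac{\sqrt{3 \kappa t}}{2} x + \frac{\sqrt{\kappa t}}{2} \Big)_+ + \sqrt{\kappa (1 - 2t)} \Big( \sqrt{\kappa (1 - 2t)} \Big)_+,$$
where $\kappa = 1 - \beta/2$ and $t \in [0, 1/2]$. For $\nu = [1/2, 1/2]^T$, we know that $\max_{\|u\|_2 \le 1} |\nu^T (Xu)_+| = 1$.
Write the model fit as $y^* = \sum_{i=1}^m (X u_i)_+ \alpha_i$. Then,
$$\langle \nu, y^* \rangle \le \sum_{i=1}^m \Big| \nu^T \Big( X \frac{u_i}{\|u_i\|_2} \Big)_+ \Big| \, \|u_i\|_2 |\alpha_i| \le \frac{1}{2} \sum_{i=1}^m \big( \|u_i\|_2^2 + |\alpha_i|^2 \big).$$
This means the objective has the lower bound $\frac{1}{2} \|y^* - y\|_2^2 + \beta \langle \nu, y^* \rangle$, and the minimum of the lower bound is attained when $y^* = [1 - \beta/2, 1 - \beta/2]^T$. Substituting shows that the lower bound of the objective is $\beta - \beta^2/4$. When $u_1 = [0, \sqrt{1 - \beta/2}]^T$ and $\alpha_1 = \sqrt{1 - \beta/2}$, we have a solution with cost $\beta - \beta^2/4$, hence $y^*$ is indeed optimal. Finally, we set $\nu^* = y - y^*$ and use Theorem 1 to find the complete solution set.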
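The two numerical claims above can be checked with a short script. This is a sketch under stated assumptions: the helper names `objective`, `family`, and `dual_norm` are illustrative, and the maximum over the unit ball is approximated by a grid search over unit vectors, which suffices here because the expression is positively homogeneous in $u$.

```python
import numpy as np

SQ3 = np.sqrt(3.0)
X = np.array([[-SQ3, 1.0], [SQ3, 1.0]])
y = np.array([1.0, 1.0])


def objective(U, v, beta):
    """Regularized squared loss of the toy problem."""
    pred = np.maximum(X @ U, 0.0) @ v
    return 0.5 * np.sum((pred - y) ** 2) + 0.5 * beta * (np.sum(U ** 2) + np.sum(v ** 2))


def family(t, beta):
    """Three-neuron parameters realizing f(x) for a given t in [0, 1/2]."""
    kappa = 1.0 - beta / 2.0
    a = np.sqrt(kappa * t)
    c = np.sqrt(kappa * (1.0 - 2.0 * t))
    U = np.array([[SQ3 * a / 2.0, -SQ3 * a / 2.0, 0.0],
                  [a / 2.0, a / 2.0, c]])
    v = np.array([a, a, c])
    return U, v


def dual_norm(nu, num_angles=3600):
    """Grid approximation of max over unit vectors u of |nu^T (X u)_+|."""
    phi = np.linspace(0.0, 2.0 * np.pi, num_angles, endpoint=False)
    dirs = np.stack([np.cos(phi), np.sin(phi)])      # shape (2, num_angles)
    return np.max(np.abs(nu @ np.maximum(X @ dirs, 0.0)))


beta = 0.1
for t in np.linspace(0.0, 0.5, 6):
    U, v = family(t, beta)
    print(t, objective(U, v, beta))                  # constant in t
print(beta - beta ** 2 / 4.0, dual_norm(np.array([0.5, 0.5])))
```

Every member of the family attains the same cost, which is the sense in which the toy problem has a continuum of optimal solutions.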

Section: B DETAILS ON THE TOY EXAMPLE IN FIGURE 3
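The construction of $x$ follows from Proposition 3. For this particular example we distribute the angles identically, hence we obtain the form in Example 2. The six optimal directions are
$$\begin{bmatrix} \sqrt{3}/2 \\ 1/2 \end{bmatrix}, \quad \begin{bmatrix} \sqrt{2}/2 \\ \sqrt{2}/2 \end{bmatrix}, \quad \begin{bmatrix} 1/2 \\ \sqrt{3}/2 \end{bmatrix}, \quad \begin{bmatrix} (\sqrt{6} - \sqrt{2})/4 \\ (\sqrt{6} + \sqrt{2})/4 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad \begin{bmatrix} -\sqrt{3}/2 \\ 1/2 \end{bmatrix}.$$
We denote these optimal directions by $\bar u_1, \bar u_2, \cdots, \bar u_6$. We construct $y$ as
$$y = 20 \big( (X \bar u_1)_+ + (X \bar u_3)_+ + (X \bar u_5)_+ \big),$$
which is numerically very close to $[94, 29, 24, 20, 20]$. Note that the class of optimal interpolators is
$$f(x) = (20 - 7.076 t)([x, 1] \cdot \bar u_1)_+ + (13.1592 t)([x, 1] \cdot \bar u_2)_+ + (20 - 13.1623 t)([x, 1] \cdot \bar u_3)_+ + (13.159 t)([x, 1] \cdot \bar u_4)_+ + (20 - 7.081 t)([x, 1] \cdot \bar u_5)_+ + t ([x, 1] \cdot \bar u_6)_+,$$
where $t \in [0, 1.5194]$, all of which have the same optimal cost 60.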
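As a quick sanity check of the numbers above, the sketch below verifies that the six directions are unit vectors at angles 30, 45, 60, 75, 90, and 150 degrees, and that the outer coefficients stay nonnegative and sum to approximately 60 on the stated range of $t$. The angle reading and the identification of the cost with the coefficient sum (suggested by $60 = 3 \cdot 20$ and the unit-norm directions) are assumptions of this sketch; the small deviation from 60 reflects the rounding of the printed coefficients.

```python
import numpy as np

# Closed forms of the six optimal directions, as listed in the text.
directions = np.array([
    [np.sqrt(3) / 2, 1 / 2],
    [np.sqrt(2) / 2, np.sqrt(2) / 2],
    [1 / 2, np.sqrt(3) / 2],
    [(np.sqrt(6) - np.sqrt(2)) / 4, (np.sqrt(6) + np.sqrt(2)) / 4],
    [0.0, 1.0],
    [-np.sqrt(3) / 2, 1 / 2],
])

# Assumed reading: the directions are unit vectors at these angles.
angles = np.deg2rad([30, 45, 60, 75, 90, 150])
from_angles = np.stack([np.cos(angles), np.sin(angles)], axis=1)


def coefficients(t):
    """Outer coefficients of the interpolator family for a given t."""
    return np.array([20 - 7.076 * t, 13.1592 * t, 20 - 13.1623 * t,
                     13.159 * t, 20 - 7.081 * t, t])


for t in np.linspace(0.0, 1.5194, 5):
    coef = coefficients(t)
    print(t, coef.min(), coef.sum())
```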
Similar to the experiment in Appendix A, we give examples of the functions learned by gradient descent in Figure 6. We set $\beta = 0.1$ and solve the regularized problem. Here we find multiple optimal functions. The important remark is that not all runs reach an optimum (some get stuck at local minima), but there exist different interpolators with the same cost.

Section: C PROOFS IN SECTION 3.1
In this section, we briefly discuss how Theorem 1 is derived and the intuition behind it. Consider the problem introduced in equation 4, and write its optimal solution set as $\Theta^*$. To discuss the solution set, we first define the set of optimal model fits, which was first introduced in Mishkin & Pilanci (2023).

Definition C.1 (Mishkin & Pilanci (2023)). The set of optimal model fits $C_y$ is defined as
$$C_y = \Big\{ \sum_{i=1}^P D_i X (u_i^* - v_i^*) \;\Big|\; (u_i^*, v_i^*)_{i=1}^P \in \Theta^* \Big\}.$$
When $L$ is strictly convex, which is the case for $\ell_2$ regression for instance, $C_y$ becomes a singleton Mishkin & Pilanci (2023).

Proposition C.1 (Proposition 1 of the paper; Mishkin & Pilanci (2023)). If the loss function $L$ is strictly convex, the optimal model fit is unique, i.e., the set of optimal model fits
$$C_y = \Big\{ \sum_{i=1}^P D_i X (u_i^* - v_i^*) \;\Big|\; (u_i^*, v_i^*)_{i=1}^P \in \Theta^* \Big\}$$
satisfies $C_y = \{y^*\}$ for some $y^* \in \mathbb{R}^n$.

Proof. Assume $y_1, y_2 \in C_y$ and $y_1 \neq y_2$. Let $\sum_{i=1}^P D_i X (u_i - v_i) = y_1$ and $\sum_{i=1}^P D_i X (u_i' - v_i') = y_2$ for $(u_i, v_i)_{i=1}^P, (u_i', v_i')_{i=1}^P \in \Theta^*$. Consider $\theta_{\mathrm{avg}} = \big( \frac{u_i + u_i'}{2}, \frac{v_i + v_i'}{2} \big)_{i=1}^P$. The objective value of $\theta_{\mathrm{avg}}$ is
$$L\Big( \frac{y_1 + y_2}{2}, y \Big) + \beta \sum_{i=1}^P \Big( \Big\| \frac{u_i + u_i'}{2} \Big\|_2 + \Big\| \frac{v_i + v_i'}{2} \Big\|_2 \Big),$$
which is strictly smaller than
$$\frac{1}{2} \Big[ L(y_1, y) + \beta \sum_{i=1}^P \big( \|u_i\|_2 + \|v_i\|_2 \big) + L(y_2, y) + \beta \sum_{i=1}^P \big( \|u_i'\|_2 + \|v_i'\|_2 \big) \Big].$$
The strict inequality follows from the strict convexity of $L$, together with the convexity of the norm. Since both parameters are optimal, the latter expression equals the optimal cost, so we have found a parameter with a smaller objective value than the optimal cost, which is a contradiction.
It is not necessary to characterize C y as a singleton to derive the solution set itself (see Section 4.1). However, for the notion of the optimal polytope and its application to the staircase of connectivity, we will need that C y is a singleton.
Before proving the optimal polytope characterization, we show that the directions $\bar u_i, \bar v_i$ introduced in Theorem 1 are uniquely determined by solving the given optimization problem.

Proposition C.2. Consider the optimization problems
$$\min_{u \in S_i} \nu^T D_i X u, \qquad \min_{u \in S_i} -\nu^T D_i X u,$$
where $\nu \in \mathbb{R}^n$ is an arbitrary vector and
$$S_i = K_i \cap \{ u \mid \|u\|_2 \le 1 \}.$$
If the optimal objective is nonzero, there exists a unique minimizer.

Proof. Each problem is equivalent to
$$\min_{u \in S_i} (w^*)^T u,$$
which minimizes a linear objective over a convex set. Here we write $w^* = \pm X^T D_i \nu$ for convenience. Since $0 \in S_i$, a nonzero optimal objective $p^*$ must satisfy $p^* < 0$. Suppose we have two distinct minimizers $u_1^*, u_2^*$. The first thing to notice is that $\|u_1^*\|_2 = 1$: if $\|u_1^*\|_2 < 1$, we could scale it up within $S_i$ to decrease the objective. Similarly, $\|u_2^*\|_2 = 1$. As both are minimizers, we know that
$$(w^*)^T u_1^* = (w^*)^T u_2^* = p^* = (w^*)^T \Big( \frac{u_1^* + u_2^*}{2} \Big),$$
and $\|u_1^* + u_2^*\|_2 < 2$ because $u_1^* \neq u_2^*$ are both unit vectors. Since $K_i$ is a convex cone, the rescaled point $(u_1^* + u_2^*) / \|u_1^* + u_2^*\|_2$ lies in $S_i$ and has objective $2 p^* / \|u_1^* + u_2^*\|_2 < p^*$, contradicting the optimality of $u_1^*$.
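The rescaling step at the end of the proof can be illustrated numerically. The sketch below uses a hypothetical linear objective $w$ and two distinct unit vectors with the same negative objective value; the cone constraint is left out for simplicity, since a convex cone is closed under averaging and positive rescaling.

```python
import numpy as np


def normalized_midpoint(u1, u2):
    """Midpoint of two unit vectors, rescaled back to unit norm."""
    mid = 0.5 * (u1 + u2)
    return mid / np.linalg.norm(mid)


w = np.array([0.0, 1.0])                  # hypothetical linear objective
angle = 0.3
u1 = np.array([np.sin(angle), -np.cos(angle)])
u2 = np.array([-np.sin(angle), -np.cos(angle)])

p = w @ u1                                # common objective value, negative
u_new = normalized_midpoint(u1, u2)
print(p, w @ u2, w @ u_new)               # the rescaled midpoint is strictly better
```

Two distinct unit-norm candidates with equal negative value can therefore never both be minimizers, which is the uniqueness argument of Proposition C.2.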
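Theorem C.1 (Theorem 1 of the paper). Suppose $L$ is a strictly convex loss function. The directions of optimal parameters of the problem in equation 4, denoted $\bar u_i, \bar v_i$, are uniquely determined from the dual problem,
$$\bar u_i = \begin{cases} \arg\min_{u \in S_i} \nu^{*T} D_i X u & \text{if } \min_{u \in S_i} \nu^{*T} D_i X u = -\beta, \\ 0 & \text{otherwise,} \end{cases} \qquad \bar v_i = \begin{cases} \arg\min_{v \in S_i} -\nu^{*T} D_i X v & \text{if } \min_{v \in S_i} -\nu^{*T} D_i X v = -\beta, \\ 0 & \text{otherwise,} \end{cases}$$
where $\nu^*$ is any dual optimum that satisfies
$$\nu^* = \arg\max\ -L^*(\nu) \quad \text{subject to} \quad |\nu^T D_i X u| \le \beta \|u\|_2 \quad \forall u \in K_i,\ i \in [P].$$
Here, the $D_i$ are all possible arrangements $\mathrm{diag}(\mathbb{1}(Xh \ge 0))$ for $i \in [P]$, and $S_i = K_i \cap \{ u \mid \|u\|_2 \le 1 \}$.
Moreover, the solution set of equation 4 is given as a polytope,
$$\mathcal{P}^*_{\nu^*} := \Big\{ (c_i \bar u_i, d_i \bar v_i)_{i=1}^P \;\Big|\; c_i, d_i \ge 0\ \forall i \in [P],\ \sum_{i=1}^P \big( D_i X \bar u_i c_i - D_i X \bar v_i d_i \big) = y^* \Big\} \subseteq \mathbb{R}^{2dP}, \tag{13}$$
where $y^*$ is the unique optimal fit satisfying $C_y = \{y^*\}$.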
Proof. Let $\Theta^*$ denote the solution set of equation 4, and fix $\nu^*$ to be any dual optimum. The directions $\bar u_i, \bar v_i$ are uniquely determined by Proposition C.2. Let $\mathcal{P}^*_{\nu^*}$ be the set defined in equation 13, and note its dependence on $\nu^*$ (though we will see that $\mathcal{P}^*_{\nu^*} = \Theta^*$ for any choice of $\nu^*$, so the choice does not matter).
We first show that $\Theta^* \subseteq \mathcal{P}^*_{\nu^*}$. Take a point $(u_i^*, v_i^*)_{i=1}^P \in \Theta^*$. We know that $\sum_{i=1}^P D_i X (u_i^* - v_i^*) = y^*$ from Proposition 1. We would like to show the existence of $c_i, d_i$ that satisfy $c_i \ge 0$, $u_i^* = c_i \bar u_i$, $d_i \ge 0$, $v_i^* = d_i \bar v_i$, where $\bar u_i, \bar v_i$ are
$$\bar u_i = \begin{cases} \arg\min_{u \in S_i} \nu^{*T} D_i X u & \text{if } \min_{u \in S_i} \nu^{*T} D_i X u = -\beta, \\ 0 & \text{otherwise,} \end{cases} \qquad \bar v_i = \begin{cases} \arg\min_{v \in S_i} -\nu^{*T} D_i X v & \text{if } \min_{v \in S_i} -\nu^{*T} D_i X v = -\beta, \\ 0 & \text{otherwise.} \end{cases}$$
Consider the Lagrangian
$$\mathcal{L}\big( (u_i, v_i)_{i=1}^P, z, \nu \big) = L(z, y) - \nu^T z + \sum_{i=1}^P \big( \beta \|u_i\|_2 + \nu^T D_i X u_i \big) + \sum_{i=1}^P \big( \beta \|v_i\|_2 - \nu^T D_i X v_i \big),$$
where $u_i, v_i \in K_i$. We can see that
$$\min_{u_i, v_i \in K_i,\ z}\ \max_{\nu}\ \mathcal{L}\big( (u_i, v_i)_{i=1}^P, z, \nu \big) = \max_{\nu}\ \min_{u_i, v_i \in K_i,\ z}\ \mathcal{L}\big( (u_i, v_i)_{i=1}^P, z, \nu \big),$$
because $\nu$ is a dual variable associated only with linear constraints. This can be proved rigorously by following the reasoning in Boyd & Vandenberghe (2004); we include the argument for completeness.
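First, we define the set
$$A = \Big\{ \Big( w - \sum_{i=1}^P D_i X (u_i - v_i),\ t \Big) \;\Big|\; u_i, v_i \in K_i,\ L(w, y) + \beta \sum_{i=1}^P \big( \|u_i\|_2 + \|v_i\|_2 \big) \le t \Big\},$$
where $A \subseteq \mathbb{R}^n \times \mathbb{R}$. $A$ is a convex set. Now, denote the optimal value of the problem in equation 4 by $p^*$. When we define
$$B = \{ (0, s) \mid s < p^* \},$$
it is clear that $A \cap B = \emptyset$.
By the separating hyperplane theorem, there exist a nonzero $(\tilde\nu, \tilde\mu) \in \mathbb{R}^n \times \mathbb{R}$ and $\alpha$ such that
$$(z, t) \in A \ \Rightarrow\ \tilde\nu^T z + \tilde\mu t \ge \alpha \ge \tilde\mu p^*,$$
and we also know that $\tilde\mu \ge 0$: otherwise, letting $t \to \infty$ gives a contradiction. If $\tilde\mu > 0$, dividing by $\tilde\mu$ gives $\mathcal{L}\big( (u_i, v_i)_{i=1}^P, z, -\tilde\nu / \tilde\mu \big) \ge p^*$ for all $(u_i, v_i)_{i=1}^P$ and $z$, and strong duality follows. If $\tilde\mu = 0$, we conclude that for all $(u_i, v_i)_{i=1}^P$ and $z$ we have
$$\tilde\nu^T \Big( z - \sum_{i=1}^P D_i X (u_i - v_i) \Big) \ge 0,$$
which is impossible because $\tilde\nu \neq 0$ and $z$ ranges over all of $\mathbb{R}^n$. Hence, $\tilde\mu > 0$ and strong duality holds.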
Moreover, the dual problem
$$\max_{\nu}\ \min_{(u_i, v_i)_{i=1}^P,\ z}\ \mathcal{L}\big( (u_i, v_i)_{i=1}^P, z, \nu \big)$$
reads
$$\text{maximize} \ -L^*(\nu) \quad \text{subject to} \quad \beta \|u\|_2 \ge |\nu^T D_i X u| \quad \forall u \in K_i,\ i \in [P].$$
The reason is the following: suppose that for some $\nu' \in \mathbb{R}^n$ there exists $u_i' \in K_i$ with $\nu'^T D_i X u_i' + \beta \|u_i'\|_2 < 0$. Scaling $t u_i'$ with $t \to \infty$ shows that for this $\nu'$,
$$\min_{(u_i, v_i)_{i=1}^P,\ z}\ \mathcal{L}\big( (u_i, v_i)_{i=1}^P, z, \nu' \big) = -\infty.$$
Hence, this $\nu'$ cannot be a dual optimum. This means we only need to consider the $\nu$ that satisfy $\nu^T D_i X u + \beta \|u\|_2 \ge 0$ for all $u \in K_i$, $i \in [P]$. Similarly, we only need to consider the $\nu$ that satisfy $-\nu^T D_i X u + \beta \|u\|_2 \ge 0$ for all $u \in K_i$, $i \in [P]$. Hence, $\nu^*$ is the maximizer of
$$\max_{\nu}\ \min_{z}\ L(z, y) - \nu^T z \quad \text{subject to} \quad \beta \|u\|_2 \ge |\nu^T D_i X u| \quad \forall u \in K_i,\ i \in [P],$$
and the rest follows.
As strong duality holds, for any primal optimum $\big( (u_i^*, v_i^*)_{i=1}^P, y^* \big)$, the function $\mathcal{L}\big( (u_i, v_i)_{i=1}^P, z, \nu^* \big)$ attains its minimum at $\big( (u_i^*, v_i^*)_{i=1}^P, y^* \big)$ Boyd & Vandenberghe (2004). Note that $z^*$ is always $y^*$ due to Proposition 1, so we have replaced it accordingly. Now, as
$$\mathcal{L}\big( (u_i, v_i)_{i=1}^P, z, \nu^* \big) = L(z, y) - \nu^{*T} z + \sum_{i=1}^P \big( \beta \|u_i\|_2 + \nu^{*T} D_i X u_i \big) + \sum_{i=1}^P \big( \beta \|v_i\|_2 - \nu^{*T} D_i X v_i \big),$$
each $u_i^*$ is a minimizer of $\beta \|u\|_2 + \nu^{*T} D_i X u$ subject to $u \in K_i$, and each $v_i^*$ is a minimizer of $\beta \|u\|_2 - \nu^{*T} D_i X u$ subject to $u \in K_i$. Recall that $\nu^*$ satisfies $\beta \|u\|_2 \ge |\nu^{*T} D_i X u|$ for all $u \in K_i$, $i \in [P]$, and that at $u = 0$ both $\beta \|u\|_2 + \nu^{*T} D_i X u$ and $\beta \|u\|_2 - \nu^{*T} D_i X u$ have function value 0. This implies that the minimum of both $\beta \|u\|_2 + \nu^{*T} D_i X u$ and $\beta \|u\|_2 - \nu^{*T} D_i X u$ subject to $u \in K_i$ is 0 for all $i \in [P]$. As $\big( (u_i^*, v_i^*)_{i=1}^P, y^* \big)$ minimizes $\mathcal{L}\big( (u_i, v_i)_{i=1}^P, z, \nu^* \big)$,
$$\beta \|u_i^*\|_2 + \nu^{*T} D_i X u_i^* = 0, \qquad \beta \|v_i^*\|_2 - \nu^{*T} D_i X v_i^* = 0.$$
We will find $c_i \ge 0$; finding $d_i$ is identical. We divide into cases.
i) When $u_i^* = 0$, let $c_i = 0$, which gives $c_i \ge 0$ with $u_i^* = c_i \bar u_i$.
ii) When $u_i^* \neq 0$, notice that
$$\min_{u \in S_i} \nu^{*T} D_i X u = -\beta \neq 0,$$
and the optimum is attained at $u_i^* / \|u_i^*\|_2$. To see this, recall that $(\nu^*)^T D_i X u + \beta \|u\|_2 \ge 0$, i.e., $(\nu^*)^T D_i X u / \|u\|_2 \ge -\beta$ for all nonzero $u \in K_i$, so $\min_{u \in S_i} (\nu^*)^T D_i X u \ge -\beta$, and equality is attained at $u_i^* / \|u_i^*\|_2$ because $\beta \|u_i^*\|_2 + \nu^{*T} D_i X u_i^* = 0$.
Furthermore, by Proposition C.2, the problem $\min_{u \in S_i} (\nu^*)^T D_i X u$ has a unique minimizer, so $u_i^* / \|u_i^*\|_2 = \bar u_i$. Hence choosing $c_i = \|u_i^*\|_2$ gives $c_i \ge 0$ with $u_i^* = c_i \bar u_i$. We have thus found $c_i \ge 0$, $d_i \ge 0$ that satisfy $u_i^* = c_i \bar u_i$, $v_i^* = d_i \bar v_i$ and
$$\sum_{i=1}^P D_i X (u_i^* - v_i^*) = \sum_{i=1}^P D_i X (c_i \bar u_i - d_i \bar v_i) = y^*,$$
meaning $(u_i^*, v_i^*)_{i=1}^P \in \mathcal{P}^*_{\nu^*}$.
Now, we show that $\mathcal{P}^*_{\nu^*} \subseteq \Theta^*$. Take an element $(c_i \bar u_i, d_i \bar v_i)_{i=1}^P \in \mathcal{P}^*_{\nu^*}$. It is clear that $c_i \bar u_i, d_i \bar v_i \in K_i$, so the point is feasible. If $\bar u_i \neq 0$, we know that $(\nu^*)^T D_i X \bar u_i = -\beta$. Similarly, if $\bar v_i \neq 0$, we know that $-(\nu^*)^T D_i X \bar v_i = -\beta$. Also, if $\bar u_i, \bar v_i \neq 0$, then $\|\bar u_i\|_2 = 1$, $\|\bar v_i\|_2 = 1$; see the proof of Proposition C.2 for why this holds. Now, we calculate the objective of $(c_i \bar u_i, d_i \bar v_i)_{i=1}^P$. We know that $\sum_{i=1}^P D_i X (c_i \bar u_i - d_i \bar v_i) = y^*$, hence the objective becomes
$$L(y^*, y) + \beta \sum_{\bar u_i \neq 0} c_i + \beta \sum_{\bar v_i \neq 0} d_i,$$
using $\|\bar u_i\|_2 = 1$, $\|\bar v_i\|_2 = 1$. Now, as $\sum_{i=1}^P D_i X (c_i \bar u_i - d_i \bar v_i) = y^*$, multiplying both sides by $(\nu^*)^T$ gives
$$\sum_{\bar u_i \neq 0} c_i + \sum_{\bar v_i \neq 0} d_i = -\langle \nu^*, y^* \rangle / \beta.$$
Hence, the objective equals $L(y^*, y) - \langle \nu^*, y^* \rangle$, which is constant over all points in $\mathcal{P}^*_{\nu^*}$. We already know that $\Theta^* \subseteq \mathcal{P}^*_{\nu^*}$, so this constant is the optimal value. This means all points in $\mathcal{P}^*_{\nu^*}$ attain the optimal objective value, and $\mathcal{P}^*_{\nu^*} \subseteq \Theta^*$. This finishes the proof.
$$\Big\{ \frac{w_j}{\|w_j\|_2} \;\Big|\; (w_i, \alpha_i)_{i=1}^m \in \Theta_C,\ w_j \neq 0 \Big\},$$
is finite, and each direction is determined by the dual optimum of the subsampled convex program.
Proof. From Ergen & Pilanci (2023), we know that all Clarke stationary points of equation 3 have a corresponding subsampled convex problem. More specifically, for any $(w_i, \alpha_i)_{i=1}^m \in \Theta_C$, we have a convex program with subsampled arrangement patterns $\tilde D_1, \tilde D_2, \cdots, \tilde D_m \in \{D_i\}_{i=1}^P$,
$$\min_{u_i, v_i \in \tilde K_i} L\Big( \sum_{i=1}^m \tilde D_i X (u_i - v_i),\ y \Big) + \beta \sum_{i=1}^m \big( \|u_i\|_2 + \|v_i\|_2 \big),$$
and a solution mapping
$$(w_i, \alpha_i) = \begin{cases} (u_i / \|u_i\|_2,\ \|u_i\|_2) & \text{if } u_i \neq 0, \\ (v_i / \|v_i\|_2,\ -\|v_i\|_2) & \text{if } v_i \neq 0. \end{cases}$$
Hence, the set of first-layer directions of Clarke stationary points is contained in the set of optimal directions of the subsampled convex programs. As there are only finitely many subsampled convex programs, and each convex program has a unique set of fixed optimal directions, we know that the set
$$\bigcup_{j=1}^m \Big\{ \frac{w_j}{\|w_j\|_2} \;\Big|\; (w_i, \alpha_i)_{i=1}^m \in \Theta_C,\ w_j \neq 0 \Big\}$$
is a finite set. Furthermore, applying Theorem 1 to the subsampled convex program leads to the desired result.

Section: D PROOFS IN SECTION 3.2
In this section we prove Theorem 2, using the cardinality-constrained optimal polytope $\mathcal{P}^*(m)$ defined in equation 7. One thing to keep in mind is that we are not trying to prove that $\mathcal{P}^*(m)$ and $\Theta^*(m)$, the optimal set of the original problem in equation 3 with width $m$, are homeomorphic. Rather, we will argue that certain mappings enable us to link the connectivity behavior of $\mathcal{P}^*(m)$ and $\Theta^*(m)$ to arrive at Theorem 2.
We start by defining some relevant concepts. As a starting point, we define the cardinality of a solution.

Definition D.1. The cardinality of a solution $(u_i, v_i)_{i=1}^P \in \mathcal{P}^*$ is defined as
$$\mathrm{card}\big( (u_i, v_i)_{i=1}^P \big) = \sum_{i=1}^P \mathbb{1}(u_i \neq 0) + \mathbb{1}(v_i \neq 0).$$
We introduce the cardinality-constrained optimal polytope again.

Definition D.2. The cardinality-constrained optimal polytope $\mathcal{P}^*(m)$ is defined as the set
$$\mathcal{P}^*(m) := \Big\{ (u_i, v_i)_{i=1}^P \;\Big|\; (u_i, v_i)_{i=1}^P \in \mathcal{P}^*,\ \mathrm{card}\big( (u_i, v_i)_{i=1}^P \big) \le m \Big\} \subseteq \mathbb{R}^{2dP}. \tag{14}$$
One remark is that the largest possible cardinality in $\mathcal{P}^*$ may be huge: in the worst case, it could be on the scale of $P$, the number of all possible arrangement patterns.
In general, the number of arrangement patterns is $O(n^d)$, hence $\mathcal{P}^*(m)$ makes up a very small portion of $\mathcal{P}^*$.
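To get a feeling for the arrangement patterns $D_i = \mathrm{diag}(\mathbb{1}(Xh \ge 0))$, one simple heuristic is to sample random directions $h$ and collect the distinct sign patterns. This sampling scheme is a common approximation and an assumption of this sketch, not the paper's procedure; in general it finds only a subset of the patterns. The function name is hypothetical.

```python
import numpy as np


def sample_arrangement_patterns(X, num_samples=2000, seed=0):
    """Collect distinct patterns 1(X h >= 0) over random Gaussian directions h."""
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((X.shape[1], num_samples))
    patterns = (X @ H >= 0).astype(int).T      # one 0/1 row per sampled h
    return np.unique(patterns, axis=0)


# Toy data from Appendix A: two points in R^2 (second column is the bias feature).
X = np.array([[-np.sqrt(3.0), 1.0], [np.sqrt(3.0), 1.0]])
patterns = sample_arrangement_patterns(X)
print(patterns)                                # each row is the diagonal of one D_i
```

For the toy data, $X$ is invertible, so every one of the $2^2$ sign patterns is realized by some $h$.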
The next concept we introduce is the notion of irreducible solutions. This set can be understood as the set of minimal networks discussed in Mishkin & Pilanci (2023), and is used to define the critical widths of the staircase.

Definition D.3. The irreducible solution set is defined as the set
$$\mathcal{P}^*_{\mathrm{irr}} = \Big\{ (u_i, v_i)_{i=1}^P \;\Big|\; (u_i, v_i)_{i=1}^P \in \mathcal{P}^*,\ \{D_i X u_i\}_{u_i \neq 0} \cup \{D_i X v_i\}_{v_i \neq 0} \text{ linearly independent} \Big\}. \tag{15}$$
One intuition for the irreducible solution set is that it is the set of "smallest solutions" Mishkin & Pilanci (2023): if the set $\{D_i X u_i\}_{u_i \neq 0} \cup \{D_i X v_i\}_{v_i \neq 0}$ is linearly dependent, we can find a strictly smaller conic combination using the vectors from $\{D_i X u_i\}_{u_i \neq 0} \cup \{-D_i X v_i\}_{v_i \neq 0}$.
The set $\mathcal{P}^*_{\mathrm{irr}}$ can be understood as the collection of solutions obtained by repeating this "pruning step", a step that finds smaller solutions using linear dependence. A more rigorous definition of pruning is given in Definition D.4. Note that the existence of $m$ with a nonzero solution in $\Theta^*(m)$ implies $\mathcal{P}^*_{\mathrm{irr}} \neq \emptyset$, which is equivalent to having a nonzero element in $\mathcal{P}^*$. We assume the nontrivial case where $\mathcal{P}^*$ has a nonzero element from now on.

Proposition D.1. The following three statements are equivalent:
i) There exists $m$ such that some nonzero $(w_i, \alpha_i)_{i=1}^m$ lies in $\Theta^*(m)$.
ii) There exists a nonzero $(u_i, v_i)_{i=1}^P \in \mathcal{P}^*$.
iii) $\mathcal{P}^*_{\mathrm{irr}} \neq \emptyset$.

Proof. i) $\Rightarrow$ ii): First assume $m \ge 2P$. Consider $\Phi\big( (w_i, \alpha_i)_{i=1}^m \big) = (u_i, v_i)_{i=1}^P$, where $\Phi$ is defined in Definition D.8. We know that $(u_i, v_i)_{i=1}^P$ is not a solution of zeros: if it were, then $\sum_{i=1}^m (X w_i)_+ \alpha_i = 0$ and $(0, 0)_{i=1}^m$ would have a strictly smaller objective, contradicting $(w_i, \alpha_i)_{i=1}^m \in \Theta^*(m)$. Now denote the optimal value of the nonconvex objective in equation 3 by $p_m^*$ and the optimal value of the convex objective in equation 4 by $p^*$; then $p_m^* \le p^*$. Also, writing $L$ for the convex objective, we know that $L\big( (u_i, v_i)_{i=1}^P \big) = p_m^* \ge p^*$. Hence we have found a nonzero $(u_i, v_i)_{i=1}^P$ with objective value $p^*$, which means there is a nonzero point in $\mathcal{P}^*$.
Now let $m < 2P$. Assume $\mathcal{P}^* = \{(0, 0)_{i=1}^P\}$. We know that $p^* \le p_m^*$ and $p^* = \frac{1}{2} \|y\|_2^2$, hence $\frac{1}{2} \|y\|_2^2 \le p_m^*$. On the other hand, the value $\frac{1}{2} \|y\|_2^2$ is achievable by setting $(0, 0)_{i=1}^m$, which means $p_m^* = p^*$. By the same logic, $\Phi\big( (w_i, \alpha_i)_{i=1}^m \big)$ is not a solution of zeros and its objective value equals $p_m^* = p^*$, which contradicts $\mathcal{P}^* = \{(0, 0)_{i=1}^P\}$ since $(u_i, v_i)_{i=1}^P$ is nonzero and in $\mathcal{P}^*$.
ii) $\Rightarrow$ iii): We use the pruning step in Definition D.4 to find an element in $\mathcal{P}^*_{\mathrm{irr}}$. Note that the pruning step does not end with 0, as $(u_i, v_i)_{i=1}^P$ is not zero and the pruning step does not decrease the objective.
iii) $\Rightarrow$ ii): As $\mathcal{P}^*_{\mathrm{irr}} \neq \emptyset$, there is a nonzero solution in $\mathcal{P}^*_{\mathrm{irr}}$. As $\mathcal{P}^*_{\mathrm{irr}} \subseteq \mathcal{P}^*$, a nonzero solution exists in $\mathcal{P}^*$.
ii) $\Rightarrow$ i): Set $m = 2P$ and consider $\Psi\big( (u_i, v_i)_{i=1}^P \big)$, where $\Psi$ is defined in Definition D.7.

Definition D.4 (Mishkin & Pilanci (2023)). Pruning a solution $(u_i, v_i)_{i=1}^P \in \mathcal{P}^*$ means repeating:
1. Finding a nontrivial linear combination
$$\sum_{u_i \neq 0} c_i D_i X u_i + \sum_{v_i \neq 0} d_i D_i X v_i = 0,$$
and without loss of generality assuming $d_1 > 0$.
2. Constructing a solution with strictly smaller cardinality,
$$(u_i', v_i')_{i=1}^P = \big( (1 + c_i t) u_i,\ (1 - d_i t) v_i \big)_{i=1}^P, \quad \text{where } t = \min\Big\{ \min_{c_i < 0} \frac{-1}{c_i},\ \min_{d_i > 0} \frac{1}{d_i} \Big\},$$
and $c_i, d_i$ are the coefficients from step 1 when $u_i, v_i \neq 0$ and 0 otherwise,
until the set $\{D_i X u_i\}_{u_i \neq 0} \cup \{D_i X v_i\}_{v_i \neq 0}$ is linearly independent.
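The pruning loop of Definition D.4 can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the implementation of Mishkin & Pilanci (2023): the function name `prune`, the tolerance, and the use of an SVD to find the linear dependence are choices made here, and the columns of `G` are taken to be the signed fits $D_i X u_i$ and $-D_i X v_i$, so that a single nonnegative multiplier per column plays the role of $(1 + c_i t)$ and $(1 - d_i t)$.

```python
import numpy as np


def prune(G, tol=1e-10):
    """Return multipliers s >= 0 with G @ s == G.sum(axis=1) and independent active columns.

    G has shape (n, k); column j is the signed fit of one active neuron
    (D_i X u_i or -D_i X v_i; this sign convention is an assumption of the sketch).
    """
    G = np.asarray(G, dtype=float)
    scale = np.ones(G.shape[1])
    while True:
        active = np.flatnonzero(scale > tol)
        if len(active) == 0:
            return scale
        A = G[:, active] * scale[active]
        if np.linalg.matrix_rank(A) == len(active):
            return scale                       # active columns are independent
        # Step 1: a nontrivial combination A @ c = 0 (last right singular vector).
        c = np.linalg.svd(A)[2][-1]
        if not np.any(c < -tol):
            c = -c                             # make sure some entry is negative
        # Step 2: largest step that keeps every multiplier 1 + t * c_j nonnegative.
        t = np.min(-1.0 / c[c < -tol])
        factor = 1.0 + t * c
        factor[factor < tol] = 0.0             # at least one neuron is removed
        scale[active] *= factor


# Three dependent fits in R^2 whose sum is [1, 1].
G = np.array([[0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
s = prune(G)
print(s, G @ s)
```

Each pass keeps the total fit unchanged and zeroes out at least one multiplier, so the loop stops after at most $k$ passes. Which irreducible solution is reached depends on the sign of the dependence vector returned by the SVD.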
The notion of minimality gives a discrete structure to $\mathcal{P}^*_{\mathrm{irr}}$, from which the phase-transition behavior follows. The two critical widths of interest are the minimum and maximum cardinality in $\mathcal{P}^*_{\mathrm{irr}}$. We denote
$$m^* := \min_{(u_i, v_i)_{i=1}^P \in \mathcal{P}^*_{\mathrm{irr}}} \mathrm{card}\big( (u_i, v_i)_{i=1}^P \big), \qquad M^* := \max_{(u_i, v_i)_{i=1}^P \in \mathcal{P}^*_{\mathrm{irr}}} \mathrm{card}\big( (u_i, v_i)_{i=1}^P \big).$$
Remark D.1. These widths can be found computationally by the following scheme: for $t = 1$ to $n$, choose $t$ vectors from the set $\{D_i X \bar u_i\}_{i=1}^P \cup \{D_i X \bar v_i\}_{i=1}^P$, where $\bar u_i, \bar v_i$ are the optimal directions defined in Theorem 1. Check whether they are linearly independent and whether $y^*$ can be expressed as a conic combination of the $t$ vectors. The first value of $t$ that meets both criteria is $m^*$, and $M^*$ is updated each time a value of $t$ meets both criteria, until $t$ reaches $n$.
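The scheme in Remark D.1 can be written as a brute-force search. The sketch below makes two assumptions that the remark leaves implicit: the generators are passed with their signs already applied ($D_i X \bar u_i$ and $-D_i X \bar v_i$), and a subset of size $t$ only counts if all $t$ conic coefficients are strictly positive, so that the resulting solution has cardinality exactly $t$. The function name is hypothetical, and the search is exponential in the number of generators.

```python
import itertools

import numpy as np


def critical_widths(generators, y_star, tol=1e-9):
    """Brute-force (m*, M*) from signed generator vectors (rows) and the optimal fit."""
    G = np.asarray(generators, dtype=float)    # shape (k, n)
    y_star = np.asarray(y_star, dtype=float)
    feasible = []
    for t in range(1, G.shape[1] + 1):
        for idx in itertools.combinations(range(G.shape[0]), t):
            A = G[list(idx)].T                 # (n, t): chosen vectors as columns
            if np.linalg.matrix_rank(A) < t:
                continue                       # not linearly independent
            coef = np.linalg.lstsq(A, y_star, rcond=None)[0]
            exact = np.linalg.norm(A @ coef - y_star) <= tol
            if exact and np.all(coef > tol):   # strictly positive conic combination
                feasible.append(t)
                break
    return (min(feasible), max(feasible)) if feasible else None


# Toy example of Appendix A: (X u)_+ for the directions at 90, 30 and 150 degrees.
kappa = 1.0 - 0.1 / 2.0
generators = [[1.0, 1.0], [0.0, 2.0], [2.0, 0.0]]
print(critical_widths(generators, kappa * np.array([1.0, 1.0])))
```

For the toy example this yields $m^* = 1$ and $M^* = 2$, in line with the unique optimum at $m = 1$ and the isolated optimum at $m = 2$ described in Appendix A.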
The two specific discontinuity results we can achieve are the following.

Proposition D.2. $\mathcal{P}^*(m^*)$ is a finite set.

Proof. Consider two distinct points $(u_i, v_i)_{i=1}^P, (u_i', v_i')_{i=1}^P \in \mathcal{P}^*(m^*)$. Suppose the two points have the same support, i.e., $u_i \neq 0 \Leftrightarrow u_i' \neq 0$ and $v_i \neq 0 \Leftrightarrow v_i' \neq 0$ for $i \in [P]$. As $\mathcal{P}^*(m^*) \subseteq \mathcal{P}^*$, we know that
$$\sum_{i=1}^P D_i X (u_i - v_i) = y^* = \sum_{i=1}^P D_i X (u_i' - v_i').$$
Now, write the indices as $\{i \mid u_i \neq 0\} = \{a_1, a_2, \cdots, a_t\}$ and $\{i \mid v_i \neq 0\} = \{b_1, b_2, \cdots, b_s\}$. We have $t + s \le m^*$ as $(u_i, v_i)_{i=1}^P \in \mathcal{P}^*(m^*)$. From Theorem 1, there exist $c_{a_i}, c_{a_i}' \ge 0$ for $i \in [t]$ and $d_{b_i}, d_{b_i}' \ge 0$ for $i \in [s]$ that satisfy $u_{a_i} = c_{a_i} \bar u_{a_i}$, $u_{a_i}' = c_{a_i}' \bar u_{a_i}$ for all $i \in [t]$, and $v_{b_i} = d_{b_i} \bar v_{b_i}$, $v_{b_i}' = d_{b_i}' \bar v_{b_i}$ for all $i \in [s]$. This means that
$$\sum_{i=1}^t c_{a_i} D_{a_i} X \bar u_{a_i} - \sum_{i=1}^s d_{b_i} D_{b_i} X \bar v_{b_i} = \sum_{i=1}^t c_{a_i}' D_{a_i} X \bar u_{a_i} - \sum_{i=1}^s d_{b_i}' D_{b_i} X \bar v_{b_i} = y^*,$$
and as the coefficients $(c_{a_i}, d_{b_i})$ and $(c_{a_i}', d_{b_i}')$ are not all the same, the set
$$\{D_{a_i} X \bar u_{a_i}\}_{i=1}^t \cup \{D_{b_i} X \bar v_{b_i}\}_{i=1}^s$$
is linearly dependent. Now we apply the pruning defined in Definition D.4 to find an irreducible solution with cardinality strictly less than $t + s \le m^*$, which contradicts the minimality of $m^*$. This implies that two different points in $\mathcal{P}^*(m^*)$ cannot have identical support, and the number of points in the set is upper bounded by the number of possible supports, which is finite. More specifically,
$$|\mathcal{P}^*(m^*)| \le \sum_{j=1}^{m^*} \binom{2P}{j}.$$
Proposition D.3. $\mathcal{P}^*(M^*)$ has an isolated point, i.e., there is a point $p \in \mathcal{P}^*(M^*)$ such that no path in $\mathcal{P}^*(M^*)$ connects $p$ to a different point $p'$.

Proof. Take a maximal-cardinality solution $(u_i^\bullet, v_i^\bullet)_{i=1}^P \in \mathcal{P}^*_{\mathrm{irr}}$, namely a solution with $(u_i^\bullet, v_i^\bullet)_{i=1}^P \in \mathcal{P}^*$, $\{D_i X u_i^\bullet\}_{u_i^\bullet \neq 0} \cup \{D_i X v_i^\bullet\}_{v_i^\bullet \neq 0}$ linearly independent, and $\mathrm{card}\big( (u_i^\bullet, v_i^\bullet)_{i=1}^P \big) = M^*$. Assume the existence of a continuous function $f : [0, 1] \to \mathcal{P}^*(M^*)$ satisfying $f(0) = (u_i^\bullet, v_i^\bullet)_{i=1}^P$, $f(1) = (u_i', v_i')_{i=1}^P$, $f(0) \neq f(1)$. Now, write $f(t) = (u_i(t), v_i(t))_{i=1}^P$ and define
$$c_i(t) = \begin{cases} 0 & \text{if } \bar u_i = 0, \\ \|u_i(t)\|_2 & \text{otherwise,} \end{cases} \qquad d_i(t) = \begin{cases} 0 & \text{if } \bar v_i = 0, \\ \|v_i(t)\|_2 & \text{otherwise.} \end{cases}$$
For the definition of $\bar u_i, \bar v_i$, see Theorem 1. Some things to notice are:
i) The functions $c_i(t), d_i(t) : [0, 1] \to \mathbb{R}$ are continuous.
ii) $f(t) = (c_i(t) \bar u_i, d_i(t) \bar v_i)_{i=1}^P$. This holds because if $\bar u_i \neq 0$ then $\|\bar u_i\|_2 = 1$, and the same holds for $\bar v_i$.
iii) $\sum_{i=1}^P \mathbb{1}(c_i(t) \neq 0) + \mathbb{1}(d_i(t) \neq 0) \le M^*$, and $\sum_{i=1}^P \mathbb{1}(c_i(0) \neq 0) + \mathbb{1}(d_i(0) \neq 0) = M^*$. The former holds because $f$ is a path in $\mathcal{P}^*(M^*)$, and the latter holds because $(u_i^\bullet, v_i^\bullet)_{i=1}^P$ has cardinality $M^*$.
iv) There exists $t' \in [0, 1]$ that satisfies $(c_i(t'), d_i(t'))_{i=1}^P \neq (c_i(0), d_i(0))_{i=1}^P$, because $f(0) \neq f(1)$.
Based on these observations, we prove that if such an $f$ exists, the set
$$\{D_i X u_i^\bullet\}_{u_i^\bullet \neq 0} \cup \{D_i X v_i^\bullet\}_{v_i^\bullet \neq 0}$$
is linearly dependent. This is a contradiction, which shows that there is no such $f$ and that $(u_i^\bullet, v_i^\bullet)_{i=1}^P$ is isolated. Define $t_1$ as
$$t_1 = \inf_{t \ge 0} \Big\{ t \;\Big|\; \sum_{i=1}^P (c_i(0) - c_i(t))^2 + (d_i(0) - d_i(t))^2 > 0 \Big\}.$$
In other words, $t_1$ is the first instant at which $f(t)$ starts to differ from $f(0)$. From observation iv), we know that $t_1 \in [0, 1]$. Another fact that we can deduce is that
$$\sum_{i=1}^P (c_i(0) - c_i(t_1))^2 + (d_i(0) - d_i(t_1))^2 = 0. \tag{16}$$
The reason is that if $\sum_{i=1}^P (c_i(0) - c_i(t_1))^2 + (d_i(0) - d_i(t_1))^2 > 0$, then by continuity we can find some $\epsilon > 0$ such that
$$\sum_{i=1}^P (c_i(0) - c_i(t_1 - \epsilon))^2 + (d_i(0) - d_i(t_1 - \epsilon))^2 > 0,$$
which contradicts the fact that $t_1$ is the infimum (because we have found a smaller $t$ for which $\sum_{i=1}^P (c_i(0) - c_i(t))^2 + (d_i(0) - d_i(t))^2 > 0$). Hence, for $t \in [0, t_1]$, $\sum_{i=1}^P (c_i(0) - c_i(t))^2 + (d_i(0) - d_i(t))^2 = 0$.
Finally, we know that for any $\epsilon > 0$, there exists $t_\epsilon \in (t_1, t_1 + \epsilon)$ that satisfies
$$\sum_{i=1}^P (c_i(0) - c_i(t_\epsilon))^2 + (d_i(0) - d_i(t_\epsilon))^2 > 0. \tag{17}$$
Indeed, if for some $\epsilon > 0$ there is no $t \in (t_1, t_1 + \epsilon)$ that satisfies $\sum_{i=1}^P (c_i(0) - c_i(t))^2 + (d_i(0) - d_i(t))^2 > 0$, then for all $t \in [0, t_1 + \frac{\epsilon}{2}]$ we have $\sum_{i=1}^P (c_i(0) - c_i(t))^2 + (d_i(0) - d_i(t))^2 = 0$; hence, the infimum would be strictly larger than $t_1$, which is a contradiction. Now we prove the claim that if such an $f$ exists, the set
$\{D_i X u_i^\bullet\}_{u_i^\bullet \neq 0} \cup \{D_i X v_i^\bullet\}_{v_i^\bullet \neq 0}$ is linearly dependent. From equation 16, we know that $c_i(0) = c_i(t_1)$, $d_i(0) = d_i(t_1)$ for all $i \in [P]$.
Take $\epsilon_0 > 0$ sufficiently small so that for all $i \in [P]$ with $c_i(0) > 0$ we have $c_i(t) > 0$ for all $t \in [t_1 - \epsilon_0, t_1 + \epsilon_0]$, and for all $i \in [P]$ with $d_i(0) > 0$ we have $d_i(t) > 0$ for all $t \in [t_1 - \epsilon_0, t_1 + \epsilon_0]$. Such an $\epsilon_0$ exists due to the continuity of $c_i, d_i$. By definition, we know that
$$M^* = \sum_{i=1}^P \mathbb{1}(c_i(0) \neq 0) + \mathbb{1}(d_i(0) \neq 0) \le \sum_{i=1}^P \mathbb{1}(c_i(t) \neq 0) + \mathbb{1}(d_i(t) \neq 0) \le M^*$$
(see observation iii)) for all $t \in [t_1 - \epsilon_0, t_1 + \epsilon_0]$, and hence
$$\sum_{i=1}^P \mathbb{1}(c_i(t) \neq 0) + \mathbb{1}(d_i(t) \neq 0) = M^*, \quad \forall t \in [t_1 - \epsilon_0, t_1 + \epsilon_0].$$
This means that for all $t \in [t_1 - \epsilon_0, t_1 + \epsilon_0]$, we know that $c_i(0) > 0 \Leftrightarrow c_i(t) > 0$ and $d_i(0) > 0 \Leftrightarrow d_i(t) > 0$.
For this $\epsilon_0$, we can find $t_{\epsilon_0}$ as in equation 17. As $t_{\epsilon_0}$ satisfies $\sum_{i=1}^P (c_i(0) - c_i(t_{\epsilon_0}))^2 + (d_i(0) - d_i(t_{\epsilon_0}))^2 > 0$, we have $(c_i(0), d_i(0))_{i=1}^P \neq (c_i(t_{\epsilon_0}), d_i(t_{\epsilon_0}))_{i=1}^P$. Also, $c_i(0) > 0 \Leftrightarrow c_i(t_{\epsilon_0}) > 0$ and $d_i(0) > 0 \Leftrightarrow d_i(t_{\epsilon_0}) > 0$. Now we have found two different solutions $(c_i(0) \bar u_i, d_i(0) \bar v_i)_{i=1}^P, (c_i(t_{\epsilon_0}) \bar u_i, d_i(t_{\epsilon_0}) \bar v_i)_{i=1}^P \in \mathcal{P}^*(M^*)$, which means that
$$\sum_{i=1}^P c_i(0) D_i X \bar u_i - d_i(0) D_i X \bar v_i = y^* = \sum_{i=1}^P c_i(t_{\epsilon_0}) D_i X \bar u_i - d_i(t_{\epsilon_0}) D_i X \bar v_i. \tag{18}$$
As $(c_i(0), d_i(0))_{i=1}^P \neq (c_i(t_{\epsilon_0}), d_i(t_{\epsilon_0}))_{i=1}^P$ and $c_i(0) > 0 \Leftrightarrow c_i(t_{\epsilon_0}) > 0$, $d_i(0) > 0 \Leftrightarrow d_i(t_{\epsilon_0}) > 0$, equation 18 gives two different linear combinations of the set $\{D_i X u_i^\bullet\}_{u_i^\bullet \neq 0} \cup \{D_i X v_i^\bullet\}_{v_i^\bullet \neq 0}$ that represent $y^*$; hence the set is linearly dependent. As we claimed, assuming a continuous path from $(u_i^\bullet, v_i^\bullet)_{i=1}^P$ to a different point leads to a contradiction. Hence the point is isolated.
Next, we turn to the connectivity of $\mathcal{P}^*(m)$. Our starting point is the observation that for any $(u_i, v_i)_{i=1}^P \in \mathcal{P}^*$ that is not in $\mathcal{P}^*_{\mathrm{irr}}$, the pruning mechanism defined in Definition D.4 gives a continuous path into $\mathcal{P}^*_{\mathrm{irr}}$. This means that if we can connect any two points in $\mathcal{P}^*_{\mathrm{irr}} \cap \mathcal{P}^*(m)$ with a continuous path in $\mathcal{P}^*(m)$, the set $\mathcal{P}^*(m)$ is connected.

Proposition D.4. Consider $(u_i, v_i)_{i=1}^P \in \mathcal{P}^* \setminus \mathcal{P}^*_{\mathrm{irr}}$. Let $m = \mathrm{card}\big( (u_i, v_i)_{i=1}^P \big)$. Then, there exists a continuous path in $\mathcal{P}^*(m)$ that starts at $(u_i, v_i)_{i=1}^P$ and ends at a different point $(u_i', v_i')_{i=1}^P \in \mathcal{P}^*(m) \cap \mathcal{P}^*_{\mathrm{irr}}$.
Proof. We will find such path by pruning the solution as in Definition D.4. For each iteration of the pruning step, the starting solution and the ending solution is connected by a continuous path.
As we iterate, we concatenate each continuous path, hence the resulting path should be continuous.
The next thing we have to check is that the path is contained in P * (m). We can see this due to the fact that in each pruning iteration, the cardinality of the solution does not increase, and the initial cardinality of the solution is m. At last, when pruning ends we arrive at a irreducible solution, meaning the final solution we get from pruning is in P * (m) ∩ P * irr .
We have two different strategies to prove the connectedness of $\mathcal{P}^*(m)$. One is directly interpolating the two solutions in $\mathcal{P}^*_{irr}$, and increasing $m$ to guarantee the validity of such interpolation. From this, we obtain one critical width $m^* + M^*$. Now, take any two points $A', B' \in \mathcal{P}^*(m^* + M^*)$. From Proposition D.4, there exists a continuous path in $\mathcal{P}^*(m^* + M^*)$ from $A'$ to a certain $A_{irr} \in \mathcal{P}^*(m^* + M^*) \cap \mathcal{P}^*_{irr}$, and similarly there exists a path from $B'$ to some $B_{irr}$. At last, there is a continuous path from $A_{irr}$ to $B_{irr}$ in $\mathcal{P}^*(m^* + M^*)$. Concatenating these paths gives a continuous path from $A'$ to $B'$ in $\mathcal{P}^*(m^* + M^*)$.
The other strategy is more involved: rather than directly interpolating two solutions $A, B$ in $\mathcal{P}^*_{irr}$, we repeatedly interpolate $A$ with parts of $B$ until the two are connected by a path. We start with a particular lemma.
Lemma D.1. Suppose we have two linearly independent sets $A = \{a_1, a_2, \cdots, a_m\}$, $B = \{b_1, b_2, \cdots, b_k\} \subseteq \mathbb{R}^n$ and a given subset $I = \{a_{i_1}, a_{i_2}, \cdots, a_{i_t}\} \subset A$. Also, suppose
$$\sum_{i=1}^{m} \lambda_i a_i = \sum_{i=1}^{k} \mu_i b_i$$
for some $\lambda \in \mathbb{R}^m$ that satisfies $\sum_{j=1}^{t} \lambda_{i_j} > 0$, and $\mu \in \mathbb{R}^k$ that satisfies $\mu > 0$. Then, there exists a vector $\mu^* \in \mathbb{R}^k$ that satisfies the following three properties:
1) $\|\mu^*\|_0 \le n - m + 1$.
2) $\mu^* \ge 0$.
3) $\sum_{i=1}^{k} \mu^*_i b_i \in \mathrm{span}(\{a_1, a_2, \cdots, a_m\})$, and when we express $\sum_{i=1}^{k} \mu^*_i b_i = \sum_{i=1}^{m} \delta_i a_i$, we have $\sum_{j=1}^{t} \delta_{i_j} > 0$.
Proof. If $k \le n - m + 1$ there is nothing to prove. Assume $k > n - m + 1$. It is enough to show the existence of a vector $\tilde{\mu}$ that satisfies $\|\tilde{\mu}\|_0 < \|\mu\|_0$, $\tilde{\mu} \ge 0$ and
$$\sum_{i=1}^{k} \tilde{\mu}_i b_i \in \mathrm{span}(\{a_1, a_2, \cdots, a_m\}), \quad \sum_{i=1}^{k} \tilde{\mu}_i b_i = \sum_{i=1}^{m} \delta_i a_i \ \text{ with } \ \sum_{j=1}^{t} \delta_{i_j} > 0.$$
Indeed, if the existence of such $\tilde{\mu}$ is proved, we can apply the existence result again to $A = \{a_1, a_2, \cdots, a_m\}$ and $\tilde{B} = \{b_i \mid i \in [k], \tilde{\mu}_i \neq 0\}$ with the same $I$. The premises are all satisfied: from the definition of $\tilde{\mu}$ we know that
$$\sum_{i \in [k], \tilde{\mu}_i \neq 0} \tilde{\mu}_i b_i = \sum_{i=1}^{m} \delta_i a_i, \quad \sum_{j=1}^{t} \delta_{i_j} > 0,$$
and $\tilde{\mu}_i > 0$ if $\tilde{\mu}_i \neq 0$. Moreover, $|\tilde{B}| < |B|$. This means that if we prove the existence of such $\tilde{\mu}$ and iteratively apply the existence result as stated above, we arrive at a set $B^{\bullet} \subseteq B$ that satisfies
$$|B^{\bullet}| \le n - m + 1, \quad \sum_{b_i \in B^{\bullet}} \mu^{\bullet}_i b_i = \sum_{i=1}^{m} \delta^{\bullet}_i a_i, \quad \sum_{j=1}^{t} \delta^{\bullet}_{i_j} > 0, \quad \mu^{\bullet}_i > 0 \text{ if } b_i \in B^{\bullet}, \ \mu^{\bullet}_i = 0 \text{ otherwise}.$$
The reason why $|B^{\bullet}| \le n - m + 1$ is that if $|B^{\bullet}| > n - m + 1$, we can find a subset of $B^{\bullet}$ with strictly smaller cardinality and the same property, hence $B^{\bullet}$ is not the terminal set. For the terminal set, choosing $\mu^* = \mu^{\bullet}$ gives the wanted vector.
Now we show the existence of such $\tilde{\mu}$ when $k > n - m + 1$. We first extend $\{a_1, a_2, \cdots, a_m\}$ to a basis of $\mathbb{R}^n$, and denote it as $\{a_1, a_2, \cdots, a_n\}$. Express each $b_i$ as
$$b_i = \sum_{j=1}^{n} \gamma_{ij} a_j, \quad i \in [k].$$
Now, write $\Gamma \in \mathbb{R}^{n \times k}$ as $\Gamma_{ij} = \gamma_{ji}$. What we know is the relation
$$\Gamma \mu = \begin{bmatrix} \lambda \\ 0_{n-m} \end{bmatrix},$$
which is simply the coordinate representation of $\sum_{i=1}^{k} \mu_i b_i$. Now the set
$$\{\mu \in \mathbb{R}^k \mid 1^T \Gamma[i_1, i_2, \cdots, i_t]\mu = 0, \ \Gamma[m+1]^T \mu = 0, \ \cdots, \ \Gamma[n]^T \mu = 0\}$$
has dimension at least $k - n + m - 1$, as each linear constraint decreases the dimension by at most 1. Here $\Gamma[p_1, p_2, \cdots, p_r] \in \mathbb{R}^{r \times k}$ denotes the concatenation of the $r$ rows $\Gamma[p_1]$ to $\Gamma[p_r]$ of $\Gamma$. As $k - n + m - 1 > 0$, there exists a nonzero $\mu'$ that satisfies
$$\Gamma \mu' = \begin{bmatrix} \lambda' \\ 0_{n-m} \end{bmatrix}, \quad \sum_{j=1}^{t} \lambda'_{i_j} = 0.$$
For that $\mu'$, consider $\mu + \epsilon\mu'$ and either increase or decrease $\epsilon$ from 0 until the cardinality of $\mu + \epsilon\mu'$ decreases. As $\mu > 0$ and $\mu' \neq 0$, we can always find such $\epsilon$ that satisfies $\|\mu + \epsilon\mu'\|_0 < \|\mu\|_0$. As we stop when the cardinality first changes, $\mu + \epsilon\mu' \ge 0$ also holds. At last, we know that
$$\Gamma(\mu + \epsilon\mu') = \begin{bmatrix} \lambda + \epsilon\lambda' \\ 0_{n-m} \end{bmatrix},$$
and as $\sum_{j=1}^{t} \lambda'_{i_j} = 0$, we have $\sum_{j=1}^{t} (\lambda_{i_j} + \epsilon\lambda'_{i_j}) > 0$. As the entries of $\Gamma\mu$ are exactly the coordinates with respect to $\{a_1, a_2, \cdots, a_n\}$, we know that $\mu + \epsilon\mu'$ is the $\tilde{\mu}$ that we were looking for. This finishes the proof.
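The support-reduction step in the proof of Lemma D.1 can be sketched numerically. The following is a minimal illustration under stated assumptions, not the authors' implementation: the function name `reduce_support`, the tolerance `tol`, and the use of a QR factorization to complete $a_1, \cdots, a_m$ to a basis are all choices made here for illustration. It performs one step: it finds a direction in the constraint set described in the proof and moves along it until one coordinate of $\mu$ reaches zero.

```python
import numpy as np

def reduce_support(A, B, mu, idx_I, tol=1e-10):
    """One support-reduction step in the spirit of the proof of Lemma D.1.

    A: (n, m) array with linearly independent columns a_1..a_m.
    B: (n, k) array with linearly independent columns b_1..b_k.
    mu: (k,) nonnegative vector with B @ mu in span(A).
    idx_I: indices (into the columns of A) of the subset I.
    Returns a nonnegative vector with strictly smaller support when a
    feasible direction exists, otherwise a copy of mu.
    """
    n, m = A.shape
    supp = np.flatnonzero(mu > tol)
    # Complete a_1..a_m to a basis of R^n with an orthonormal complement.
    Q, _ = np.linalg.qr(A, mode="complete")
    basis = np.hstack([A, Q[:, m:]])
    # Column i of Gamma holds the coordinates of the i-th active b in that basis.
    Gamma = np.linalg.solve(basis, B[:, supp])
    # Constraints: coordinates on I sum to zero, coordinates m..n-1 vanish.
    C = np.vstack([Gamma[idx_I].sum(axis=0, keepdims=True), Gamma[m:]])
    _, sing, Vt = np.linalg.svd(C)
    rank = int(np.sum(sing > tol))
    if rank == supp.size:
        return mu.astype(float).copy()  # constraint set is {0}: nothing to do
    d = Vt[-1]                          # unit vector in the null space of C
    if not np.any(d < -tol):
        d = -d                          # make sure some coordinate decreases
    neg = d < -tol
    eps = np.min(mu[supp][neg] / -d[neg])  # first time a coordinate hits zero
    out = mu.astype(float).copy()
    out[supp] = mu[supp] + eps * d
    out[np.abs(out) < tol] = 0.0
    return out
```

Calling this function repeatedly until the support stops shrinking mirrors the iterative argument in the proof. The sum of the coordinates on $I$ is unchanged by each step, because the direction is chosen with $\sum_j \lambda'_{i_j} = 0$.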
The necessary width in this case is n + 1, and we obtain the following result. Theorem D.1. The set P * (n + 1) is connected.
Proof. Similar to the proof of Proposition D.5, we show that for any two A, B ∈ P * (n + 1) ∩ P * irr , they are connected with a continuous path in P * (n + 1). The rest will directly follow.

First, let's write $A = (u_i, v_i)_{i=1}^{P}$, $B = (u'_i, v'_i)_{i=1}^{P}$. Also, let's write
$$\mathcal{A} = \{D_i X \bar{u}_i\}_{u_i \neq 0} \cup \{-D_i X \bar{v}_i\}_{v_i \neq 0}, \quad \mathcal{B} = \{D_i X \bar{u}_i\}_{u'_i \neq 0} \cup \{-D_i X \bar{v}_i\}_{v'_i \neq 0},$$
and denote these sets as $\mathcal{A} = \{a_1, a_2, \cdots, a_m\} \subseteq \mathbb{R}^n$, $\mathcal{B} = \{b_1, b_2, \cdots, b_k\} \subseteq \mathbb{R}^n$. At last, let $\lambda_1, \lambda_2, \cdots, \lambda_m$, $\mu_1, \mu_2, \cdots, \mu_k$ be the unique nonnegative numbers that satisfy
$$\sum_{i=1}^{m} \lambda_i a_i = \sum_{i=1}^{k} \mu_i b_i = y^*.$$
The uniqueness follows from the fact that $A, B \in \mathcal{P}^*(n+1) \cap \mathcal{P}^*_{irr}$, and the nonnegativity follows from the optimal polytope characterization in Theorem 1. Furthermore, note that $\lambda_p > 0$ for all $p \in [m]$ and $\mu_q > 0$ for all $q \in [k]$. For example, if $a_p = D_i X \bar{u}_i$ for some $u_i \neq 0$, then $\lambda_p = \|u_i\|_2 > 0$. The rest is similar. Our main proof strategy will be finding $k + m$ continuous functions $F_1, F_2, \cdots, F_m, G_1, G_2, \cdots, G_k : [0, 1] \to \mathbb{R}$ that satisfy:
Property 1) $F_i(0) = \lambda_i$, $F_i(1) = 0$, $F_i(t) \ge 0$ $\forall i \in [m]$, $t \in [0, 1]$.
Property 2) $G_j(0) = 0$, $G_j(1) = \mu_j$, $G_j(t) \ge 0$ $\forall j \in [k]$, $t \in [0, 1]$.
Property 3) $\sum_{i=1}^{m} F_i(t) a_i + \sum_{j=1}^{k} G_j(t) b_j = y^*$ $\forall t \in [0, 1]$.
Property 4) $\sum_{i=1}^{m} 1(F_i(t) > 0) + \sum_{j=1}^{k} 1(G_j(t) > 0) \le n + 1$ $\forall t \in [0, 1]$.
First, let's see that if we find continuous functions that satisfy Property 1) to Property 4), we can construct a path from $A$ to $B$ in $\mathcal{P}^*(n+1)$. The specific path we construct is $(u_i(t), v_i(t))_{i=1}^{P}$, given as
$$u_i(t) = \begin{cases} (F_p(t) + G_q(t))\bar{u}_i & \text{if } a_p = b_q = D_i X \bar{u}_i, \\ F_p(t)\bar{u}_i & \text{if } a_p = D_i X \bar{u}_i, \ \nexists\, q \in [k] \text{ such that } b_q = D_i X \bar{u}_i, \\ G_q(t)\bar{u}_i & \text{if } b_q = D_i X \bar{u}_i, \ \nexists\, p \in [m] \text{ such that } a_p = D_i X \bar{u}_i, \\ 0 & \text{otherwise,} \end{cases}$$
$$v_i(t) = \begin{cases} (F_p(t) + G_q(t))\bar{v}_i & \text{if } a_p = b_q = -D_i X \bar{v}_i, \\ F_p(t)\bar{v}_i & \text{if } a_p = -D_i X \bar{v}_i, \ \nexists\, q \in [k] \text{ such that } b_q = -D_i X \bar{v}_i, \\ G_q(t)\bar{v}_i & \text{if } b_q = -D_i X \bar{v}_i, \ \nexists\, p \in [m] \text{ such that } a_p = -D_i X \bar{v}_i, \\ 0 & \text{otherwise.} \end{cases}$$
Let's check that $(u_i(t), v_i(t))_{i=1}^{P}$ is a path from $A$ to $B$ in $\mathcal{P}^*(n+1)$. As $F_i(0) = \lambda_i$, $F_i(1) = 0$ for all $i \in [m]$ and $G_j(0) = 0$, $G_j(1) = \mu_j$ for all $j \in [k]$, we can see that $(u_i(0), v_i(0))_{i=1}^{P} = A$ and $(u_i(1), v_i(1))_{i=1}^{P} = B$. Also, it is a curve in $\mathcal{P}^*$: all $u_i(t), v_i(t)$ are nonnegative multiples of $\bar{u}_i, \bar{v}_i$, and we know that
$$\sum_{i=1}^{P} D_i X(u_i(t) - v_i(t)) = \sum_{i=1}^{m} F_i(t) a_i + \sum_{j=1}^{k} G_j(t) b_j = y^*.$$
Moreover, the cardinality of $(u_i(t), v_i(t))_{i=1}^{P}$ is bounded by
$$\mathrm{card}((u_i(t), v_i(t))_{i=1}^{P}) \le \sum_{i=1}^{m} 1(F_i(t) > 0) + \sum_{j=1}^{k} 1(G_j(t) > 0) \le n + 1,$$
hence the proposed path is a continuous path in $\mathcal{P}^*(n+1)$. Now, we describe how we find such $m + k$ continuous functions. We do:
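Step 0) Initialize $\mathcal{C} = \mathcal{A}$, $f_i(0) = \lambda_i$, $g_j(0) = 0$.
Step 1) For $T = 0, 1, \cdots$, repeat:
• If $\mathcal{C} \subseteq \mathcal{B}$, break.
• (Facts that hold from the previous iteration) Let's write $\mathcal{C} = \{a_{i_1}, a_{i_2}, \cdots, a_{i_r}\} \cup \{b_{j_1}, b_{j_2}, \cdots, b_{j_s}\}$. We inductively have: 1) $\mathcal{C}$ is a linearly independent set. 2) $f_i(T) \ge 0$ $\forall i \in [m]$, $g_j(T) \ge 0$ $\forall j \in [k]$. 3) $f_i(T) > 0 \Leftrightarrow i \in \{i_1, i_2, \cdots, i_r\}$, $g_j(T) > 0 \Leftrightarrow j \in \{j_1, j_2, \cdots, j_s\}$. 4) $\sum_{i=1}^{m} f_i(T) a_i + \sum_{j=1}^{k} g_j(T) b_j = y^*$.
• (Update 1) Now we update the values of $f_i, g_j$ as the following, for $t \in [T, T + 1/2]$:
$$f_i(t) = \begin{cases} f_i(T) = 0 & \text{if } i \notin \{i_1, i_2, \cdots, i_r\}, \\ f_i(T) - \alpha\tilde{\lambda}_i(t - T) & \text{if } i \in \{i_1, i_2, \cdots, i_r\}, \end{cases}$$
$$g_i(t) = \begin{cases} g_i(T) + \alpha\mu^*_i(t - T) = \alpha\mu^*_i(t - T) & \text{if } i \notin \{j_1, j_2, \cdots, j_s\}, \\ g_i(T) + \alpha\mu^*_i(t - T) - \alpha\tilde{\mu}_i(t - T) & \text{if } i \in \{j_1, j_2, \cdots, j_s\}. \end{cases}$$
Here, $\alpha = 2\min\{\min_{\tilde{\lambda}_{i_w} > 0} f_{i_w}(T)/\tilde{\lambda}_{i_w}, \ \min_{\tilde{\mu}_{j_w} > \mu^*_{j_w}} g_{j_w}(T)/(\tilde{\mu}_{j_w} - \mu^*_{j_w})\} > 0$. Update $\mathcal{C}$ so that $f_i(T + 1/2) > 0 \Leftrightarrow a_i \in \mathcal{C}$, $g_i(T + 1/2) > 0 \Leftrightarrow b_i \in \mathcal{C}$.
• (Update 2: Pruning) After this, we initialize $r = 0$, $s_i(0) = f_i(T + 1/2)$ for $i \in [m]$, $z_j(0) = g_j(T + 1/2)$ for $j \in [k]$, and repeat:
(Check) If $\mathcal{C}$ is linearly independent, break.
(Update) Say $\mathcal{C} = \{a_{r_1}, a_{r_2}, \cdots, a_{r_x}\} \cup \{b_{s_1}, b_{s_2}, \cdots, b_{s_y}\}$, and take a nontrivial linear combination $\sum_{w=1}^{x} \eta_w a_{r_w} + \sum_{w=1}^{y} \eta'_w b_{s_w} = 0$ with $\sum_{w=1}^{x} \eta_w \ge 0$. Set $s_{r_w}(r + t) = s_{r_w}(r) - \alpha\eta_w t$, $z_{s_w}(r + t) = z_{s_w}(r) - \alpha\eta'_w t$ for $t \in [0, 1]$. Here $\alpha = \min\{\min_{\eta_w > 0} s_{r_w}(r)/\eta_w, \ \min_{\eta'_w > 0} z_{s_w}(r)/\eta'_w\}$. At last, update $\mathcal{C}$ so that $s_i(r + 1) > 0 \Leftrightarrow a_i \in \mathcal{C}$, $z_i(r + 1) > 0 \Leftrightarrow b_i \in \mathcal{C}$. Increase $r$ by 1.
• (Construct $f_i, g_j$ for $t \in [T + 1/2, T + 1]$) Concatenate $f_i$ and $s_i$, $g_j$ and $z_j$ for all $i \in [m]$, $j \in [k]$.
Step 2) Let the termination time be $T^*$. To obtain $F_i, G_j : [0, 1] \to \mathbb{R}$, simply write $F_i(t) = f_i(T^* t)$, $G_j(t) = g_j(T^* t)$.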
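The pruning loop of (Update 2) is a Carathéodory-style reduction, and a small numerical sketch may help make it concrete. The version below is an illustration under assumptions made here: the name `prune_to_independent`, the tolerance `tol`, and the use of an SVD to obtain a dependence vector are not from the paper. It also omits the sign normalization on the coefficients of the $a$-part that the termination argument of the outer loop relies on; it only shows how the coefficients stay nonnegative and the represented vector stays fixed while the active set shrinks.

```python
import numpy as np

def prune_to_independent(V, c, tol=1e-10):
    """Shrink the support of c >= 0 until the active columns of V are independent.

    V: (n, p) array whose columns play the role of the vectors in C.
    c: (p,) nonnegative coefficients; V @ c is kept fixed throughout.
    """
    c = np.asarray(c, dtype=float).copy()
    while True:
        supp = np.flatnonzero(c > tol)
        Vs = V[:, supp]
        # (Check) stop once the active columns are linearly independent.
        if supp.size == 0 or np.linalg.matrix_rank(Vs) == supp.size:
            return c
        # (Update) eta is a nontrivial dependence: Vs @ eta = 0.
        eta = np.linalg.svd(Vs)[2][-1]
        if not np.any(eta > tol):
            eta = -eta
        pos = eta > tol
        # Largest step that keeps every coefficient nonnegative.
        alpha = np.min(c[supp][pos] / eta[pos])
        c[supp] = c[supp] - alpha * eta
        c[c < tol] = 0.0
```

Each pass sets at least one coefficient to zero, so the loop ends after at most as many passes as there are active columns.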
Let's first verify the facts that hold from the previous iteration. First, $\mathcal{C}$ is a linearly independent set at the start of each iteration because of step (Update 2: Pruning), so the first fact holds. Also, $f_i, g_j$ are updated only in steps (Update 1) and (Update 2: Pruning), and we can see that for all $t \in [0, T^*]$ the function values are nonnegative: in Update 1, we chose $\alpha$ sufficiently small and $\mu^* \ge 0$, and in Update 2, we also chose $\alpha$ sufficiently small. Hence, the second fact holds. At every update, we also update $\mathcal{C}$, so the third fact holds. At last, we add a nontrivial linear combination of $\mathcal{A} \cup \mathcal{B}$ that sums up to 0 at each update, so the sum
$$\sum_{i=1}^{m} f_i(t) a_i + \sum_{j=1}^{k} g_j(t) b_j$$
is preserved to be $y^*$, which means that the last fact follows.
One important argument to make is that the algorithm actually terminates, i.e. $T^* < \infty$. The iteration in the second update terminates eventually, because at each iteration the cardinality of $\mathcal{C}$ decreases by 1. To see that the outer loop terminates, observe that
$$\sum_{i=1}^{m} f_i(t)$$
is strictly decreasing along $t \in \mathbb{N}$. This is because in (Update 1), as $\sum_{w=1}^{r} \tilde{\lambda}_{i_w} > 0$, the sum decreases, and in (Update 2), the sum does not increase because we suppose $\sum_{w=1}^{x} \eta_w \ge 0$. This means that an identical $\mathcal{C}$ cannot appear twice at the start of an iteration: as $\mathcal{C}$ is linearly independent, there exists a unique expression that gives $y^*$ as a linear combination of the elements of $\mathcal{C}$, so the same $\mathcal{C}$ would give the same value of the sum. As $\mathcal{A} \cup \mathcal{B}$ has finitely many subsets, the outer loop terminates.
Finally, let's check that we have found the right $F_i, G_j$. As the algorithm terminated, $\mathcal{C} \subseteq \mathcal{B}$, and we know that $F_i(1) = 0$ for all $i \in [m]$. Also, as
$$\sum_{j=1}^{k} G_j(1) b_j = y^*,$$
and the set $\mathcal{B}$ is linearly independent, we know that $G_j(1) = \mu_j$ for all $j \in [k]$. As previously mentioned, Property 3) is guaranteed as we are adding a nontrivial linear combination that sums up to 0 at each update. To see that Property 4) is true, we track the value of
$$\sum_{i=1}^{m} 1(F_i(t) > 0) + \sum_{j=1}^{k} 1(G_j(t) > 0)$$
through each update. In (Update 1), as $\|\mu^*\|_0 \le n + 1 - t - s$, the total cardinality does not exceed $n + 1$. In (Update 2), the cardinality always decreases. Hence, $\sum_{i=1}^{m} 1(F_i(t) > 0) + \sum_{j=1}^{k} 1(G_j(t) > 0) \le n + 1$ for all $t \in [0, 1]$, and we have found the wanted functions. This finishes the proof.
Recall that Nguyen (2021) proves that the solution set is connected for m = n + 1 in the unregularized case. The proof strategy of Nguyen (2021) is to first create a zero entry in the second layer and then change the corresponding first layer weight arbitrarily. If the network is unregularized this is possible, because changing a first layer weight whose second layer weight is 0 does not change the model fit, hence the optimality. However, when we have regularization, such a transformation is not possible, as the first and second layer weights are tied together. This is why we need to use the characterization in Theorem 1 and Lemma D.1 to prove Theorem D.1. Overall, our result is a nontrivial extension of Nguyen (2021) to regularized networks.
At last, from Proposition D.6, we know that $\mathcal{P}^*(m)$ is connected when $m \ge \min\{n + 1, m^* + M^*\}$.
Proposition D.6. Suppose $\mathcal{P}^*(m')$ is connected and $m' \ge M^*$. Then $\mathcal{P}^*(m)$ is connected for all $m \ge m'$.
Proof. Take two points $A, B$ from $\mathcal{P}^*(m)$. We know that there exist paths from $A$ to $A_{irr}$ and from $B$ to $B_{irr}$ that satisfy $A_{irr}, B_{irr} \in \mathcal{P}^*(m) \cap \mathcal{P}^*_{irr}$. Notice that as $A_{irr}, B_{irr} \in \mathcal{P}^*_{irr}$, their cardinality is at most $M^*$. Hence they are elements of $\mathcal{P}^*(m')$, which finishes the proof as $\mathcal{P}^*(m')$ is connected and $\mathcal{P}^*(m') \subseteq \mathcal{P}^*(m)$.
Now we connect the connectivity results of $\mathcal{P}^*(m)$ to that of $\Theta^*(m)$, the solution set of the original problem in equation 3. The object $\Theta^*(m)$ we care about is precisely
$$\Theta^*(m) := \arg\min_{(w_i, \alpha_i)_{i=1}^{m}} \ L\Big(\sum_{i=1}^{m} (Xw_i)_+ \alpha_i, \ y\Big) + \frac{\beta}{2} \sum_{i=1}^{m} \big(\|w_i\|_2^2 + |\alpha_i|^2\big) \ \subseteq \mathbb{R}^{(d+1)m}.$$
We first define essential sets and mappings to do this.
Definition D.5. (Minimal Optimal Neural Networks) We say a parameter (w j , α j ) m j=1 is minimal optimal if (w j , α j ) m j=1 ∈ Θ * (m) and α p α q > 0 implies 1(Xw p ≥ 0) ̸ = 1(Xw q ≥ 0) for all p ̸ = q ∈ [m]. We denote the set of minimal optimal neural networks as Θ * min (m).
Adapting the proof from Wang et al. (2021b), we can show that for any point A ∈ Θ * (m), there is a continuous path from A to a point in Θ * min (m). The path essentially merges the neurons with the same arrangement pattern and second layer sign.
Proposition D.7. Take any point A ∈ Θ * (m). There is a continuous path in Θ * (m) from A to some point A min ∈ Θ * min (m).
Proof. Write $A = (w_j, \alpha_j)_{j=1}^{m}$. Assume we have $(w_1, \alpha_1)$ and $(w_2, \alpha_2)$ that satisfy $\alpha_1\alpha_2 > 0$ and $1(Xw_1 \ge 0) = 1(Xw_2 \ge 0)$. Let's write $\mathrm{sign}(\alpha_1) = \mathrm{sign}(\alpha_2) = s$. Define the curve
$$C(t) = \Big( \frac{w_1\alpha_1 + t w_2\alpha_2}{\sqrt{\|w_1\alpha_1 + t w_2\alpha_2\|_2}}, \ \sqrt{\|w_1\alpha_1 + t w_2\alpha_2\|_2}\, s \Big) \oplus \Big( \frac{\sqrt{1 - t}\, w_2\alpha_2}{\sqrt{\|w_2\alpha_2\|_2}}, \ \sqrt{(1 - t)\|w_2\alpha_2\|_2}\, s \Big) \oplus (w_j, \alpha_j)_{j=3}^{m},$$
where $t \in [0, 1]$. Intuitively, this curve merges the two pairs $(w_1, \alpha_1)$ and $(w_2, \alpha_2)$.
Let's check some basic facts to see that this curve indeed merges the two pairs and is in $\Theta^*(m)$.
i) $C(t)$ is well-defined. First, we know $\|w_2\alpha_2\|_2 \neq 0$, because $\alpha_2 \neq 0$. Also, say $w_1\alpha_1 + t w_2\alpha_2 = 0$ for some $t \in [0, 1]$. Then we have $DXw_1\alpha_1 = -t DXw_2\alpha_2$, where $D = \mathrm{diag}(1(Xw_1 \ge 0))$. As $DXw_1$ and $DXw_2$ consist of nonnegative entries and $\alpha_1\alpha_2 > 0$, $DXw_1\alpha_1 = 0$ must hold. This means $w_1 = \alpha_1 = 0$ because $A \in \Theta^*(m)$, which contradicts $\alpha_1 \neq 0$. The well-definedness of $C(t)$ implies that it is continuous, because it is a composition of continuous functions.
ii) $C(0) = A$ and $C(1) = \Big( \frac{w_1\alpha_1 + w_2\alpha_2}{\sqrt{\|w_1\alpha_1 + w_2\alpha_2\|_2}}, \ \sqrt{\|w_1\alpha_1 + w_2\alpha_2\|_2}\, s \Big) \oplus (0, 0) \oplus (w_j, \alpha_j)_{j=3}^{m}$ from direct substitution. Note that the value $\sum_{i=1}^{m} 1(\alpha_i \neq 0)$ decreased by 1.
iii) $C(t)$ is a curve in $\Theta^*(m)$. This is because the sum $\sum_{i=1}^{m} (Xw_i)_+ \alpha_i$ is preserved along the curve, and the regularization loss is no larger than that of $A$ due to the triangle inequality. In other words, the loss satisfies $L(C(t)) \le L(C(0))$ for all $t \in [0, 1]$, and as $L(C(0))$ is optimal, $C(t)$ is a curve in $\Theta^*(m)$.
We repeat the merging process until there is no such pair. This process terminates because each merging decreases $\sum_{i=1}^{m} 1(\alpha_i \neq 0)$ by 1. Once no such pair remains, we concatenate all the curves to obtain a curve in $\Theta^*(m)$. At the end of the path, there are no two pairs $(w_i, \alpha_i), (w_j, \alpha_j)$ that satisfy $\alpha_i\alpha_j > 0$ and $1(Xw_i \ge 0) = 1(Xw_j \ge 0)$, hence the endpoint is in $\Theta^*_{min}(m)$.
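The endpoint of one merging step can be checked numerically. The sketch below is an illustration under assumptions made here, not the paper's code: the names `relu_out` and `merge_pair` are hypothetical, and the merged direction is written with $|\alpha_j|$ so that the same function also covers the case $s = -1$ (for $s = 1$ it coincides with the merged neuron in $C(1)$). It verifies the two properties used in the proof: the network output is unchanged, and the regularization term does not increase.

```python
import numpy as np

def relu_out(X, w, a):
    """Output of a single neuron: (Xw)_+ * a."""
    return np.maximum(X @ w, 0.0) * a

def merge_pair(w1, a1, w2, a2):
    """Merge two neurons that share an activation pattern and the sign of a.

    Assumes the merged direction is nonzero, as argued in the proof.
    """
    s = np.sign(a1)
    v = w1 * abs(a1) + w2 * abs(a2)   # merged first-layer direction
    scale = np.sqrt(np.linalg.norm(v))
    return v / scale, s * scale       # balanced neuron: ||w||_2 == |a|

# Usage: both neurons have pattern (1, 1, 0) on X and negative second-layer sign.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w1, a1 = np.array([1.0, 2.0]), -0.5
w2, a2 = np.array([2.0, 1.0]), -2.0
w, a = merge_pair(w1, a1, w2, a2)
reg_before = 0.5 * (w1 @ w1 + a1**2 + w2 @ w2 + a2**2)
reg_after = 0.5 * (w @ w + a**2)
```

The inequality `reg_after <= reg_before` combines the triangle inequality for the merged direction with the bound $\|w\|_2 |\alpha| \le \frac{1}{2}(\|w\|_2^2 + \alpha^2)$ for each original neuron.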
Also, we define the notion of a canonical polytope. A canonical polytope is defined to break ties that occur because two cones K i and K j may have nonempty intersections. For instance, say D 1 Xu 1 = D 2 Xu 1 and u 2 = 0 for some (u i , v i ) P i=1 ∈ P * . Then, swapping (u 1 , v 1 ) and (u 2 , v 2 ) will not change the solution's optimality. As we will see later on, we will want to erase such ambiguity, hence we consider a canonical polytope. Definition D.6. (Canonical Polytope) The canonical polytope is defined as
$$\mathcal{P}^*_{can} = \Big\{ (u_i, v_i)_{i=1}^{P} \ \Big| \ (u_i, v_i)_{i=1}^{P} \in \mathcal{P}^*, \ \mathrm{diag}(1(Xu_i \ge 0)) = D_i \text{ if } u_i \neq 0, \ \mathrm{diag}(1(Xv_i \ge 0)) = D_i \text{ if } v_i \neq 0 \Big\}.$$
Remark D.2. $\mathrm{diag}(1(Xu \ge 0)) = D_i$ implies $(2D_i - I)Xu \ge 0$, but not the converse. The ambiguity happens because $x_j \cdot u$ might be 0 for some rows $x_j$ of $X$.
Given the notion of the minimal optimal neural network and the canonical polytope, we define two natural mappings $\Psi : \mathcal{P}^*(m) \to \Theta^*(m)$ and $\Phi : \Theta^*(m) \to \mathcal{P}^*(m)$. These mappings have been discussed multiple times in the literature Pilanci & Ergen (2020), Wang et al. (2021b), and we introduce them again with slight variations for our needs.
Definition D.7. Suppose $m \ge m^*$. We define $\Psi : \mathcal{P}^*(m) \to \Theta^*(m)$ as
$$\Psi((u_i, v_i)_{i=1}^{P}) := \Big( \frac{u_i}{\sqrt{\|u_i\|_2}}, \ \sqrt{\|u_i\|_2} \Big)_{u_i \neq 0} \oplus \Big( \frac{v_i}{\sqrt{\|v_i\|_2}}, \ -\sqrt{\|v_i\|_2} \Big)_{v_i \neq 0} \oplus (0, 0)^{m - \mathrm{card}((u_i, v_i)_{i=1}^{P})}.$$
Definition D.8. Suppose $m \ge m^*$. We define $\Phi : \Theta^*(m) \to \mathcal{P}^*(m)$ as $\Phi((w_i, \alpha_i)_{i=1}^{m}) = (u_i, v_i)_{i=1}^{P}$, where
$$u_p = \sum_{i \in I} w_i|\alpha_i| \ \text{ with } I = \{i \mid \alpha_i > 0, \ D_p = \mathrm{diag}(1(Xw_i \ge 0))\},$$
$$v_q = \sum_{i \in I} w_i|\alpha_i| \ \text{ with } I = \{i \mid \alpha_i < 0, \ D_q = \mathrm{diag}(1(Xw_i \ge 0))\}.$$
The mappings are indeed well-defined (Proposition D.8).
Proposition D.8. Suppose $m \ge m^*$. $\Psi : \mathcal{P}^*(m) \to \Theta^*(m)$ and $\Phi : \Theta^*(m) \to \mathcal{P}^*(m)$ are well defined.
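The two mappings can be sketched in a few lines. The code below is a minimal illustration under assumptions made here: the names `pattern`, `psi`, `phi` are hypothetical, a solution of the convex problem is stored as a dictionary from activation patterns (standing in for the matrices $D_i$) to pairs $(u_i, v_i)$, patterns whose pair is $(0, 0)$ are simply left out, and the square-root balancing $\|w\|_2 = |\alpha|$ is assumed for $\Psi$. Optimality is not checked; only the algebra of the maps is shown.

```python
import numpy as np

def pattern(X, w):
    """Activation pattern 1(Xw >= 0), a hashable stand-in for D_i."""
    return tuple((X @ w >= 0).astype(int))

def psi(sol, m):
    """Map {pattern: (u, v)} to m neurons: u-neurons, v-neurons, then zeros.

    Assumes the number of nonzero u_i, v_i is at most m.
    """
    d = len(next(iter(sol.values()))[0])
    W, alpha = [], []
    for sign, idx in ((1.0, 0), (-1.0, 1)):
        for pair in sol.values():
            z = pair[idx]
            nz = np.linalg.norm(z)
            if nz > 0:
                W.append(z / np.sqrt(nz))         # first-layer weight
                alpha.append(sign * np.sqrt(nz))  # second-layer weight
    while len(W) < m:                             # pad with empty neurons
        W.append(np.zeros(d))
        alpha.append(0.0)
    return W, alpha

def phi(X, W, alpha):
    """Map neurons back to {pattern: (u, v)} by summing w * |alpha| per pattern."""
    d = X.shape[1]
    sol = {}
    for w, a in zip(W, alpha):
        if a == 0:
            continue
        u, v = sol.setdefault(pattern(X, w), (np.zeros(d), np.zeros(d)))
        if a > 0:
            u += w * abs(a)   # in-place update of the stored array
        else:
            v += w * abs(a)
    return sol
```

For an input whose nonzero $u_i, v_i$ carry their own pattern (the canonical case), `phi(X, *psi(sol, m))` returns the same dictionary, which is the first statement of Proposition D.9.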
Proof. By well-defined, we mean that for all $A \in \mathcal{P}^*(m)$, $\Psi(A)$ is uniquely determined and lies in $\Theta^*(m)$, and similarly for $\Phi$. From Definition D.7 and Definition D.8, it is not hard to see that the function value is uniquely determined for each input. Also, from direct calculation, we can see that
$$L\Big(\sum_{j=1}^{m} (Xw_j)_+ \alpha_j, \ y\Big) + \frac{\beta}{2} \sum_{j=1}^{m} \big(\|w_j\|_2^2 + \alpha_j^2\big) = L\Big(\sum_{i=1}^{P} D_i X(u_i - v_i), \ y\Big) + \beta \sum_{i=1}^{P} \big(\|u_i\|_2 + \|v_i\|_2\big),$$
both when $A = (u_i, v_i)_{i=1}^{P}$ and $\Psi(A) = (w_j, \alpha_j)_{j=1}^{m}$, and when $A = (w_j, \alpha_j)_{j=1}^{m}$ and $\Phi(A) = (u_i, v_i)_{i=1}^{P}$. The former case is rather clear. To see the latter case, first observe that when we apply the merging operation in Proposition D.7, the loss strictly decreases if two neurons $w_i, w_j$ with the same arrangement pattern and second layer sign are not parallel. So they are actually parallel, and for $I = \{i \mid \alpha_i > 0, \ D_p = \mathrm{diag}(1(Xw_i \ge 0))\}$,
$$\|u_p\|_2 = \sum_{i \in I} \|w_i\|_2 |\alpha_i| = \frac{1}{2} \sum_{i \in I} \big(\|w_i\|_2^2 + \alpha_i^2\big),$$
where the first equality holds because all $w_i$ are parallel for $i \in I$. The last equality follows from $A \in \Theta^*(m)$. As $m \ge m^*$, the optimization problems in equation 3 and equation 4 have the same optimal values, which means $\Psi(A) \in \Theta^*(m)$ and $\Phi(A) \in \mathcal{P}^*(m)$.
Moreover, we can see that the two mappings are close to being inverses of each other.
Proposition D.9. Take any $A \in \mathcal{P}^*_{can} \cap \mathcal{P}^*(m)$. Then, $\Phi(\Psi(A)) = A$. Also, take any $B = (w_j, \alpha_j)_{j=1}^{m} \in \Theta^*_{min}(m)$. Then, $\Psi(\Phi(B)) = (w_{\sigma(j)}, \alpha_{\sigma(j)})_{j=1}^{m}$ is a permutation of $B$.
Proof. We know
$$\Psi((u_i, v_i)_{i=1}^{P}) := \Big( \frac{u_i}{\sqrt{\|u_i\|_2}}, \ \sqrt{\|u_i\|_2} \Big)_{u_i \neq 0} \oplus \Big( \frac{v_i}{\sqrt{\|v_i\|_2}}, \ -\sqrt{\|v_i\|_2} \Big)_{v_i \neq 0} \oplus (0, 0)^{m - \mathrm{card}((u_i, v_i)_{i=1}^{P})}.$$
Write $\Phi(\Psi((u_i, v_i)_{i=1}^{P})) = (u'_i, v'_i)_{i=1}^{P}$. Let's see that $u'_i = u_i$ for all $i \in [P]$. The case of $v$ follows similarly. The first case is when $u_i = 0$. Say there exists $u_j \neq 0$ with $\mathrm{diag}(1(Xu_j \ge 0)) = D_i$. As $(u_i, v_i)_{i=1}^{P} \in \mathcal{P}^*_{can}$, $\mathrm{diag}(1(Xu_j \ge 0)) = D_j = D_i$, meaning $i = j$. This is a contradiction because $u_i = 0$. This means there is no $u_j \neq 0$ with $\mathrm{diag}(1(Xu_j \ge 0)) = D_i$, hence no $u_j \neq 0$ with $\mathrm{diag}(1(Xu_j/\sqrt{\|u_j\|_2} \ge 0)) = D_i$, meaning $u'_i = 0$. The next case is when $u_i \neq 0$. For $u_j \neq 0$ such that $\mathrm{diag}(1(Xu_j \ge 0)) = D_i$, the only possibility is $j = i$. For that $j$, we know that $\mathrm{diag}(1(Xu_i/\sqrt{\|u_i\|_2} \ge 0)) = D_i$, and $u'_i = u_i/\sqrt{\|u_i\|_2} \times \sqrt{\|u_i\|_2} = u_i$. This means $u'_i = u_i$ for all $i \in [P]$, and the same holds for $v$, meaning $\Phi(\Psi((u_i, v_i)_{i=1}^{P})) = (u_i, v_i)_{i=1}^{P}$.
Let's see $\Psi \circ \Phi$. We know $\Phi((w_i, \alpha_i)_{i=1}^{m}) = (u_i, v_i)_{i=1}^{P}$ with
$$u_p = \begin{cases} w_i|\alpha_i| & \text{if } \alpha_i > 0 \text{ and } D_p = \mathrm{diag}(1(Xw_i \ge 0)), \\ 0 & \text{otherwise,} \end{cases} \qquad v_q = \begin{cases} w_i|\alpha_i| & \text{if } \alpha_i < 0 \text{ and } D_q = \mathrm{diag}(1(Xw_i \ge 0)), \\ 0 & \text{otherwise,} \end{cases}$$
because $(w_i, \alpha_i)_{i=1}^{m}$ is minimal. Let's say $\sum_{i=1}^{m} 1(\alpha_i > 0) = m_p$, $\sum_{i=1}^{m} 1(\alpha_i = 0) = m_z$, $\sum_{i=1}^{m} 1(\alpha_i < 0) = m_n$. In $\{u_1, u_2, \cdots, u_P\}$, there are $m_p$ nonzero vectors. Index them as $u_{a_1}, u_{a_2}, \cdots, u_{a_{m_p}}$. For $u_{a_i}$, we can find $j_i \in [m]$ that satisfies $u_{a_i} = w_{j_i}|\alpha_{j_i}|$. Furthermore, $j_{i_1} \neq j_{i_2}$ if $i_1 \neq i_2$, because $j_{i_1} = j_{i_2}$ means $D_{a_{i_1}} = D_{a_{i_2}}$, hence $a_{i_1} = a_{i_2}$ and $i_1 = i_2$. Similarly, define $v_{b_1}, v_{b_2}, \cdots, v_{b_{m_n}}$ and $v_{b_i} = w_{k_i}|\alpha_{k_i}|$. Then,
$$\Psi(\Phi((w_i, \alpha_i)_{i=1}^{m})) = \Big( \frac{w_{j_i}|\alpha_{j_i}|}{\sqrt{\|w_{j_i}|\alpha_{j_i}|\|_2}}, \ \sqrt{\|w_{j_i}|\alpha_{j_i}|\|_2} \Big)_{i=1}^{m_p} \oplus \Big( \frac{w_{k_i}|\alpha_{k_i}|}{\sqrt{\|w_{k_i}|\alpha_{k_i}|\|_2}}, \ -\sqrt{\|w_{k_i}|\alpha_{k_i}|\|_2} \Big)_{i=1}^{m_n} \oplus (0, 0)^{m_z}.$$
First, we know that $\|w_j\|_2 = |\alpha_j|$ for all $j \in [m]$. This leads to
$$\Psi(\Phi((w_i, \alpha_i)_{i=1}^{m})) = (w_{j_i}, |\alpha_{j_i}|)_{i=1}^{m_p} \oplus (w_{k_i}, -|\alpha_{k_i}|)_{i=1}^{m_n} \oplus (0, 0)^{m_z}.$$
As $j_{i_1} \neq j_{i_2}$ if $i_1 \neq i_2$, and likewise for the $k_i$, the result is a permutation of $(w_i, \alpha_i)_{i=1}^{m}$.
At last, for these mappings to be meaningful, we would want them to be continuous. Luckily, $\Phi$ is continuous (Proposition D.10).
Proposition D.10. The map $\Phi : \Theta^*(m) \to \mathcal{P}^*(m)$ is continuous.
Proof. We consider a sequence $(w^k_j, \alpha^k_j)_{j=1}^{m}$ in $\Theta^*(m)$ that converges to $(w^\infty_j, \alpha^\infty_j)_{j=1}^{m} \in \Theta^*(m)$. Let's write $\Phi((w^k_j, \alpha^k_j)_{j=1}^{m}) = (u^k_i, v^k_i)_{i=1}^{P}$ and $\Phi((w^\infty_j, \alpha^\infty_j)_{j=1}^{m}) = (u^\infty_i, v^\infty_i)_{i=1}^{P}$. We will show that $u^k_i \to u^\infty_i$. The rest will follow. As a starting point, we define some necessary constants. For $j \in [m]$ with $w^\infty_j \neq 0$, we define $M_j$ as a number such that
$$k \ge M_j \ \Rightarrow \ 1(Xw^\infty_j \ge 0) = 1(Xw^k_j \ge 0) \ \text{ and } \ \alpha^k_j \alpha^\infty_j > 0.$$
Such $M_j$ exists due to the following reasoning: we know that for any solution $A \in \Theta^*(m)$, there exists a finite set of possible directions for $w_j$, which are the directions of $\bar{u}_i, \bar{v}_i$ in $\mathcal{P}^*$. As $w^\infty_j \neq 0$ and $w^k_j \to w^\infty_j$, for sufficiently large $k$ so that $\|w^k_j - w^\infty_j\|_2$ is sufficiently small, $w^\infty_j$ has to be a positive scaling of $w^k_j$. Also, $w^\infty_j \neq 0$ implies $\alpha^\infty_j \neq 0$, meaning that for sufficiently large $k$, $\alpha^k_j \alpha^\infty_j > 0$ holds. For $j \in [m]$ with $w^\infty_j = 0$, define $N_j(\epsilon)$ to be a number such that $k \ge N_j(\epsilon)$ implies $\|w^k_j \alpha^k_j\|_2 \le \epsilon$.
Now we prove that $u^k_i \to u^\infty_i$ for all $i \in [P]$. For a certain $i \in [P]$, suppose there exists $\{j_1, j_2, \cdots, j_t\} \subseteq [m]$ that satisfies $D_i = \mathrm{diag}(1(Xw^\infty_{j_1} \ge 0)) = \cdots = \mathrm{diag}(1(Xw^\infty_{j_t} \ge 0))$ and $\alpha^\infty_{j_1}, \cdots, \alpha^\infty_{j_t} > 0$ (hence $w^\infty_{j_1}, \cdots, w^\infty_{j_t} \neq 0$). It is clear that $u^\infty_i = \sum_{\ell=1}^{t} w^\infty_{j_\ell} \alpha^\infty_{j_\ell}$. When $k \ge \max\{\max_{w^\infty_j = 0} N_j(\epsilon/m), \ \max_{w^\infty_j \neq 0} M_j\}$, we know that $1(Xw^k_{j_\ell} \ge 0) = 1(Xw^\infty_{j_\ell} \ge 0)$ and $\alpha^k_{j_\ell} > 0$ for $\ell \in [t]$. Also, for any $j \in [m]$ that is not in $\{j_1, j_2, \cdots, j_t\}$ and has $D_i = \mathrm{diag}(1(Xw^k_j \ge 0))$ and $\alpha^k_j > 0$, we have $w^\infty_j = 0$. Hence,
$$u^k_i = \sum_{\ell=1}^{t} w^k_{j_\ell} \alpha^k_{j_\ell} + \sum_{j : \, w^\infty_j = 0, \ D_i = \mathrm{diag}(1(Xw^k_j \ge 0)), \ \alpha^k_j > 0} w^k_j \alpha^k_j.$$
We have $u^k_i \to u^\infty_i$, as $w^k_{j_\ell} \to w^\infty_{j_\ell}$ and $\alpha^k_{j_\ell} \to \alpha^\infty_{j_\ell}$ for $\ell \in [t]$, and the norm of the remaining sum is at most $\epsilon$, hence it converges to 0 as $k \to \infty$.
Finally, let's see the case where there is no $j \in [m]$ that satisfies $D_i = \mathrm{diag}(1(Xw^\infty_j \ge 0))$ and $\alpha^\infty_j > 0$. Here, $u^\infty_i = 0$. Now take $k \ge \max\{\max_{w^\infty_j = 0} N_j(\epsilon/m), \ \max_{w^\infty_j \neq 0} M_j\}$. One thing to notice is that for this $k$, if $D_i = \mathrm{diag}(1(Xw^k_j \ge 0))$ and $\alpha^k_j > 0$ for some $j \in [m]$, then $w^\infty_j = 0$. Indeed, suppose $w^\infty_j \neq 0$. As $k \ge M_j$, we know that $D_i = \mathrm{diag}(1(Xw^\infty_j \ge 0))$ and $\alpha^\infty_j > 0$, which contradicts the assumption that there is no such $j$. Hence, when we write
$$u^k_i = \sum_{j : \, w^\infty_j = 0, \ D_i = \mathrm{diag}(1(Xw^k_j \ge 0)), \ \alpha^k_j > 0} w^k_j \alpha^k_j,$$
as $k \ge N_j(\epsilon/m)$, we get $\|u^k_i\|_2 \le \epsilon$. As $u^\infty_i = 0$, we have that $u^k_i \to u^\infty_i$. This finishes the proof.
However, $\Psi$ may not be continuous. The intuition is that the solutions in the image of $\Psi$ have zeros at the end, whereas the limit of $\Psi((u_i, v_i)_{i=1}^{P})$ as $u_i \to 0$ may have zeros in the middle. Thus we have a slightly weaker notion of continuity for $\Psi$ (Proposition D.11).
Proposition D.11. Consider a continuous path $(u_i(t), v_i(t))_{i=1}^{P}$ in $\mathcal{P}^*(m)$, where each $u_i(t), v_i(t) : [0, 1] \to \mathbb{R}^d$ is either the zero map or a map that can vanish only at $t = 1$. Consider the path $\phi(t) = \Psi((u_i(t), v_i(t))_{i=1}^{P})$ in $\Theta^*(m)$. Then, $\phi(t)$ is continuous on $[0, 1)$, and $\lim_{t \to 1} \phi(t)$ is a permutation of $\phi(1)$.
Proof. Let's write what $\phi(t)$ looks like. Write $i_1 < i_2 < \cdots < i_p$ for the indices where $u_i(0) \neq 0$, and $j_1 < j_2 < \cdots < j_q$ for the indices where $v_i(0) \neq 0$. Denote these index sets by $I, J$. For $t \in [0, 1)$, $\phi(t)$ is
$$\phi(t) = \Big( \frac{u_i(t)}{\sqrt{\|u_i(t)\|_2}}, \ \sqrt{\|u_i(t)\|_2} \Big)_{i \in I} \oplus \Big( \frac{v_i(t)}{\sqrt{\|v_i(t)\|_2}}, \ -\sqrt{\|v_i(t)\|_2} \Big)_{i \in J} \oplus (0, 0)^{m - p - q}.$$
As $I, J$ are fixed for $t \in [0, 1)$ and $u_i(t) \neq 0$ if $u_i(0) \neq 0$, $v_i(t) \neq 0$ if $v_i(0) \neq 0$, $\phi$ is continuous on $[0, 1)$. When $t = 1$, $\lim_{t \to 1} \phi(t)$ may have zeros in the middle, whereas $\phi(1)$ has zeros at the end, and the rest is the same. Hence, $\lim_{t \to 1} \phi(t)$ is a permutation of $\phi(1)$.
Given the machinery, we are ready to elaborate the results. We start with proving that if $m$ is sufficiently large, all permutations of a point $A \in \Theta^*(m)$ are connected. The proof strategy is analogous to the proof in Simsek et al. (2021), where we create an empty slot to permute the weights.
Proposition D.12. Suppose $m \ge M^* + 1$. Take any $(w_j, \alpha_j)_{j=1}^{m} \in \Theta^*(m)$. There exists a continuous path in $\Theta^*(m)$ from $(w_j, \alpha_j)_{j=1}^{m}$ to an arbitrary permutation $(w_{\sigma(j)}, \alpha_{\sigma(j)})_{j=1}^{m}$.
Proof. We start by showing that for any $A = (w_j, \alpha_j)_{j=1}^{m} \in \Theta^*(m)$, we can find a continuous path to some $A' = (w'_j, \alpha'_j)_{j=1}^{m} \in \Theta^*(m)$ that satisfies $\sum_{j=1}^{m} 1(\alpha'_j \neq 0) < m$. First, use Proposition D.7 to find a continuous path from $A$ to some $A_{min} = (w^{\bullet}_j, \alpha^{\bullet}_j)_{j=1}^{m} \in \Theta^*_{min}(m)$. If $\sum_{j=1}^{m} 1(\alpha^{\bullet}_j \neq 0) < m$, we have found such a path. If not, let's show that $\{(Xw^{\bullet}_j)_+\}_{j=1}^{m}$ is linearly dependent. As all $\alpha^{\bullet}_j \neq 0$, all $w^{\bullet}_j \neq 0$. Now think of $\Phi(A_{min}) = (u_i, v_i)_{i=1}^{P}$. We can easily see that
$$\{(Xw^{\bullet}_j)_+ \alpha^{\bullet}_j\}_{j=1}^{m} = \{D_i X u_i\}_{u_i \neq 0} \cup \{-D_i X v_i\}_{v_i \neq 0}.$$
As the latter set has $m > M^*$ elements, it must be linearly dependent; otherwise, it contradicts the fact that the maximal cardinality of an element in $\mathcal{P}^*_{irr}$ is $M^*$. Hence, the set $\{(Xw^{\bullet}_j)_+ \alpha^{\bullet}_j\}_{j=1}^{m}$ is linearly dependent, and as all $\alpha^{\bullet}_j$ are nonzero, the set $\{(Xw^{\bullet}_j)_+\}_{j=1}^{m}$ is linearly dependent. Now consider a nontrivial linear combination
$$\sum_{i=1}^{m} c_i (Xw^{\bullet}_i)_+ = 0.$$
Without loss of generality say $\alpha^{\bullet}_1 c_1 < 0$. Define
$$t_m = \min_{\alpha^{\bullet}_i c_i < 0} -\frac{\alpha^{\bullet}_i}{c_i},$$
and for $t \in [0, t_m]$ define
$$\tilde{w}_i(t) = w^{\bullet}_i \frac{\sqrt{|\alpha^{\bullet}_i + t c_i|}}{\sqrt{\|w^{\bullet}_i\|_2}}, \quad \tilde{\alpha}_i(t) = \sqrt{\|w^{\bullet}_i\|_2} \sqrt{|\alpha^{\bullet}_i + t c_i|} \ \mathrm{sign}(\alpha^{\bullet}_i).$$
From the definition of $t_m$, $\alpha^{\bullet}_i + t c_i$ does not change sign on $[0, t_m]$, i.e. $|\alpha^{\bullet}_i + t c_i| \, \mathrm{sign}(\alpha^{\bullet}_i) = \alpha^{\bullet}_i + t c_i$ for $t \in [0, t_m]$. Also,
$$\sum_{i=1}^{m} (X\tilde{w}_i(t))_+ \tilde{\alpha}_i(t) = \sum_{i=1}^{m} (Xw^{\bullet}_i)_+ (\alpha^{\bullet}_i + t c_i) = \sum_{i=1}^{m} (Xw^{\bullet}_i)_+ \alpha^{\bullet}_i,$$
and
$$\frac{1}{2} \sum_{i=1}^{m} \big(\|\tilde{w}_i(t)\|_2^2 + |\tilde{\alpha}_i(t)|^2\big) = \sum_{i=1}^{m} \|\tilde{w}_i(t)\|_2 |\tilde{\alpha}_i(t)| = \sum_{i=1}^{m} \|w^{\bullet}_i\|_2 |\alpha^{\bullet}_i| + \|w^{\bullet}_i\|_2 \, t c_i \, \mathrm{sign}(\alpha^{\bullet}_i).$$
At last, we know that for the dual optimum $\nu^*$ defined in Theorem 1,
$$(\nu^*)^T (Xw^{\bullet}_j)_+ = -\beta \|w^{\bullet}_j\|_2 \, \mathrm{sign}(\alpha^{\bullet}_j) \quad \text{for all } j \in [m].$$
This is obtained by using the fact that $\Phi(A_{min}) \in \mathcal{P}^*$. Hence, if $\sum_{i=1}^{m} c_i (Xw^{\bullet}_i)_+ = 0$, multiplying $(\nu^*)^T$ on both sides leads to $\sum_{i=1}^{m} \|w^{\bullet}_i\|_2 \, t c_i \, \mathrm{sign}(\alpha^{\bullet}_i) = 0$, and
$$\frac{1}{2} \sum_{i=1}^{m} \big(\|\tilde{w}_i(t)\|_2^2 + |\tilde{\alpha}_i(t)|^2\big) = \sum_{i=1}^{m} \|w^{\bullet}_i\|_2 |\alpha^{\bullet}_i| = \frac{1}{2} \sum_{i=1}^{m} \big(\|w^{\bullet}_i\|_2^2 + |\alpha^{\bullet}_i|^2\big).$$
Hence the objective is preserved throughout the curve, meaning the curve is in $\Theta^*(m)$. The number of nonzero $\alpha_i$ decreases by at least 1 at the end of the curve due to the definition of $t_m$. Now that we can find a continuous path from $A$ to $A' = (w'_j, \alpha'_j)_{j=1}^{m} \in \Theta^*(m)$ where $\sum_{j=1}^{m} 1(\alpha'_j \neq 0) < m$, we will find a path from $A'$ to any permutation of $A'$, namely $(w'_{\sigma(j)}, \alpha'_{\sigma(j)})_{j=1}^{m}$ for some permutation $\sigma : [m] \to [m]$.
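The curve that empties one neuron can be evaluated directly. The sketch below is an illustration under assumptions made here: the names `shrink_curve` and `t_max` are hypothetical, and the rescaling uses square roots so that each neuron stays balanced with its first and second layer weights of equal magnitude. The toy data in the usage lines is not an optimal network, so only the preservation of the network output is illustrated. Preservation of the regularization term additionally relies on the dual condition from Theorem 1, which is not checked here.

```python
import numpy as np

def t_max(alpha, c):
    """First time some alpha_i + t * c_i reaches zero (the t_m of the proof)."""
    return min(-a / ci for a, ci in zip(alpha, c) if a * ci < 0)

def shrink_curve(W, alpha, c, t):
    """Evaluate the rescaled neurons at time t in [0, t_m].

    W: list of nonzero first-layer weights, alpha: nonzero second-layer weights,
    c: coefficients of a linear dependence sum_i c_i (X w_i)_+ = 0.
    """
    W_t, alpha_t = [], []
    for w, a, ci in zip(W, alpha, c):
        mag = abs(a + t * ci)
        nw = np.linalg.norm(w)
        W_t.append(w * np.sqrt(mag / nw))
        alpha_t.append(np.sqrt(nw * mag) * np.sign(a))
    return W_t, alpha_t

# Usage on toy data: (Xw_1)_+ + (Xw_2)_+ - (Xw_3)_+ = 0 with X = I.
X = np.eye(2)
W = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
alpha = [1.0, 1.0, 1.0]
c = [1.0, 1.0, -1.0]
tm = t_max(alpha, c)
W_end, alpha_end = shrink_curve(W, alpha, c, tm)  # third neuron is emptied
```

At `t = tm` the third neuron has zero second layer weight, which is the empty slot that the permutation argument uses next.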
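A simple path construction is as follows: we know that at least one $\alpha'_i = 0$. Let that $i = m$ without loss of generality. Starting from $i_0 = 1$, we do the following: if $w'_{i_0} = w'_{\sigma(i_0)}$, we do nothing. If $w'_{i_0} \neq w'_{\sigma(i_0)}$, we first write
$$w'_{i_0}(t) = w'_{i_0}\sqrt{1 - t}, \quad \alpha'_{i_0}(t) = \alpha'_{i_0}\sqrt{1 - t}, \quad w'_m(t) = w'_{i_0}\sqrt{t}, \quad \alpha'_m(t) = \alpha'_{i_0}\sqrt{t},$$
for $t \in [0, 1]$, which intuitively 'moves' $w'_{i_0}$ to the empty slot $w'_m$ and makes slot $i_0$ empty. Next we move $w'_{\sigma(i_0)}$ to slot $i_0$ with
$$w'_{i_0}(t) = w'_{\sigma(i_0)}\sqrt{t}, \quad \alpha'_{i_0}(t) = \alpha'_{\sigma(i_0)}\sqrt{t}, \quad w'_{\sigma(i_0)}(t) = w'_{\sigma(i_0)}\sqrt{1 - t}, \quad \alpha'_{\sigma(i_0)}(t) = \alpha'_{\sigma(i_0)}\sqrt{1 - t},$$
for $t \in [0, 1]$, which intuitively 'moves' $w'_{\sigma(i_0)}$ to the empty slot $i_0$ and makes slot $\sigma(i_0)$ empty. At last, we make slot $m$ empty by using
$$w'_{\sigma(i_0)}(t) = w'_{i_0}\sqrt{t}, \quad \alpha'_{\sigma(i_0)}(t) = \alpha'_{i_0}\sqrt{t}, \quad w'_m(t) = w'_{i_0}\sqrt{1 - t}, \quad \alpha'_m(t) = \alpha'_{i_0}\sqrt{1 - t}.$$
Each such move keeps both the network output and the regularization unchanged, since $(Xw\sqrt{1 - t})_+ \alpha\sqrt{1 - t} + (Xw\sqrt{t})_+ \alpha\sqrt{t} = (Xw)_+ \alpha$ and the squared norms of the two copies add up to $\|w\|_2^2 + \alpha^2$.
To wrap up, we may swap the elements $(w_i, \alpha_i)$ and $(w_{\sigma(i)}, \alpha_{\sigma(i)})$ by first moving $w_i$ to $w_m$, then moving $w_{\sigma(i)}$ to $w_i$, and at last moving $w_m$ to $w_{\sigma(i)}$.
Up to here we connected $A = (w_j, \alpha_j)_{j=1}^{m}$ with $A'$, and then $A'$ with a permutation of $A'$. To connect $A$ with $(w_{\sigma(j)}, \alpha_{\sigma(j)})_{j=1}^{m}$, apply the path $A \to A'$ to the permuted coordinates and run it backwards, starting from the permuted $A'$ and ending at $(w_{\sigma(j)}, \alpha_{\sigma(j)})_{j=1}^{m}$.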
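A single 'move' of a neuron into an empty slot is easy to verify numerically. The sketch below is an illustration with a hypothetical helper name, `move_neuron`; it splits a neuron between its original slot and the empty slot using the square-root weights from the construction. The usage lines evaluate the quantities that must stay constant along the move: the network output and the squared-norm regularization.

```python
import numpy as np

def move_neuron(w, a, t):
    """Split neuron (w, a) between its own slot and an empty slot, t in [0, 1].

    Returns ((w_src, a_src), (w_dst, a_dst)); at t = 0 everything sits in the
    source slot, at t = 1 everything sits in the destination slot.
    """
    src = (w * np.sqrt(1.0 - t), a * np.sqrt(1.0 - t))
    dst = (w * np.sqrt(t), a * np.sqrt(t))
    return src, dst

# Usage: track output and regularization halfway through the move.
X = np.array([[1.0, -2.0], [0.5, 3.0], [-1.0, 1.0]])
w, a = np.array([2.0, 1.0]), -1.5
(ws, a_s), (wd, a_d) = move_neuron(w, a, 0.5)
output_mid = np.maximum(X @ ws, 0.0) * a_s + np.maximum(X @ wd, 0.0) * a_d
reg_mid = ws @ ws + a_s**2 + wd @ wd + a_d**2
```

The output is preserved because the ReLU is positively homogeneous, so the two copies contribute fractions $1 - t$ and $t$ of the original neuron's output.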
Proposition D.12 enables us to connect two different permutations. Even though $\Psi$ is not necessarily continuous, the fact that two permutations are connected allows us to construct paths in $\Theta^*(m)$ from paths in $\mathcal{P}^*(m)$.
Proposition D.13. Suppose $m \ge M^* + 1$. If any two points $A, B \in \mathcal{P}^*(m)$ are connected by a path with finitely many cardinality changes, $\Theta^*(m)$ is connected.
Proof. Take two points $A, B \in \Theta^*(m)$. First use Proposition D.7 to find a path from $A$ to $A_{min} \in \Theta^*_{min}(m)$ and from $B$ to $B_{min} \in \Theta^*_{min}(m)$. Our main goal is to connect $A_{min}$ and $B_{min}$ by using the path from $\Phi(A_{min})$ to $\Phi(B_{min})$.
Consider the continuous path from $\Phi(A_{min})$ to $\Phi(B_{min})$, namely $f : [0, 1] \to \mathcal{P}^*(m)$ satisfying $f(0) = \Phi(A_{min})$, $f(1) = \Phi(B_{min})$. Write $f(t) = (u_i(t), v_i(t))_{i=1}^{P}$. Divide $[0, 1]$ into intervals $(t_0 = 0, t_1), (t_1, t_2), \cdots, (t_{k-1}, t_k = 1)$, where in each interval each $u_i, v_i$ is either always zero or always nonzero. We have finitely many such $t_i$ because we assume that the number of cardinality changes is finite. From Proposition D.11, we can see that the path $\Psi \circ f(t)$ is continuous on each interval. However, as we saw in Proposition D.11, $\Psi \circ f(t_i)$, $\lim_{t \to t_i^-} \Psi \circ f(t)$, $\lim_{t \to t_i^+} \Psi \circ f(t)$ are all permutations of each other. We construct a path from $\Psi \circ f(0)$ to $\Psi \circ f(1)$ as follows. First, for each $p = 0, 1, \cdots, k - 1$, construct a path from $\lim_{t \to t_p^+} \Psi \circ f(t)$ to $\lim_{t \to t_{p+1}^-} \Psi \circ f(t)$ by defining
$$g(t) = \begin{cases} \lim_{\tau \to t_p^+} \Psi \circ f(\tau) & \text{if } t = t_p, \\ \Psi \circ f(t) & \text{if } t \in (t_p, t_{p+1}), \\ \lim_{\tau \to t_{p+1}^-} \Psi \circ f(\tau) & \text{if } t = t_{p+1}, \end{cases}$$
for $t \in [t_p, t_{p+1}]$. It is clear that $g$ is continuous. Moreover, we can connect each $\Psi \circ f(t_p)$ with $\lim_{t \to t_p^+} \Psi \circ f(t)$ and $\lim_{t \to t_p^-} \Psi \circ f(t)$, because from Proposition D.12, we know that when $m \ge M^* + 1$, two permutations are connected in $\Theta^*(m)$, and from Proposition D.11, each one-sided limit is a permutation of the image. Hence, a connection from $\Psi \circ f(0)$ to $\Psi \circ f(1)$ is possible, by connecting permutations at the boundaries $t_0, t_1, \cdots, t_k$ of the intervals and moving with $\Psi \circ f$ inside the intervals. Hence, $\Psi \circ \Phi(A_{min})$ and $\Psi \circ \Phi(B_{min})$ are connected. From Proposition D.9, we know that $\Psi \circ \Phi(A_{min})$ is a permutation of $A_{min}$, and as $m \ge M^* + 1$ we know that they are connected. The same holds for $\Psi \circ \Phi(B_{min})$. This means that we have found a continuous path from $A_{min}$ to $B_{min}$. At the beginning we connected $A$ with $A_{min}$ and $B$ with $B_{min}$, which finishes the proof.
For discontinuity, we use the property of isolated points A ∈ P * can ∩ P * (m) with cardinality m. Proposition D.14. Suppose m ≥ m * . If A ∈ P * (m) ∩ P * can is an isolated point in P * (m) and has
card(A) = m, Ψ(A) is an isolated point in Θ * (m). Proof. Assume the existence of a continuous function f : [0, 1] → Θ * (m) that satisfies Ψ(A) = f (0), Φ • f (1) ̸ = A. Consider the path Φ • f (t) in P * (m). As A ∈ P * (m) ∩ P * can , Φ(Ψ(A)) = A ̸ = Φ • f (1), which is a contradiction that A is an isolated point in P * (m). Hence, Ψ(A) does not have a path into Θ * (m) -Φ -1 (A).
To finish the proof, we show that $\Phi^{-1}(A)$ is finite. If $\Phi^{-1}(A)$ is finite, we cannot move from $\Psi(A)$ to a point in $\Phi^{-1}(A) \setminus \{\Psi(A)\}$ using only points in $\Phi^{-1}(A)$, meaning we do not have a path from $\Psi(A)$ to $\Theta^*(m) \setminus \{\Psi(A)\}$, proving our claim. Suppose $\Phi((w_i, \alpha_i)_{i=1}^m) = A$. If $\mathrm{diag}(1(Xw_i \geq 0)) = \mathrm{diag}(1(Xw_j \geq 0)) = D_p$ for some $i \neq j$ with $\alpha_i \alpha_j > 0$, the two indices $i$ and $j$ correspond to the same $u_p$ or $v_p$, and $\mathrm{card}(\Phi((w_i, \alpha_i)_{i=1}^m)) < m$, which is a contradiction. Hence, $(w_i, \alpha_i)_{i=1}^m \in \Theta^*_{\min}(m)$.
From Proposition D.9, we know that $\Psi(\Phi((w_i, \alpha_i)_{i=1}^m)) = \Psi(A)$ is a permutation of $(w_i, \alpha_i)_{i=1}^m$. This means $\Phi^{-1}(A)$ is contained in the set of permutations of $\Psi(A)$, which is finite.
Proposition D.13 and Proposition D.14 enable us to relate the connectivity of $\Theta^*(m)$ to the connectivity of $P^*(m)$. Combining the results that we obtained for $P^*(m)$, more specifically Proposition D.2, Proposition D.3, Proposition D.5, and Theorem D.1, with Proposition D.13 and Proposition D.14, we arrive at the staircase of connectivity stated in Theorem 2.
Theorem D.2. (Theorem 2 of the paper) (The staircase of connectivity) Denote the optimal solution set of equation 3 in parameter space as $\Theta^*(m) \subseteq \mathbb{R}^{(d+1)m}$. Suppose $L$ is a strictly convex loss function and there exists $(w_i, \alpha_i)_{i=1}^m \neq (0, 0)_{i=1}^m$ in $\Theta^*(m)$ for some $m$. Let $m^*$, $M^*$ be the two critical values defined in Theorem 2. As $m$ changes, we have that when
(i) $m = m^*$, $\Theta^*(m)$ is a finite set. Hence, for any two optimal points $A \neq A' \in \Theta^*(m)$, there is no path from $A$ to $A'$ inside $\Theta^*(m)$.
(ii) $m \geq m^* + 1$, there exist optimal points $A, A' \in \Theta^*(m)$ and a path in $\Theta^*(m)$ connecting them.
(iii) $m = M^*$, $\Theta^*(m)$ is not a connected set. Moreover, there exists $A \in \Theta^*(m)$ which is an isolated point, i.e. there is no path in $\Theta^*(m)$ that connects $A$ with any $A' \neq A \in \Theta^*(m)$.
(iv) $m \geq M^* + 1$, permutations of the solution are connected. Hence, for all $A \in \Theta^*(m)$, there exists $A' \neq A$ in $\Theta^*(m)$ and a path in $\Theta^*(m)$ that connects $A$ and $A'$.
(v) $m \geq \min\{m^* + M^*, n + 1\}$, the set $\Theta^*(m)$ is connected, i.e. for any two optimal points $A \neq A' \in \Theta^*(m)$, there exists a continuous path from $A$ to $A'$.
Proof. The proof of (i) starts by observing that $A \in \Theta^*_{\min}(m^*)$ if $A \in \Theta^*(m^*)$. If not, we can find a solution $A' \in \Theta^*_{\min}(m^* - 1)$ using Proposition D.7, and its image $\Phi(A') \in P^*(m^* - 1)$. Then $\Phi(A')$ is connected with a point in $P^*(m^* - 1) \cap P^*_{irr}$ by Proposition D.4, which contradicts the minimality of $m^*$. Now, for any $A \in \Theta^*(m^*)$, $\Phi(A) = B \in P^*(m^*)$ satisfies that $\Psi(B)$ is a permutation of $A$. Hence, $\Theta^*(m^*)$ is contained in the set of permutations of $\Psi(P^*(m^*))$, and as

Section: E PROOFS IN SECTION 3.3
In this section, we give the examples constructed in Section 3.3 and their rigorous proofs.
The specific examples we present are the following:
Proposition E.1. Suppose $n = 3$ and the input data is given as
$$\{(x_{1i}, x_{2i}, y_i)\}_{i=1}^3 = \{(1, 0, 1/6),\ (-1/2, \sqrt{3}/2, 2/3),\ (-1/2, -\sqrt{3}/2, 1/6)\}, \quad X = [x_1\ x_2] \in \mathbb{R}^{3 \times 2}.$$
Then, the minimization problem in equation 8 with free skip connections and without bias, namely the SNB problem, has at least two different solutions in $F^*$.
Proof. Let's consider the set
$$Q_X = \{(Xu)_+ \mid \|u\|_2 \leq 1\}.$$
The six possible hyperplane arrangement patterns $1(Xu \geq 0)$ are (001), (010), (100), (011), (101), (110), and drawing the set $Q_X$ gives the shape in Figure 7.
(Xw i ) + α i = y, ∥w i ∥ 2 ≤ 1 ∀i ∈ [m].
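Now, choose $\nu = [1, 1, 1]^T$, which satisfies $X^T \nu = 0$. For any feasible $w_0, (w_i, \alpha_i)_{i=1}^m$, we know that
$$\nu^T X w_0 + \sum_{i=1}^m \nu^T (Xw_i)_+ \alpha_i = \sum_{i=1}^m \nu^T (Xw_i)_+ \alpha_i = \langle \nu, y \rangle,$$
and
$$\langle \nu, y \rangle \leq \sum_{i=1}^m |\nu^T (Xw_i)_+|\,|\alpha_i| \leq \sum_{i=1}^m |\alpha_i|,$$
which means that the objective value is lower bounded by $\langle \nu, y \rangle = 1$. Finally, we give two models that have different breaklines and both have objective value 1. The two models are
$$w_0 = -\frac{1}{3}\begin{bmatrix} 1 \\ 0 \end{bmatrix},\quad w_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 0 \end{bmatrix},\quad w_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} -1/2 \\ \sqrt{3}/2 \end{bmatrix},\quad \alpha_1 = \frac{1}{\sqrt{2}},\quad \alpha_2 = \frac{1}{\sqrt{2}},$$
and
$$w_0 = -\frac{1}{3}\begin{bmatrix} -1/2 \\ -\sqrt{3}/2 \end{bmatrix},\quad w_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} -1/2 \\ -\sqrt{3}/2 \end{bmatrix},\quad w_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} -1/2 \\ \sqrt{3}/2 \end{bmatrix},\quad \alpha_1 = \frac{1}{\sqrt{2}},\quad \alpha_2 = \frac{1}{\sqrt{2}}.$$
By direct substitution, both are valid interpolators, and a direct calculation shows that both have objective value 1. Finally, the breaklines differ, as the breaklines directly correspond to the weight vectors $w_1, w_2$: this means that the two optimal functions are different.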
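The direct substitution in the proof of Proposition E.1 can be checked numerically. The sketch below is only an illustration: the helper names `predict` and `cost` are hypothetical, and the objective is assumed to be $\frac{1}{2}\sum_i (\|w_i\|_2^2 + \alpha_i^2)$ with an unregularized skip weight $w_0$, which matches the stated value 1 but is not copied from equation 8.

```python
import numpy as np

# Data of Proposition E.1 (rows of X are the inputs x_i).
s3 = np.sqrt(3.0)
X = np.array([[1.0, 0.0], [-0.5, s3 / 2], [-0.5, -s3 / 2]])
y = np.array([1 / 6, 2 / 3, 1 / 6])
nu = np.ones(3)  # dual certificate used in the proof


def predict(w0, W, alpha):
    """Free skip connection X w0 plus ReLU neurons; row i of W is w_i."""
    return X @ w0 + np.maximum(X @ W.T, 0.0) @ alpha


def cost(W, alpha):
    """Assumed objective: 0.5 * sum_i (||w_i||^2 + alpha_i^2); w0 is not penalized."""
    return 0.5 * (np.sum(W ** 2) + np.sum(alpha ** 2))


r = 1 / np.sqrt(2.0)
model_a = (-np.array([1.0, 0.0]) / 3,
           r * np.array([[1.0, 0.0], [-0.5, s3 / 2]]),
           np.array([r, r]))
model_b = (-np.array([-0.5, -s3 / 2]) / 3,
           r * np.array([[-0.5, -s3 / 2], [-0.5, s3 / 2]]),
           np.array([r, r]))

fits = [bool(np.allclose(predict(*m), y)) for m in (model_a, model_b)]
costs = [float(cost(m[1], m[2])) for m in (model_a, model_b)]
skip_is_free = bool(np.allclose(X.T @ nu, 0.0))  # X^T nu = 0
lower_bound = float(nu @ y)                      # <nu, y>
print(fits, costs, skip_is_free, lower_bound)
```

Both models interpolate the data and attain the lower bound $\langle \nu, y \rangle$, while their first neurons $w_1$ point in different directions.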
Proposition E.2. Suppose $n = 4$ and the input data is given as $\{(x_{1i}, x_{2i}, y_i)\}_{i=1}^4 = \{(1, 0, 1), (0, 1, -1), (-1, 0, 1), (0, -1, -1)\}$, $X = [x_1\ x_2] \in \mathbb{R}^{n \times 2}$. Then, the minimization problem in equation 8, namely the SB problem, has at least two different solutions in $F^*$.
Proof. We follow the same proof strategy as in Proposition E.1, writing $\tilde{X} = [X \mid 1] \in \mathbb{R}^{n \times 3}$ and bounding $|\nu^T (\tilde{X}u)_+|$ for $\nu = [1, -1, 1, -1]^T$ and $\|u\|_2 \leq 1$. Note that $\tilde{X}^T \nu = 0$. It is not easy to visualize the shape of $\{(\tilde{X}u)_+ \mid \|u\|_2 \leq 1\}$ as in Figure 7, but we can solve the optimization problem by splitting the domain of $u$ into 14 regions on which the function $(\tilde{X}u)_+$ is linear. There are 14 such regions due to the classical result of Cover (1965). A simple convex optimization over these 14 linear regions yields $|\nu^T (\tilde{X}u)_+| \leq 1$ for all $\|u\|_2 \leq 1$. Another way to see this is to write $u = [a, b, c]^T$ and $\nu^T (\tilde{X}u)_+$ as
$$\frac{1}{2}\big(|a + c| + |a - c| - |b + c| - |b - c|\big),$$
and with the constraint $a^2 + b^2 + c^2 \leq 1$, the above formula is bounded between $-1$ and $1$. Using the same scaling trick, we see that the lower bound of the objective is $\langle \nu, y \rangle$ as in Proposition E.1, which is 4. Finally, choose two models as
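$$w_0 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix},\quad w_1 = \begin{bmatrix} 0 \\ \sqrt{2} \\ 0 \end{bmatrix},\quad w_2 = \begin{bmatrix} 0 \\ -\sqrt{2} \\ 0 \end{bmatrix},\quad \alpha_1 = -\sqrt{2},\quad \alpha_2 = -\sqrt{2},$$
and
$$w_0 = \begin{bmatrix} 0 \\ 0 \\ -1 \end{bmatrix},\quad w_1 = \begin{bmatrix} \sqrt{2} \\ 0 \\ 0 \end{bmatrix},\quad w_2 = \begin{bmatrix} -\sqrt{2} \\ 0 \\ 0 \end{bmatrix},\quad \alpha_1 = \sqrt{2},\quad \alpha_2 = \sqrt{2}.$$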
Both solutions have cost 4 and interpolate the data. With some simplification, we can see that the first solution gives $f(x, y) = 1 - 2(y)_+ - 2(-y)_+$, whereas the second solution gives $f(x, y) = -1 + 2(x)_+ + 2(-x)_+$. Hence we have two different minimum-norm interpolators. A visualization (and the symmetry behind it) can be found in Figure 8.
Figure 8: (a) Interpolator $f(x, y) = 1 - 2(y)_+ - 2(-y)_+$. (b) Interpolator $f(x, y) = -1 + 2(x)_+ + 2(-x)_+$.
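The claims in the proof of Proposition E.2 can be sanity-checked in the same way. The following sketch (helper names are hypothetical, and the cost is again assumed to be $\frac{1}{2}\sum_i (\|w_i\|_2^2 + \alpha_i^2)$) verifies that both models interpolate with cost 4, that the closed form for $\nu^T(\tilde{X}u)_+$ agrees with direct evaluation on random unit vectors, and that the sampled values stay in $[-1, 1]$. Random sampling only supports the bound; it does not prove it.

```python
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
Xt = np.hstack([X, np.ones((4, 1))])          # X tilde = [X | 1]
y = np.array([1.0, -1.0, 1.0, -1.0])
nu = np.array([1.0, -1.0, 1.0, -1.0])


def predict(Z, w0, W, alpha):
    """Evaluate skip term plus ReLU neurons on design matrix Z (rows [x, y, 1])."""
    return Z @ w0 + np.maximum(Z @ W.T, 0.0) @ alpha


def cost(W, alpha):
    """Assumed objective: 0.5 * sum_i (||w_i||^2 + alpha_i^2)."""
    return 0.5 * (np.sum(W ** 2) + np.sum(alpha ** 2))


r = np.sqrt(2.0)
model_a = (np.array([0.0, 0.0, 1.0]),
           np.array([[0.0, r, 0.0], [0.0, -r, 0.0]]),
           np.array([-r, -r]))
model_b = (np.array([0.0, 0.0, -1.0]),
           np.array([[r, 0.0, 0.0], [-r, 0.0, 0.0]]),
           np.array([r, r]))

fits = [bool(np.allclose(predict(Xt, *m), y)) for m in (model_a, model_b)]
costs = [float(cost(m[1], m[2])) for m in (model_a, model_b)]

# The two interpolators differ away from the data, e.g. at the origin.
origin = np.array([[0.0, 0.0, 1.0]])
at_origin = [float(predict(origin, *m)[0]) for m in (model_a, model_b)]

# Random unit vectors u = [a, b, c]: compare direct evaluation with the closed form.
rng = np.random.default_rng(0)
U = rng.normal(size=(20000, 3))
U /= np.linalg.norm(U, axis=1, keepdims=True)
vals = np.maximum(U @ Xt.T, 0.0) @ nu
a, b, c = U.T
closed_form = 0.5 * (np.abs(a + c) + np.abs(a - c) - np.abs(b + c) - np.abs(b - c))
identity_ok = bool(np.allclose(vals, closed_form))
bound_ok = bool(np.max(np.abs(vals)) <= 1.0 + 1e-9)
print(fits, costs, at_origin, identity_ok, bound_ok)
```

The closed form equals $\max(|a|, |c|) - \max(|b|, |c|)$, which makes the bound by 1 on the unit ball immediate.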
$\|w_i\|_2^2 + \alpha_i^2$, subject to $Xu + \sum_{i=1}^m (Xw_i)_+ \alpha_i = y$,
where for input X in , X = X in in the unbiased case and X = [X in |1] in the biased case. Note that m is also an optimization variable that is not fixed. When the input is 2-dimensional, we have a non-unique minimum-norm interpolator regardless of bias.
Proof. This is a direct consequence of Proposition E.1 and Proposition E.2.
Proposition E.4. (Proposition 3 of the paper) Consider the minimum-norm interpolation problem without free skip connections and with regularized bias, namely the NSB problem. Take $n$ vectors $v_1, \cdots, v_n$ in $\mathbb{R}^2$ that satisfy
$$v_n = \Big[\frac{\sqrt{3}}{2}, \frac{1}{2}\Big]^T,\quad \|s_k\|_2 = 1,\ s_k > 0\ \ \forall k \in [n-1],\quad s_n = [0, 1]^T,\quad v_{i,2} > 0\ \ \forall i \in [n],$$
where we write
$$s_k = \sum_{i=1}^k v_{n-i+1}.$$
Now, choose $x_i = v_{i,1}/v_{i,2}$ and choose $y$ as any conic combination of the $n + 1$ vectors
$$(s_{i,1} x + s_{i,2} 1)_+\ \ \forall i \in [n],\qquad \big((s_n - v_n)_1 x + (s_n - v_n)_2 1\big)_+,$$
with positive weights. Then, there exist infinitely many minimum-norm interpolators.
Proof. Let's choose $x$ as proposed in Proposition 3. Also, let's choose $\nu \in \mathbb{R}^n$ as $\nu_i = v_{i,2}$. Write $\tilde{X} = [x \mid 1] \in \mathbb{R}^{n \times 2}$. Then the NSB problem is written as
$$\min_{m, \{w_i, \alpha_i\}_{i=1}^m} \sum_{i=1}^m \|w_i\|_2^2 + |\alpha_i|^2 \quad \text{subject to} \quad \sum_{i=1}^m (\tilde{X}w_i)_+ \alpha_i = y.$$
We first show that
$$\max_{\|u\|_2 \leq 1} |\nu^T (\tilde{X}u)_+| = 1, \qquad (19)$$
and the solutions are $s_1, s_2, \cdots, s_n, s_n - v_n$.
The first thing to observe is that $x_1 < x_2 < \cdots < x_n$. To prove this, let's see that for $i = 2, 3, \cdots, n-1$,
$$x_{i-1} < -\frac{s_{n-i+1,2}}{s_{n-i+1,1}} < x_i.$$
The reason is the following: for $i = 2, 3, \cdots, n-1$, we know $\|s_{n-i+1} + v_{i-1}\|_2 = 1$, and as $\|s_{n-i+1}\|_2 = 1$ we know $s_{n-i+1} \cdot v_{i-1} = -\frac{1}{2}\|v_{i-1}\|_2^2 < 0$. Hence, $s_{n-i+1,1} v_{i-1,1} + s_{n-i+1,2} v_{i-1,2} < 0$, and as $s_{n-i+1,1}, s_{n-i+1,2}, v_{i-1,2} > 0$, we have $s_{n-i+1,2}/s_{n-i+1,1} < -v_{i-1,1}/v_{i-1,2} = -x_{i-1}$. Similarly, $\|s_{n-i+1} - v_i\|_2 = 1$, and as $s_{n-i+1} \cdot v_i > 0$, we have $s_{n-i+1,2}/s_{n-i+1,1} > -v_{i,1}/v_{i,2} = -x_i$. This means $x_{i-1} < x_i$ for $i = 2, 3, \cdots, n-1$, and $x_1 < x_2 < \cdots < x_{n-1}$. Finally, we have $v_{n-1} \cdot v_n < 0$ because $\|v_n\|_2 = \|v_n + v_{n-1}\|_2 = 1$, meaning $x_{n-1} < 0$, whereas $x_n = \sqrt{3} > 0$, meaning $x_1 < x_2 < \cdots < x_n$.
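The two sign facts used in this ordering argument follow from expanding squared norms. As a short worked step, write $s = s_{n-i+1}$ and use only $\|s\|_2 = 1$ together with the observation (from the definition of $s_k$) that $s + v_{i-1} = s_{n-i+2}$ and $s - v_i = s_{n-i}$ are unit vectors:

```latex
\begin{align*}
1 = \|s + v_{i-1}\|_2^2 &= \|s\|_2^2 + 2\, s \cdot v_{i-1} + \|v_{i-1}\|_2^2
  \;\Longrightarrow\; s \cdot v_{i-1} = -\tfrac{1}{2}\|v_{i-1}\|_2^2 < 0, \\
1 = \|s - v_i\|_2^2 &= \|s\|_2^2 - 2\, s \cdot v_i + \|v_i\|_2^2
  \;\Longrightarrow\; s \cdot v_i = \tfrac{1}{2}\|v_i\|_2^2 > 0.
\end{align*}
```

The strict inequalities use $v_{i-1} \neq 0$ and $v_i \neq 0$, which hold because $v_{i,2} > 0$ for all $i \in [n]$.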
Now we consider the possible arrangement patterns $\mathrm{diag}(1(\tilde{X}u \geq 0))$. We can see that the possible patterns are
$$\mathrm{diag}([0, 0, \cdots, 0]),\ \mathrm{diag}([0, 0, \cdots, 0, 1]),\ \mathrm{diag}([0, 0, \cdots, 1, 1]),\ \cdots,\ \mathrm{diag}([0, 1, \cdots, 1, 1]),\ \mathrm{diag}([1, 1, \cdots, 1, 1]),\ \cdots,\ \mathrm{diag}([1, 0, \cdots, 0, 0]).$$
In other words, starting from the $n$-th entry, 0 turns to 1 in reverse order, then we have all ones, then 1s become 0s starting from the $n$-th entry. Let's denote these diagonal matrices $D_1, D_2, \cdots, D_{2n}$. Solving equation 19 is equivalent to solving, for each $i$,
$$\max_{(2D_i - I)\tilde{X}u \geq 0,\ \|u\|_2 \leq 1} \nu^T D_i \tilde{X} u.$$
The absolute value is dropped because $\nu > 0$.
For $D_1$, the objective is 0. For $D_2$ to $D_{n+2}$, we first know that $\|\nu^T D_i \tilde{X}\|_2 = 1$. To see this, observe that $\|\nu^T D_2 \tilde{X}\|_2 = \|\nu_n [x_n, 1]\|_2 = \|v_n\|_2 = 1$, $\|\nu^T D_3 \tilde{X}\|_2 = \|\nu_n [x_n, 1] + \nu_{n-1}[x_{n-1}, 1]\|_2 = \|v_n + v_{n-1}\|_2 = 1$, $\cdots$, $\|\nu^T D_{n+1} \tilde{X}\|_2 = \|\sum_{i=1}^n v_i\|_2 = 1$, and $\|\nu^T D_{n+2} \tilde{X}\|_2 = \|\sum_{i=1}^{n-1} v_i\|_2 = \|[-\frac{\sqrt{3}}{2}, \frac{1}{2}]\|_2 = 1$. For $D_{n+3}$ to $D_{2n}$, we can also see that $\|\nu^T D_i \tilde{X}\|_2 < 1$. That is because $\nu^T D_{n+k} \tilde{X} = s_n - s_{k-1}$ for $k \geq 2$. We have $\|s_n - s_{k-1}\|_2^2 = 2 - 2 s_n \cdot s_{k-1}$, and as we know $1/2 = s_n \cdot s_1 < s_n \cdot s_2 < \cdots < s_n \cdot s_{n-1}$ (each inner product is a sum of the $v_{i,2}$, and as $v_{i,2} > 0$ we have the property), $\|s_n - s_{k-1}\|_2 < 1$ for $k = 3, 4, \cdots, n$.
We can then see that
$$\max_{(2D_i - I)\tilde{X}u \geq 0,\ \|u\|_2 \leq 1} \nu^T D_i \tilde{X} u \leq \max_{i \in [2n]} \|\nu^T D_i \tilde{X}\|_2 = 1.$$
The last thing to check is that $s_1, s_2, \cdots, s_n, s_n - v_n$ are actual solutions. For $i = 2, 3, \cdots, n-1$, we know that
$$x_1 < x_2 < \cdots < x_{i-1} < -\frac{s_{n-i+1,2}}{s_{n-i+1,1}} < x_i < x_{i+1} < \cdots < x_n,$$
and when we write $k_i = [x_i, 1]^T$, we have that $k_n \cdot s_{n-i+1} > 0, \cdots, k_i \cdot s_{n-i+1} > 0$, $k_{i-1} \cdot s_{n-i+1} < 0, \cdots, k_1 \cdot s_{n-i+1} < 0$ for all $i = 2, 3, \cdots, n-1$. Hence, for $s_2, s_3, \cdots, s_{n-1}$, $(2D_{i+1} - I)\tilde{X}s_i \geq 0$. As $s_i = \tilde{X}^T D_{i+1} \nu$, for these $s_i$ we have $\nu^T (\tilde{X}s_i)_+ = 1$.
The three remaining cases to check are $s_1$, $s_n$, and $s_n - v_n$. When $i = n$, $s_n = [0, 1]^T$ and as all $k_i$ have positive second coordinates, $s_n \cdot k_i > 0$ for all $i \in [n]$ and indeed $s_n$ is a solution. Also, we know that $\|s_1 + v_{n-1}\|_2 = 1$, meaning $x_1 < x_2 < \cdots < x_{n-1} < -s_{1,2}/s_{1,1}$. Hence, $k_{n-1} \cdot s_1 < 0, \cdots, k_1 \cdot s_1 < 0$. As $s_1$ and $k_n$ are parallel, we know $k_n \cdot s_1 > 0$. As in the other cases, since $\|s_1\|_2 = 1$, $s_1$ is a solution. Finally, let's check that $(2D_{n+2} - I)\tilde{X}(s_n - v_n) \geq 0$. As $v_n = [\sqrt{3}/2, 1/2]^T$, $s_n - v_n = [-\sqrt{3}/2, 1/2]^T$, $v_n \cdot (s_n - v_n) < 0$ and $k_n \cdot (s_n - v_n) < 0$. For $i \in [n-1]$, we know $x_i < 0$. Hence, $-\sqrt{3}x_i + 1 > 0$.
This means $(2D_{n+2} - I)\tilde{X}(s_n - v_n) > 0$, and as $\|s_n - v_n\|_2 = 1$, $s_n - v_n$ is a solution too. Now that we have found $n + 1$ different solutions to the problem in equation 19, let's denote them $w_1, w_2, \cdots, w_{n+1}$. $y$ is chosen as any conic combination
$$y = \sum_{i=1}^{n+1} c_i (\tilde{X}w_i)_+, \qquad (20)$$
where $c_i > 0$. We know that the interpolation problem is equivalent to
$$\min t \quad \text{subject to} \quad y \in t\,\mathrm{Conv}(Q_X \cup -Q_X),$$
where $Q_X = \{(\tilde{X}u)_+ \mid \|u\|_2 \leq 1\}$ Pilanci & Ergen (2020). In other words, the minimum-norm interpolation problem without free skip connections and with regularized bias is equivalent to
$$\min_{m, (z_i, d_i)_{i=1}^m} \sum_{i=1}^m |d_i|, \quad y = \sum_{i=1}^m d_i (\tilde{X}z_i)_+, \quad \text{for some } \|z_i\|_2 \leq 1,\ i \in [m].$$
For any $d_i, z_i$ that satisfy $\|z_i\|_2 \leq 1$ and $y = \sum_{i=1}^m d_i (\tilde{X}z_i)_+$, we have that
$$\langle \nu, y \rangle = \sum_{i=1}^{n+1} c_i = \sum_{i=1}^m d_i \nu^T (\tilde{X}z_i)_+ \leq \sum_{i=1}^m |d_i \nu^T (\tilde{X}z_i)_+| \leq \sum_{i=1}^m |d_i|,$$
meaning $\langle \nu, y \rangle$ is the optimal value, and any conic combination of $\{(\tilde{X}w_i)_+\}_{i=1}^{n+1}$ that equals $y$ yields a solution.
Let's write $w_i = [a_i, b_i]^T$. The optimal interpolator then becomes
$$f(x) = \sum_{i=1}^{n+1} c_i (a_i x + b_i)_+,$$
for $c_i$ satisfying equation 20. The last thing to check is that there are infinitely many such interpolators. This is slightly different from $y$ having infinitely many different conic combination expressions in terms of $\{(\tilde{X}w_i)_+\}_{i=1}^{n+1}$, because different $c_i$ may correspond to the same function.
As a final step, we show that we indeed have infinitely many different interpolators. Recall that $\{s_1, s_2, \cdots, s_n, s_n - v_n\} = \{w_1, w_2, \cdots, w_{n+1}\}$. Note that $s_{1,1}, s_{2,1}, \cdots, s_{n,1} \geq 0$ and $s_{n,1} - v_{n,1} < 0$.
Without loss of generality let $s_i = w_i$ for $i \in [n]$ and $s_n - v_n = w_{n+1}$. As $x \to -\infty$, the slope of $f$ is $c_{n+1} a_{n+1}$. It is therefore enough to show that different conic representations of $y$ have different $c_{n+1}$ values. The key observation is that the vectors $\{(\tilde{X}w_i)_+\}_{i=1}^n$ are linearly independent, because they have $D_2, D_3, \cdots, D_{n+1}$ as arrangement patterns and for each of $s_1, s_2, \cdots, s_n$, strict inequality holds. Hence, two different conic representations of $y$ must have different $c_{n+1}$. This means the slope as $x \to -\infty$ is different, and we indeed have infinitely many optimal interpolators.
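To make the construction concrete, here is a small numerical sketch for a hypothetical instance with $n = 3$, where $s_1, s_2, s_3$ are chosen (as an assumption for illustration, not taken from the paper) to be the unit vectors at angles 30, 60, and 90 degrees. The sketch checks that the four directions attain the value 1 in equation 19, that a grid search over unit vectors does not exceed 1, and that two different positive weight vectors give the same $y$ with the same cost but different $c_{n+1}$.

```python
import numpy as np

# Hypothetical n = 3 instance: s_1, s_2, s_3 at 30, 60, 90 degrees.
ang = np.deg2rad([30.0, 60.0, 90.0])
s = np.stack([np.cos(ang), np.sin(ang)], axis=1)   # rows s_1, s_2, s_3
v3, v2, v1 = s[0], s[1] - s[0], s[2] - s[1]        # s_k = v_n + ... + v_{n-k+1}
V = np.stack([v1, v2, v3])                         # rows v_1, v_2, v_3
x = V[:, 0] / V[:, 1]                              # x_i = v_{i,1} / v_{i,2}
nu = V[:, 1]                                       # nu_i = v_{i,2}
Xt = np.column_stack([x, np.ones(3)])              # X tilde = [x | 1]

# Grid search of nu^T (Xt u)_+ over unit vectors u (0.1 degree steps).
theta = np.deg2rad(np.arange(0, 3600) / 10)
U = np.stack([np.cos(theta), np.sin(theta)], axis=1)
grid_max = float(np.max(np.maximum(U @ Xt.T, 0.0) @ nu))

# The n + 1 claimed maximizers w_1..w_4 = s_1, s_2, s_3, s_3 - v_3.
W = np.vstack([s, s[2] - v3])
M = np.maximum(Xt @ W.T, 0.0)                      # columns are (Xt w_i)_+
attained = nu @ M                                  # each entry should be 1

# Two different positive weight vectors with the same y: move along the null space of M.
c = np.ones(4)
z = np.linalg.svd(M)[2][-1]                        # M is 3 x 4, so z spans its null space
c_alt = c + 0.1 * z / np.max(np.abs(z))
y, y_alt = M @ c, M @ c_alt
print(grid_max, attained, c_alt, c.sum(), c_alt.sum())
```

Since $\nu^T (\tilde{X}w_i)_+ = 1$ for every $i$, the cost $\sum_i c_i = \langle \nu, y \rangle$ is the same for both weight vectors, while the differing last weight changes the slope of the interpolator as $x \to -\infty$.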

Section: F PROOFS IN SECTION 4
In this section, we prove how our results can be generalized to different architectures and setups. We begin by describing a general solution set.
Theorem F.1. Consider the cone-constrained group LASSO with regularization $R_i$,
$$\min_{\theta_i \in C_i \cap V_i,\ s_i \in D_i} L\Big(\sum_{i=1}^P A_i \theta_i + \sum_{i=1}^Q B_i s_i,\ y\Big) + \beta \sum_{i=1}^P R_i(\theta_i). \qquad (21)$$
Assume $\theta_i, s_j \in \mathbb{R}^d$, $A_i, B_j \in \mathbb{R}^{n \times d}$, $R_i : V_i \to \mathbb{R}$ are norms, $C_i, D_j \subseteq \mathbb{R}^d$ are proper cones for $i \in [P]$, $j \in [Q]$, $C_i \cap V_i \neq \emptyset$ for $i \in [P]$, and $\beta > 0$. The optimal set $P^*_{gen}$ is given as
$$P^*_{gen} = \Big\{ (c_i \bar{\theta}_i)_{i=1}^P \oplus (s_i)_{i=1}^Q \ \Big|\ c_i \geq 0,\ \sum_{i=1}^P c_i A_i \bar{\theta}_i + \sum_{i=1}^Q B_i s_i \in C_y,\ \bar{\theta}_i \in \mathrm{Zer}(F(S_i, A_i^T \nu^*, -\beta, \langle , \rangle)),\ \langle B_i^T \nu^*, s_i \rangle = 0,\ s_i \in D_i \Big\}, \qquad (22)$$
where $\nu^*$ is any vector that minimizes $f^*(\nu)$ subject to the constraint
$$\min_{u \in C_i \cap V_i} \langle A_i^T \nu, u \rangle + \beta R_i(u) = 0, \qquad \min_{s \in D_j} \langle B_j^T \nu, s \rangle = 0,$$
for all $i \in [P]$, $j \in [Q]$, $F(S, v, -\beta, \langle , \rangle) = \{u \mid u \in S,\ \langle v, u \rangle = -\beta\}$, $S_i = C_i \cap \{u \mid R_i(u) \leq 1\}$, and $\mathrm{Zer}(S) = \{0\}$ if $S = \emptyset$, $S$ otherwise. Here, $f(\cdot) = L(\cdot, y)$ and $f^*$ denotes the Fenchel conjugate of $f$.
Proof. Suppose the optimal set of the problem in equation 21 is $\Theta^*_{gen}$. We show that $\Theta^*_{gen} \subseteq P^*_{gen}$ and vice versa. Suppose $(\theta^*, s^*) = (\theta^*_i)_{i=1}^P \oplus (s^*_i)_{i=1}^Q \in \Theta^*_{gen}$. We know that $\sum_{i=1}^P A_i \theta^*_i + \sum_{i=1}^Q B_i s^*_i \in C_y$, hence it satisfies the second condition for $w^* = \sum_{i=1}^P A_i \theta^*_i + \sum_{i=1}^Q B_i s^*_i$. Also, consider the convex optimization problem
$$\min_{w,\ \theta_i \in C_i \cap V_i,\ s_i \in D_i} L(w, y) + \beta \sum_{i=1}^P R_i(\theta_i) \quad \text{subject to} \quad \sum_{i=1}^P A_i \theta_i + \sum_{i=1}^Q B_i s_i = w,$$
and its Lagrangian
$$\mathcal{L}(w, \theta, s, \nu) = L(w, y) - \nu^T w + \sum_{i=1}^P \big(\langle A_i^T \nu, \theta_i \rangle + \beta R_i(\theta_i)\big) + \sum_{i=1}^Q \langle B_i^T \nu, s_i \rangle. \qquad (23)$$
The strong duality argument is essentially the same as in the proof of Theorem 1. Moreover, for the dual problem $\max_\nu \min_{w,\ \theta_i \in C_i \cap V_i,\ s_i \in D_i} \mathcal{L}(w, \theta, s, \nu)$, if $\min_{u \in C_i \cap V_i} \langle A_i^T \nu, u \rangle + \beta R_i(u) < 0$, we can scale $u$ to be arbitrarily large so that the minimum is $-\infty$. The same holds when $\min_{u \in D_i} \langle B_i^T \nu, u \rangle < 0$. Hence, these cases cannot maximize the dual objective, and the dual problem can be written as
$$\max_{\substack{\min_{u \in C_i \cap V_i} \langle A_i^T \nu, u \rangle + \beta R_i(u) = 0 \\ \min_{u \in D_i} \langle B_i^T \nu, u \rangle = 0}} \ \min_w\ L(w, y) - \nu^T w \ = \max_{\substack{\min_{u \in C_i \cap V_i} \langle A_i^T \nu, u \rangle + \beta R_i(u) = 0 \\ \min_{u \in D_i} \langle B_i^T \nu, u \rangle = 0}} -f^*(\nu),$$
meaning $\nu^*$ is the dual optimal point. When strong duality holds, for any primal optimal point $(w^*, \theta^*, s^*)$ and the dual optimal point $\nu^*$, the Lagrangian $\mathcal{L}(w, \theta, s, \nu^*)$ attains its minimum at $(w^*, \theta^*, s^*)$. Substitute $\nu^*$ in equation 23 to see that each $\theta^*_i$ is a minimizer of the problem $\min \langle A_i^T \nu^*, u \rangle + \beta R_i(u)$ subject to $u \in C_i \cap V_i$. One thing to notice is that the value $\langle A_i^T \nu^*, \theta^*_i \rangle + \beta R_i(\theta^*_i) = 0$, because if it is strictly smaller than 0 we can strictly decrease the objective $\langle A_i^T \nu^*, u \rangle + \beta R_i(u)$ with $u = 2\theta^*_i$.
Published as a conference paper at ICLR 2025
If $\theta^*_i = 0$, we can choose $c_i = 0$ to find $c_i$, $\bar{\theta}_i \in \mathrm{Zer}(F(S_i, A_i^T \nu^*, -\beta))$. If $\theta^*_i \neq 0$, we know that $R_i(\theta^*_i) \neq 0$, and the vector $\theta^*_i / R_i(\theta^*_i)$ satisfies $\theta^*_i / R_i(\theta^*_i) \in S_i$ and $\langle A_i^T \nu^*, \theta^*_i / R_i(\theta^*_i) \rangle = -\beta$. Choose $c_i = R_i(\theta^*_i)$, $\bar{\theta}_i = \theta^*_i / R_i(\theta^*_i)$ to find $c_i$, $\bar{\theta}_i \in \mathrm{Zer}(F(S_i, A_i^T \nu^*, -\beta))$. For the $s^*_i$, we know that each $s^*_i$ is a minimizer of the problem $\min \langle B_i^T \nu^*, u \rangle$ subject to $u \in D_i$, hence it is in $D_i$ and the value $\langle B_i^T \nu^*, s^*_i \rangle = 0$.
To conclude, for any $(\theta^*, s^*)$, clearly $\sum_{i=1}^P A_i \theta^*_i + \sum_{j=1}^Q B_j s^*_j \in C_y$ and $s^*_i \in D_i$, $\langle B_i^T \nu^*, s^*_i \rangle = 0$. Choose $c_i = 0$ when $\theta^*_i = 0$, and $c_i = R_i(\theta^*_i)$, $\bar{\theta}_i = \theta^*_i / R_i(\theta^*_i)$ otherwise, to see that $(\theta^*, s^*) \in P^*_{gen}$, and $\Theta^*_{gen} \subseteq P^*_{gen}$.
Now, let's take an element $(\theta, s) \in P^*_{gen}$. We know that $\theta_i \in C_i \cap V_i$ and $s_i \in D_i$. If $\bar{\theta}_i \neq 0$, we know that $\langle \nu^*, A_i \bar{\theta}_i \rangle = -\beta$ as $\bar{\theta}_i \in F(S_i, A_i^T \nu^*, -\beta)$. Moreover, $\bar{\theta}_i$ is the solution to
$$\min_{u \in C_i \cap V_i,\ R_i(u) \leq 1} \langle A_i^T \nu^*, u \rangle,$$
because for all $u \in S_i$, $\langle A_i^T \nu^*, u \rangle \geq -\beta R_i(u) \geq -\beta$ holds. This means $R_i(\bar{\theta}_i) = 1$ for all $\bar{\theta}_i \neq 0$, as the minimum is attained at a nonzero point, hence on the boundary where $R_i(u) = 1$. Using $\langle \nu^*, A_i \bar{\theta}_i \rangle = -\beta$ and $\langle B_i^T \nu^*, s_i \rangle = 0$, we get
$$\langle \nu^*, w' \rangle = \Big\langle \nu^*, \sum_{i=1}^P c_i A_i \bar{\theta}_i + \sum_{i=1}^Q B_i s_i \Big\rangle = -\beta \sum_{\bar{\theta}_i \neq 0} c_i,$$
for some $w' \in C_y$. On the other hand, from $R_i(\bar{\theta}_i) = 1$ for all nonzero $\bar{\theta}_i$, we know that
$$\sum_{i=1}^P R_i(c_i \bar{\theta}_i) = \sum_{\bar{\theta}_i \neq 0} c_i.$$
This leads to the fact that for $(\theta, s)$,
$$L\Big(\sum_{i=1}^P A_i \theta_i + \sum_{i=1}^Q B_i s_i,\ y\Big) + \beta \sum_{i=1}^P R_i(\theta_i) = L(w', y) + \beta \sum_{\bar{\theta}_i \neq 0} c_i = L(w', y) - \langle \nu^*, w' \rangle.$$
Finally, we show that for all $w' \in C_y$,
$$L(w', y) - \langle \nu^*, w' \rangle = \min_{\theta_i \in C_i \cap V_i,\ s_i \in D_i} L\Big(\sum_{i=1}^P A_i \theta_i + \sum_{i=1}^Q B_i s_i,\ y\Big) + \beta \sum_{i=1}^P R_i(\theta_i).$$
This follows from the fact that for $(\theta', s') \in \Theta^*_{gen}$ that satisfies $w' = \sum_{i=1}^P A_i \theta'_i + \sum_{i=1}^Q B_i s'_i$, the point $(w', \theta', s')$ is a minimizer of $\mathcal{L}(w, \theta, s, \nu^*)$. Hence, each $\theta'_i$ is a minimizer of the problem $\min \langle A_i^T \nu^*, u \rangle + \beta R_i(u)$ subject to $u \in C_i \cap V_i$, which means that $\beta R_i(\theta'_i) = -\langle \nu^*, A_i \theta'_i \rangle$ for all $i \in [P]$, as $\nu^*$ satisfies $\min_{u \in C_i \cap V_i} \langle A_i^T \nu^*, u \rangle + \beta R_i(u) = 0$.
Also, $\langle \nu^*, B_i s'_i \rangle = 0$ as $s'_i$ minimizes $\langle B_i^T \nu^*, s \rangle$ subject to $s \in D_i$, and we see that
$$\beta \sum_{i=1}^P R_i(\theta'_i) = -\langle \nu^*, w' \rangle,$$
and
$$\min_{\theta_i \in C_i \cap V_i,\ s_i \in D_i} L\Big(\sum_{i=1}^P A_i \theta_i + \sum_{i=1}^Q B_i s_i,\ y\Big) + \beta \sum_{i=1}^P R_i(\theta_i) = L\Big(\sum_{i=1}^P A_i \theta'_i + \sum_{i=1}^Q B_i s'_i,\ y\Big) + \beta \sum_{i=1}^P R_i(\theta'_i) = L(w', y) - \langle \nu^*, w' \rangle,$$
meaning $(\theta, s) \in \Theta^*_{gen}$ because $L(\sum_{i=1}^P A_i \theta_i + \sum_{i=1}^Q B_i s_i, y) + \beta \sum_{i=1}^P R_i(\theta_i) = L(w', y) - \langle \nu^*, w' \rangle$.
This means $P^*_{gen} \subseteq \Theta^*_{gen}$, and finishes the proof.
One application of the theorem is characterizing the optimal set of the interpolation problem. This leads to the staircase of connectivity for interpolation problems.
Proposition F.1. The solution set of the optimization problem
$$\min_{u_i, v_i \in K_i} \sum_{i=1}^P \|u_i\|_2 + \|v_i\|_2 \quad \text{subject to} \quad \sum_{i=1}^P D_i X(u_i - v_i) = y,$$
is given as
$$P^* := \Big\{ (c_i \bar{u}_i, d_i \bar{v}_i)_{i=1}^P \ \Big|\ c_i, d_i \geq 0\ \forall i \in [P],\ \sum_{i=1}^P D_i X \bar{u}_i c_i - D_i X \bar{v}_i d_i = y \Big\} \subseteq \mathbb{R}^{2dP},$$
where
D i Xu| ≤ ∥u∥ 2 ∀u ∈ K i , i ∈ [P ].
Here,
S i = K i ∩ {u | ∥u∥ 2 ≤ 1}.
Proof. Let's apply Theorem F.1 to the problem. Note that we can set $\beta = 1$. In fact, $\beta$ can be arbitrary, and scaling $\nu^*$ to make $\beta = 1$ leads to the same result.
When we apply Theorem F.1, we have that
$$P^*_{gen} = \Big\{ (c_i \bar{u}_i, d_i \bar{v}_i)_{i=1}^P \ \Big|\ c_i, d_i \geq 0,\ \sum_{i=1}^P D_i X \bar{u}_i c_i - D_i X \bar{v}_i d_i = y,\ \bar{u}_i \in \mathrm{Zer}(F(S_i, X^T D_i \nu^*, -1)),\ \bar{v}_i \in \mathrm{Zer}(F(S_i, -X^T D_i \nu^*, -1)) \Big\},$$
where $\nu^*$ is the dual optimum that minimizes $L^*(\cdot, y)$ subject to the constraint
$$\min_{u \in K_i} \langle X^T D_i \nu, u \rangle + \|u\|_2 = 0, \qquad \min_{u \in K_i} \langle -X^T D_i \nu, u \rangle + \|u\|_2 = 0, \qquad (24)$$
for all $i \in [P]$. We know that $L^*(\nu) = \langle \nu, y \rangle$, and equation 24 can be rewritten as $|\nu^T D_i X u| \leq \|u\|_2$. Also, $F(S_i, X^T D_i \nu^*, -1) = \{0\}$ if there is no $u$ such that $\nu^{*T} D_i X u = -1$, and is exactly that vector if it exists. Note that as $\min_{u \in K_i} \langle X^T D_i \nu, u \rangle + \|u\|_2 = 0$, the optimal direction is unique (Proposition C.2). The same holds for $\bar{v}_i$.
(ii) $m \geq m^* + 1$, there exist optimal points $A, A' \in \Theta^*(m)$ and a path in $\Theta^*(m)$ connecting them.
(iii) $m = M^*$, $\Theta^*(m)$ is not a connected set. Moreover, there exists $A \in \Theta^*(m)$ which is an isolated point, i.e. there is no path in $\Theta^*(m)$ that connects $A$ with $A' \neq A \in \Theta^*(m)$.
, because the description of the optimal polytope is identical except for the directions at which the solutions are fixed. The same solution map can be applied because it preserves both the fit and the regularization. Continuity is preserved, and Proposition D.12, Proposition D.13, and Proposition D.14 carry over. The mapping in Proposition D.7 can also be applied here.
Another implication of the theorem is that for free skip connections, the dual variable has to satisfy $X^T \nu = 0$. The existence of free skip connections constrains the freedom of $\nu$, which brings a qualitative difference to the uniqueness of the solution set.
Proposition F.3. The solution set of the optimization problem
$$\min_{u_i, v_i \in K_i} \sum_{i=1}^P \|u_i\|_2 + \|v_i\|_2 \quad \text{subject to} \quad Xu_0 + \sum_{i=1}^P D_i X(u_i - v_i) = y,$$
is given as
$$P^* := \Big\{ u_0 \oplus (c_i \bar{u}_i, d_i \bar{v}_i)_{i=1}^P \ \Big|\ c_i, d_i \geq 0\ \forall i \in [P],\ Xu_0 + \sum_{i=1}^P D_i X \bar{u}_i c_i - D_i X \bar{v}_i d_i = y \Big\} \subseteq \mathbb{R}^{2dP},$$
where $\bar{u}_i, \bar{v}_i$ are fixed directions found by solving the optimization problem
$$\bar{u}_i = \begin{cases} \arg\min_{u \in S_i} \nu^{*T} D_i X u & \text{if } \min_{u \in S_i} \nu^{*T} D_i X u = -1, \\ 0 & \text{otherwise}, \end{cases} \qquad \bar{v}_i = \begin{cases} \arg\min_{v \in S_i} -\nu^{*T} D_i X v & \text{if } \min_{v \in S_i} -\nu^{*T} D_i X v = -1, \\ 0 & \text{otherwise}, \end{cases}$$
where $\nu^*$ is the dual optimum that satisfies $\nu^* = \arg\min \langle \nu, y \rangle$ subject to $|\nu^T D_i X u| \leq \|u\|_2\ \forall u \in K_i$, $i \in [P]$, $X^T \nu = 0$. Here, $S_i = K_i \cap \{u \mid \|u\|_2 \leq 1\}$.
When we use block notation $(\nu^*)^T = [(\nu^*_1)^T\ (\nu^*_2)^T\ \cdots\ (\nu^*_c)^T]$, $u^T = [(u_1)^T, (u_2)^T, \cdots, (u_c)^T]$ for $\nu^* \in \mathbb{R}^{nc}$, $u \in \mathbb{R}^{dc}$, we can see that
$$(\nu^*)^T A_i u = \sum_{j=1}^c (\nu^*_j)^T D_i X u_j = \langle \mathrm{Fl}^{-1}_{nc}(\nu^*), D_i X\, \mathrm{Fl}^{-1}_{dc}(u) \rangle_M,$$
using notations for the matrix inner product. Hence, we can see that
$$\mathrm{Fl}^{-1}_{dc}\big(\mathcal{F}(\mathrm{Fl}_{dc}(K_i), A_i^T \nu^*, -\beta, \langle , \rangle_M)\big) = K_i \cap \{U \mid \langle \mathrm{Fl}^{-1}_{nc}(\nu^*), D_i X U \rangle = -\beta\} = \mathcal{F}(K_i, X^T D_i N^*, -\beta, \langle , \rangle_M),$$
and for $(\theta_i)_{i=1}^P \in P^*_{flat}$, $\mathrm{Fl}^{-1}_{dc}(\theta_i)$ satisfies equation 26 and we also have the fact that
$$\mathrm{Fl}^{-1}_{dc}(\bar{\theta}_i) \in \mathcal{F}(K_i, X^T D_i N^*, -\beta, \langle , \rangle_M).$$
Hence we arrive at the desired result.
Proposition F.5. Assume $m \geq m^*$ so that the nonconvex problem in equation 11 and its convex reformulation in equation 25 are equivalent. The solution set of the vector-valued problem
$$\min_{\{w_i, z_i\}_{i=1}^m} \frac{1}{2}\Big\|\sum_{i=1}^m (Xw_i)_+ z_i^T - Y\Big\|_2^2 + \frac{\beta}{2}\sum_{i=1}^m \|w_i\|_2^2 + \|z_i\|_2^2,$$
where $w_i \in \mathbb{R}^{d \times 1}$, $z_i \in \mathbb{R}^{c \times 1}$, is given as
$$S = \Big\{ (w_i, z_i)_{i=1}^m \ \Big|\ \phi((w_i, z_i)_{i=1}^m) \in P^*_{vec},\ R((w_i, z_i)_{i=1}^m) = \|\phi((w_i, z_i)_{i=1}^m)\|_{K_i,*},\ \|w_i\|_2 = \|z_i\|_2\ \forall i \in [m] \Big\},$$
where
$$\phi((w_i, z_i)_{i=1}^m) = (V_i)_{i=1}^P, \qquad V_p = \begin{cases} 0 & \text{if } \nexists\, w_i \text{ s.t. } D_p = \mathrm{diag}(1(Xw_i \geq 0)), \\ \sum_{j=1}^{t_p} w_{a_j} z_{a_j}^T & \text{if } D_p = \mathrm{diag}(1(Xw_{a_j} \geq 0)) \text{ for } j \in [t_p], \end{cases}$$
$$R((w_i, z_i)_{i=1}^m) = (R_i)_{i=1}^P, \qquad R_p = \begin{cases} 0 & \text{if } \nexists\, w_i \text{ s.t. } D_p = \mathrm{diag}(1(Xw_i \geq 0)), \\ \sum_{j=1}^{t_p} \|w_{a_j}\|_2 \|z_{a_j}\|_2 & \text{if } D_p = \mathrm{diag}(1(Xw_{a_j} \geq 0)) \text{ for } j \in [t_p], \end{cases}$$
and $P^*_{vec}$ is defined in Proposition F.4.
Proof. Let's denote the solution set of equation 11 by $\Theta^*$. We will prove that $\Theta^* = S$. First, take a point $(w^*_i, z^*_i)_{i=1}^m$ in $\Theta^*$. When $\phi((w^*_i, z^*_i)_{i=1}^m) = (V^*_i)_{i=1}^P$, we know that
$$\sum_{i=1}^m (Xw^*_i)_+ (z^*_i)^T = \sum_{i=1}^P D_i X V^*_i,$$
hence the $l_2$ error is the same for both parameters. Also, we have that
$$\sum_{i=1}^P \|V^*_i\|_{K_i,*} \leq \sum_{i=1}^m \|w^*_i\|_2 \|z^*_i\|_2 = \frac{1}{2}\sum_{i=1}^m \|w^*_i\|_2^2 + \|z^*_i\|_2^2.$$
Thus, when we write $L_{noncvx}$ for the loss function of equation 11 and $L_{cvx}$ for the loss function of equation 25, we have that
$$L_{noncvx}((w^*_i, z^*_i)_{i=1}^m) \geq L_{cvx}(\phi((w^*_i, z^*_i)_{i=1}^m)), \qquad (27)$$
holds in general. As the minimal values of $L_{noncvx}$ and $L_{cvx}$ are the same, we have that $\phi((w^*_i, z^*_i)_{i=1}^m) \in P^*_{vec}$. Also, the inequality in equation 27 is actually an equality, and we have $R((w_i, z_i)_{i=1}^m) = (\|V_i\|_{K_i,*})_{i=1}^P$.
Now we take a point $(w_i, z_i)_{i=1}^m$ in $S$. We know that $L_{cvx}(\phi((w_i, z_i)_{i=1}^m))$ is the optimal value. Also, we know that $L_{noncvx}((w_i, z_i)_{i=1}^m) = L_{cvx}(\phi((w_i, z_i)_{i=1}^m))$ because $R((w_i, z_i)_{i=1}^m) = (\|\phi((w_i, z_i)_{i=1}^m)\|_{K_i,*})_{i=1}^P$ and $\|w_i\|_2 = \|z_i\|_2\ \forall i \in [m]$. Finally, since $m \geq m^*$, the two optimal values are the same, which implies that $(w_i, z_i)_{i=1}^m \in \Theta^*$.
Theorem F.2. Consider a L -layer neural network ∥W i ∥ 2 F , and denote its optimal set as $\Theta^*$. We can characterize a subset of $\Theta^*$, namely the set
$$\Theta^*_{k-1,k}(Y', W'_1, W'_2, \cdots, W'_{k-2}, W'_{k+1}, \cdots, W'_L) := \Big\{ \theta = (W'_i)_{i=1}^{k-2} \oplus (W_{k-1}, W_k) \oplus (W'_i)_{i=k+1}^L \ \Big|\ \theta \in \Theta^*,\ (\tilde{X}W_{k-1})_+ W_k = Y' \Big\}.$$
Here, $\tilde{X} = ((((XW'_1)_+ W'_2)_+) \cdots W'_{k-2})_+$. The expression of $\Theta^*_{k-1,k}(Y', W'_1, W'_2, \cdots, W'_{k-2}, W'_{k+1}, \cdots, W'_L)$ is given as
$$\Big\{ \theta = (W'_i)_{i=1}^{k-2} \oplus (W_{k-1}, W_k) \oplus (W'_i)_{i=k+1}^L \ \Big|\ \theta \in \Theta^*,\ \phi_{d_k}(W_{k-1}, W_k) \in P^*_{vec,intp},\ R_{d_k}(W_{k-1}, W_k) = \|\phi_{d_k}(W_{k-1}, W_k)\|_{K_i,*},\ \|(W_{k-1})_{\cdot,i}\|_2 = \|(W_k)_{i,\cdot}\|_2\ \forall i \in [d_k] \Big\},$$
where $\phi_m(A, B) = \phi((A_{\cdot,i}, B_{i,\cdot})_{i=1}^m)$, $R_m(A, B) = R((A_{\cdot,i}, B_{i,\cdot})_{i=1}^m)$ for $\phi$ defined in Proposition F.5, and $P^*_{vec,intp}$ is defined as
$$P^*_{vec,intp} = \Big\{ (c_i \bar{V}_i)_{i=1}^P \ \Big|\ c_i \geq 0,\ \sum_{i=1}^P c_i D_i X \bar{V}_i = Y',\ \bar{V}_i \in \mathrm{Zer}(\mathcal{F}(K_i, X^T D_i N^*, -1, \langle , \rangle_M)) \Big\},$$
for the dual optimum $N^* \in \mathbb{R}^{n \times c}$ that minimizes $\langle N, Y \rangle_M$ subject to $\langle N, D_i X A \rangle + \beta \|A\|_{K_i,*} \geq 0\ \forall A \in \mathbb{R}^{d \times c}$, $i \in [P]$.
Here $K_i = \mathrm{conv}\{ug^T \mid (2D_i - I)Xu \geq 0,\ \|ug^T\|_* \leq 1\}$, where $D_i$ ranges over all possible arrangements $\mathrm{diag}(1(Xh \geq 0))$.
Proof. The result is an application of Theorem F.1 to the vector-valued interpolation problem
$$\sum_{i=1}^{d_k} \|u_i\|_2^2 + \|v_i\|_2^2, \quad \text{subject to} \quad \sum_{i=1}^{d_k} (Xu_i)_+ v_i^T = Y',$$
where $u_i \in \mathbb{R}^{d_{k-1} \times 1}$, $v_i \in \mathbb{R}^{d_{k+1} \times 1}$, and then applying Proposition F.5.
The characterization enables us to extend the connectivity result to vector-valued networks. 
$(Xw_i)_+ z_i^T - Y\|_2^2 + \frac{\beta}{2}\sum_{i=1}^m \|w_i\|_2^2 + \|z_i\|_2^2$, (28)
where w i ∈ R d , z i ∈ R c for i ∈ [m], and Y ∈ R n×c . If m ≥ nc + 1, the solution set in parameter space Θ * ⊆ R m(d+c) is connected.
Proof. Let's take two solutions $(w_i, z_i)_{i=1}^m, (w'_i, z'_i)_{i=1}^m \in \Theta^*$. We write $\bar{w}$ for the direction of $w$, i.e. $\bar{w} = w/\|w\|_2$ for $w \neq 0$. The first claim we prove is the following: for given $\{(X\bar{w}_{a_i})_+ \bar{z}_{a_i}^T\}_{i=1}^{m_1}$ and $\{(X\bar{w}'_{b_i})_+ \bar{z}'^T_{b_i}\}_{i=1}^{m_2}$, consider a conic combination with weights $c_i, d_i \geq 0$ that satisfies
$$\sum_{i=1}^{m_1} c_i (X\bar{w}_{a_i})_+ \bar{z}_{a_i}^T + \sum_{i=1}^{m_2} d_i (X\bar{w}'_{b_i})_+ \bar{z}'^T_{b_i} = Y^*$$
for the optimal model fit $Y^*$. Then $(\sqrt{c_i}\bar{w}_{a_i}, \sqrt{c_i}\bar{z}_{a_i})_{i=1}^{m_1} \oplus (\sqrt{d_i}\bar{w}'_{b_i}, \sqrt{d_i}\bar{z}'_{b_i})_{i=1}^{m_2} \oplus (0, 0)^{m - m_1 - m_2}$ is an optimal solution of equation 28 when $m_1 + m_2 \leq m$, given that $w_{a_i}, w'_{b_i} \neq 0$.
To see this, we first show that for the dual variable $N^*$, $\langle N^*, (X\bar{w}_i)_+ \bar{z}_i^T \rangle = -\beta$ for all $w_i \neq 0$. Suppose $D_p = \mathrm{diag}(1(Xw_{a_i} \geq 0))$ for $i \in [t_p]$, with the same notation as in the statement of Proposition F.5, and without loss of generality assume $a_1 = i$. As $(w_i, z_i)_{i=1}^m \in S$, we know $R((w_i, z_i)_{i=1}^m) = (\|V_i\|_{K_i,*})_{i=1}^P$. Hence, when we write $V_p = c_p \bar{V}_p$ for some $\bar{V}_p \in F(K_p, X^T D_p N^*, -\beta, \langle , \rangle_M)$, we first know that $V_p = \sum_{j=1}^{t_p} \|w_{a_j}\|_2 \|z_{a_j}\|_2\, \bar{w}_{a_j} \bar{z}_{a_j}^T$. We can find such $\bar{V}_p$ because if $V_p = 0$, we could set all $w_{a_i} = 0$, which would strictly decrease the objective. Note that $\|\bar{V}_p\|_{K_p,*} = 1$, yielding $c_p = \|V_p\|_{K_p,*} = \sum_{j=1}^{t_p} \|w_{a_j}\|_2 \|z_{a_j}\|_2$. Hence, for all $\bar{w}_{a_j} \bar{z}_{a_j}^T$, we have that $\langle N^*, D_p X \bar{w}_{a_j} \bar{z}_{a_j}^T \rangle = -\beta$ for $j \in [t_p]$. This implies that for all $i \in [m]$ with $w_i \neq 0$, $\langle N^*, (X\bar{w}_i)_+ \bar{z}_i^T \rangle = -\beta$, and the same holds for $w'_i \neq 0$.
Now we are ready to prove the claim. We first know that the regression error is the same, as we have the same model fit $Y^*$. The regularization term is given as
$$\beta\Big(\sum_{i=1}^{m_1} c_i + \sum_{i=1}^{m_2} d_i\Big) = -\langle N^*, Y^* \rangle.$$
Hence, the cost of the problem is the same for any choice of the conic combination, and $(\sqrt{c_i}\bar{w}_{a_i}, \sqrt{c_i}\bar{z}_{a_i})_{i=1}^{m_1} \oplus (\sqrt{d_i}\bar{w}'_{b_i}, \sqrt{d_i}\bar{z}'_{b_i})_{i=1}^{m_2}$ is optimal when $m_1 + m_2 \leq m$.
Finally, suppose $m \geq nc + 1$. Note that the matrices $\{(Xw_i)_+ z_i^T\}_{i=1}^m$ and $\{(Xw'_i)_+ z'^T_i\}_{i=1}^m$ lie in an $nc$-dimensional space. As any conic combination that sums up to $Y^*$ gives a solution, we can first prune both solutions to make them linearly independent, and then connect the two using the same idea introduced in Theorem D.1.
Corollary F.1. (Corollary 4 of the paper) Consider the optimization problem in equation 11. Suppose m ≥ nc + 1 and denote the optimal set of equation 11 as Θ * (m). For any θ := (w i , z i ) m i=1 ∈ R (d+c)m , there exists a continuous path from θ to any point θ * ∈ Θ * (m) with nonincreasing loss.
Proof. The proof is identical to that of Corollary 2. From Haeffele & Vidal (2017), we know that when $m \geq nc + 1$, the vector-valued training problem in equation 11 has no spurious local minima, i.e. all local minima are global. Now, from any $\theta$, move to a local minimum along a path with nonincreasing loss; this local minimum is global. As $\Theta^*(m)$ is connected, we can then arrive at any global minimum along a path with nonincreasing loss.
Finally, we extend our theory to parallel neural networks with depth 3. We have an optimal polytope characterization that states the first-layer weights have a finite set of fixed possible directions. 
$$W_1 = [\,v_1/\|v_1\|_2 \mid \cdots \mid v_{m_1}/\|v_{m_1}\|_2\,], \qquad w_2 = [\,\|v_1\|_2, \cdots, \|v_{m_1}\|_2\,]^T.$$
With this change of variables, the cones are written as
$$K_v(D_i(m_1), s, D'_j) = \Big\{ (v_k)_{k=1}^{m_1} \ \Big|\ (2D_{ik} - I)s_k X v_k \geq 0\ \ \forall k \in [m_1],\ (2D'_j - I)\sum_{k=1}^{m_1} D_{ik} X v_k \geq 0 \Big\},$$
which is a fixed cone in $\mathbb{R}^d$. Hence, $Q_X$ can be rewritten as
$$Q_X = \bigcup_{i=1}^{\binom{P_1}{m_1}} \bigcup_{s \in \{-1,1\}^{m_1}} \bigcup_{j=1}^{P_2(i)} \Big\{ \sum_{k=1}^{m_1} D'_j D_{ik} X v_k \ \Big|\ (v_k)_{k=1}^{m_1} \in K_v(D_i(m_1), s, D'_j),\ \sum_{k=1}^{m_1} \|v_k\|_2 \leq 1 \Big\}.$$
As a result, we have found a piecewise linear expression of $Q_X$. When $y = 0$, we know that the optimal weights are all zeros. If not, we know that the problem minimize $t \geq 0$ subject to $y \in tC$ has a dual variable $\nu^*$ that satisfies: if $y = t^*(\sum_{i=1}^m \lambda_i c_i)$ for some $c_i \in C$ with $\lambda_i > 0$, $\sum_{i=1}^m \lambda_i = 1$, all $c_i$ are minimizers of $(\nu^*)^T c$ subject to $c \in C$. To see this fact, consider the supporting hyperplane at $y$. We can find a vector that satisfies $\langle \nu^*, y \rangle \leq \langle \nu^*, t^* c \rangle$ for all $c \in C$, i.e. $\langle \nu^*, y \rangle / t^* \leq \langle \nu^*, c \rangle$ for all $c \in C$. Write $y = t^*(\sum_{i=1}^m \lambda_i c_i)$ and take the inner product with $\nu^*$ to see the desired result. More specifically, we have that $\lambda_i \langle \nu^*, y \rangle \leq \lambda_i \langle \nu^*, t^* c_i \rangle$ for all $i \in [m]$, and adding them shows that the inequalities are actually equalities, so $\langle \nu^*, y \rangle = \langle \nu^*, t^* c_i \rangle$ for all $i \in [m]$.
Hence, writing $C$ for $\mathrm{Conv}(Q_X \cup -Q_X)$, there exists a dual variable $\nu^*$ such that the optimal $(W_1, w_2)$ must lie in the set $\arg\max_{\|W_1\|_F \leq 1,\ \|w_2\|_2 \leq 1} |(\nu^*)^T((XW_1)_+ w_2)_+|$. For each constraint set
$$(v_k)_{k=1}^{m_1} \in K_v(D_i(m_1), s, D'_j), \qquad \sum_{k=1}^{m_1} \|v_k\|_2 \leq 1,$$
we are optimizing a linear function over this set (as the ReLU expression is a linear function of $(v_k)_{k=1}^{m_1}$). If there exist two different maximizers $(v_k)_{k=1}^{m_1}, (v'_k)_{k=1}^{m_1}$ of the problem, the average of the two is still in the cone and satisfies the norm constraint strictly. Say $(v''_k)_{k=1}^{m_1}$ is the average of the two solutions: the value of the cost function (which is either $(\nu^*)^T \sum_{k=1}^{m_1} D'_j D_{ik} X v_k$ or its negation) is the same, but $\sum_{k=1}^{m_1} \|v''_k\|_2 < 1$. Multiplying by $1/\sum_{k=1}^{m_1} \|v''_k\|_2$ leads to a contradiction with optimality. Hence, for a fixed cone $K_v(D_i(m_1), s, D'_j)$, the optimal $(v_k)_{k=1}^{m_1}$ is fixed. As $v_k = (W_1)_{\cdot,k} w_{2k}$, the directions of the columns of $W_1$ are fixed to a finite set of values.

Section: G ADDITIONAL DISCUSSIONS
In this section, we discuss the geometrical intuition behind the dual optimum and non-unique solutions, and also explain why the assumption of Simsek et al. (2021) might not hold in our case.
The specific problem of interest is interpolating the dataset {(-√ 3, 1), ( √ 3, 1)} with a two-layer neural network with bias. We want to find a minimum-norm interpolator, where the cost function also includes regularizing the bias. We can write the problem as See Pilanci & Ergen (2020) for a similar "scaling trick".
In other words, when we denote $Q_X = \{(Xu)_+ \mid \|u\|_2 \le 1\}$, the problem becomes, as in Pilanci & Ergen (2020),
$$\min\ t \ge 0 \quad \text{subject to} \quad y \in t\,\mathrm{Conv}(Q_X \cup -Q_X).$$
Figure 9 shows the shape of $Q_X$ and its convex hull.

Figure 9: The shape of $\mathrm{Conv}(Q_X \cup -Q_X)$. We can see that the line $x + y = 2$ is tangent to the set $\{Xu \mid \|u\|_2 \le 1\}$, and meets the set $Q_X$ at the two points $(2, 0)$, $(0, 2)$. Hence, $\mathrm{Conv}(Q_X \cup -Q_X)$ is exactly the diamond $|x| + |y| \le 2$.
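The geometry described in Figure 9 can be checked numerically. The sketch below assumes the data matrix $X$ with rows $(-\sqrt{3}, 1)$ and $(\sqrt{3}, 1)$ from this example and samples $u$ only on the unit circle, which suffices because $(Xu)_+$ is positively homogeneous in $u$. The helper name `relu_image` and the sampling grid are illustrative choices.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)
X = np.array([[-SQRT3, 1.0], [SQRT3, 1.0]])   # last column is the bias term

def relu_image(angles_deg):
    """Rows are (Xu)_+ for u = (cos t, sin t), with t given in degrees."""
    t = np.deg2rad(np.asarray(angles_deg, dtype=float))
    U = np.stack([np.cos(t), np.sin(t)])      # shape (2, N)
    return np.maximum(X @ U, 0.0).T           # shape (N, 2)

# Sample the boundary of the unit ball on a 0.1-degree grid.
Q = relu_image(np.linspace(0.0, 360.0, 3601))

# Points of Q_X are nonnegative, so |x| + |y| = x + y on Q_X.
support = Q.sum(axis=1).max()

# Directions u at 150, 90 and 30 degrees give the three contact points.
touching = relu_image([150.0, 90.0, 30.0])

print("max of x + y over sampled Q_X:", support)
print("contact points:\n", touching)
```

The sampled maximum of $x + y$ agrees with the value 2 stated in the caption, and the three listed directions map to $(2, 0)$, $(1, 1)$ and $(0, 2)$.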
One thing to notice is that in Figure 9b, the line $x + y = 2$ meets $Q_X$ at three points, and the convex hull $\mathrm{Conv}(Q_X \cup -Q_X)$ is a diamond. The intuition of the dual variable is that it is the normal vector of a face where the optimal fit exists. In our case, $y = [1, 1]^T$ lies exactly on the line $x + y = 2$. Hence the dual optimum is $\nu^* = [1, 1]^T$. We can also construct different minimum-norm interpolators by linear combinations of the three green points in Figure 9b. We can express $y$ using only the middle point $(1, 1)$; here, the interpolator becomes the constant function $y = 1$. We can use the two points $(2, 0)$ and $(0, 2)$ to express $(1, 1)$; here, we have another interpolator that has two breakpoints. We can also use all three points, in which case there are infinitely many ways to express $(1, 1)$, which leads to a continuum of interpolators.
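The continuum of interpolators can be made concrete with a short sketch. It assumes unit-norm neurons $w = (\text{slope}, \text{bias})$ whose images $(Xw)_+$ are the three points $(2, 0)$, $(1, 1)$, $(0, 2)$, and it parametrizes the family by the weight `lam` placed on the middle point. The parametrization and the function name `interpolator` are illustrative assumptions, and the reported cost is the total coefficient mass $\sum_i |\alpha_i|$, which plays the role of $t$ in the gauge formulation.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)
x_data = np.array([-SQRT3, SQRT3])
y_data = np.array([1.0, 1.0])

# Unit-norm neurons (slope, bias) with ReLU images (2,0), (1,1), (0,2).
W = np.array([[-SQRT3 / 2, 0.5],
              [0.0, 1.0],
              [SQRT3 / 2, 0.5]])

def interpolator(lam):
    """Return (f, cost) for weight lam in [0, 1] on the middle point."""
    alpha = np.array([(1.0 - lam) / 2, lam, (1.0 - lam) / 2])

    def f(x):
        x = np.asarray(x, dtype=float)
        pre = np.outer(x, W[:, 0]) + W[:, 1]   # pre-activations, shape (N, 3)
        return np.maximum(pre, 0.0) @ alpha

    return f, np.abs(alpha).sum()

for lam in (0.0, 0.5, 1.0):
    f, cost = interpolator(lam)
    print(lam, f(x_data), cost, f(0.0))
```

Every `lam` in $[0, 1]$ fits both data points with the same cost, but the functions differ away from the data. For example, the value at $x = 0$ moves from $1/2$ when `lam = 0` (two breakpoints) to $1$ when `lam = 1` (the constant interpolator).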
The assumption in Simsek et al. (2021) that there exists a unique model with zero loss and minimal width does not work here. We can adapt it to the regularized case, and assume that there exists a unique interpolator with minimal width and a solution to the minimum-norm interpolation problem with constraint $\sum_{i=1}^m (Xw_i)_+ \alpha_i = y$. Here, $X = [x \mid 1] \in \mathbb{R}^{n \times 2}$. Now choose $x = [-\sqrt{3}, \sqrt{3}]$ as before, but choose $y = [1/2, 3/2]$. Then, there exist two ways to express $y$ as a conic combination of $(2, 0)$, $(1, 1)$, and $(0, 2)$ with two points. As $y$ is not parallel to $[2, 0]$, $[1, 1]$, or $[0, 2]$, we can see that $m^* = 2$ is minimal. Hence we don't have uniqueness of the smallest model in this case, and the results in Simsek et al. (2021) will not apply in general.
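The count of two-point representations can be verified directly. The sketch below takes the target $y = [1/2, 3/2]$, which is the value the source gives for this counterexample, and solves a $2 \times 2$ linear system for each pair of contact points, keeping only pairs with nonnegative coefficients. The function name `conic_pairs` and the tolerance are illustrative.

```python
import numpy as np
from itertools import combinations

points = {
    "(2,0)": np.array([2.0, 0.0]),
    "(1,1)": np.array([1.0, 1.0]),
    "(0,2)": np.array([0.0, 2.0]),
}
y = np.array([0.5, 1.5])

def conic_pairs(y, points, tol=1e-12):
    """Return {pair: coefficients} for pairs that express y conically."""
    out = {}
    for a, b in combinations(points, 2):
        A = np.column_stack([points[a], points[b]])
        coef = np.linalg.solve(A, y)          # each pair here is a basis
        if np.all(coef >= -tol):
            out[(a, b)] = coef
    return out

def is_parallel(p, q, tol=1e-12):
    """True when the 2D vectors p and q are parallel."""
    return abs(p[0] * q[1] - p[1] * q[0]) < tol

pairs = conic_pairs(y, points)
single = [name for name, p in points.items() if is_parallel(p, y)]

print("valid two-point representations:", pairs)
print("points parallel to y:", single)
```

No single point is parallel to $y$, and exactly two of the three pairs yield nonnegative coefficients, so the minimal width is attained by two distinct solutions.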

Section: ACKNOWLEDGEMENTS
This work was supported in part by the National Science Foundation (NSF) under CAREER award CCF-2236829, in part by the U.S. Army Research Office Early Career Award W911NF-21-1-0242, and in part by the Office of Naval Research under Grant N00014-24-1-2164.

Section: 
$P^*(m^*)$ is finite from Proposition D.2, we know that $\Theta^*(m^*)$ is also finite. Proof of ii) is rather simple: we know that the solution in $\Theta^*(m^*)$ will have a zero slot in $\Theta^*(m^* + 1)$, and by using the moving operation in the proof of Proposition D.12, we can show that we can move a neuron to the zero slot. Proof of iii) follows from Proposition D.3 and Proposition D.14. We can find an isolated point that has cardinality $M^*$ and is also canonical: the isolated point in Proposition D.3 has cardinality $M^*$ and is in $P^*_{irr}$. Say the isolated point is $(u_i, v_i)_{i=1}^P$. If $\mathrm{diag}(1(Xu_i \ge 0)) = \mathrm{diag}(1(Xu_j \ge 0)) = D_p$ for some $i \ne j$ and $u_i \ne 0$, $u_j \ne 0$, then $D_p X u_i$ and $D_p X u_j$ are collinear, which leads to a contradiction. Hence all patterns are different for $u$, $v$, and by appropriate rearrangement, we can find a canonical solution with cardinality $M^*$. Say that solution is $P_{can}$. Now we apply Proposition D.14 to see that
Proof of iv) almost directly follows from Proposition D.12. Note that when $w_1 = w_2 = \cdots = w_m$, we can prune the solution to connect it with a different solution, hence there is no isolated point. Proof of v) directly follows from Proposition D.5, Theorem D.1 and Proposition D.13. Note that the paths constructed in Proposition D.5 and Theorem D.1 have only finitely many cardinality changes, hence we can apply Proposition D.13. The solution set is not a singleton because $m \ge m^* + 1$.
Corollary D.1. (Corollary 2 of the paper) Consider the optimization problem in equation 3. Suppose m ≥ n + 1 and denote the optimal set of equation 3 as Θ * (m). For any θ := (w i , α i ) m i=1 ∈ R (d+1)m , there exists a continuous path from θ to any point θ * ∈ Θ * (m) with nonincreasing loss.
Proof. The proof almost directly follows from Haeffele & Vidal (2017) and Theorem D.1. First, we know the existence of a continuous path with nonincreasing loss from any point to any local minimum. We know that the local minimum is actually global from Haeffele & Vidal (2017). Now, we know that the set Θ * (m) is connected: hence, we can construct a continuous path from that global minimum to any global minimum we want. Note that we can apply the result of Haeffele and Vidal (Grohs & Kutyniok (2022)
are nondegenerate pairs.
Corollary D.2. (Corollary 3 of the paper) Consider the optimization problem in equation 3. Suppose $m \ge n + 1$ and denote the objective in equation 3 as $L(\theta)$, where $\theta := (w_i, \alpha_i)_{i=1}^m \in \mathbb{R}^{(d+1)m}$. Let the optimal value of equation 3 be $p^*$. For any $\lambda \ge p^*$, the sublevel set $\{\theta \mid L(\theta) \le \lambda\}$ is connected.
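Proof. Take two points $\theta_1, \theta_2$ that satisfy $L(\theta_1), L(\theta_2) \le \lambda$. Fix an arbitrary $\theta^*$ from the optimal set $\Theta^*$. From Corollary 2, we know the existence of a path with nonincreasing loss from $\theta_1$ to $\theta^*$, and from $\theta_2$ to $\theta^*$. Since the loss is nonincreasing along each path and starts at a value of at most $\lambda$, both paths stay inside the sublevel set $\{\theta \mid L(\theta) \le \lambda\}$. Concatenating the first path with the reverse of the second, we have found a path inside the sublevel set that connects $\theta_1$ and $\theta_2$. This means that the sublevel set is connected.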
Proof. Note that if we apply Theorem F.1 to the given problem, we have an almost identical characterization to that of Proposition F.1, except for the free skip connection. For the skip connection $u_0 \in \mathbb{R}^d$, we know that $\min_{u_0 \in \mathbb{R}^d} \langle X^T \nu^*, u_0 \rangle = 0$, where $u_0$ is unconstrained. A linear function of an unconstrained variable is bounded below only if it vanishes identically, which means $X^T \nu^* = 0$.
Next we give applications of Theorem F.1 to different architectures. We start by characterizing the optimal set of a vector-valued neural network with weight decay.
Proposition F.4. The solution set of the convex reformulation of the vector-valued problem given as
where the norm ∥V ∥ Ki, * is defined as
The optimal solution set of equation 25 is given as
where
Proof. Let's define A i as
which is a block matrix in A i ∈ R nc×dc . Also, define the flattening operation F l dc : R d×c → R dc and F l nc : R n×c → R nc . For optimization variables θ i ∈ R dc , we have the equivalent problem
Here, we are merely flattening each $V_i$ into a vector, which turns the problem into a vector-input optimization problem. When we apply Theorem F.1 to the flattened problem, we have the optimal set
where
dc , F l -1 nc to go back to the original solution space and recover P * vec . First, we know that A i θi = F l nc (D i XF l -1 dc ( θi )). Hence, the constraint
Also, consider the set
Here, 
from Wang et al. (2021a). Furthermore, when we write
Now we find a cone-constrained linear expression of $((XW_1)_+ w_2)_+$. Let's denote $D = \{D_i\}_{i=1}^{P_1}$ as the set of all possible arrangement patterns $\mathrm{diag}(1(Xh \ge 0))$, and let $D(m_1)$ denote the set of all $P_1^{m_1}$ possible size-$m_1$ tuples of elements in $D$. Let's note
Given D i (m 1 ), s, we define the set
j=1 as the set of all possible arrangements of diag(1( Xh ≥ 0)), where X
When D i (m 1 ), s, D ′ j are fixed, and (W 1 ) •,i (which denote the i -th column of W 1 ), w 2i are fixed in sets:
(2D ik -I)X(W 1 ) •,k ≥ 0, s k w 2k ≥ 0 ∀k ∈ [m 1 ],
(2D ′ j -I)( 


References:
[b0] Samuel K Ainsworth; Jonathan Hayase; Siddhartha Srinivasa (2022). Git re-basin: Merging models modulo permutation symmetries. 
[b1] Danil Akhtiamov; Matt Thomson (2023). Connectedness of loss landscapes via the lens of morse theory. PMLR
[b2] Maksym Andriushchenko; Francesco D'Angelo; Aditya Varre; Nicolas Flammarion (2023). Why do we need weight decay in modern deep learning?. 
[b3] Alberto Bietti; Joan Bruna; Clayton Sanford; Min Jae Song (2022). Learning single-index models with shallow neural networks. Advances in Neural Information Processing Systems
[b4] Etienne Boursier; Nicolas Flammarion (2023). Penalising the biases in norm regularisation enforces sparsity. Advances in Neural Information Processing Systems
[b5] Stephen Boyd; Lieven Vandenberghe (2004). Convex optimization. Cambridge university press
[b6] Johanni Brea; Berfin Simsek; Bernd Illing; Wulfram Gerstner (2019). Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape. 
[b7] Maria Sofia Bucarelli; Giuseppe Alessio D'inverno; Monica Bianchini; Franco Scarselli; Fabrizio Silvestri (2024). A topological description of loss surfaces based on betti numbers. Neural Networks
[b8] Yaim Cooper (2021). Global minima of overparameterized neural networks. SIAM Journal on Mathematics of Data Science
[b9] Thomas M Cover (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE transactions on electronic computers
[b10] Tolga Ergen; Mert Pilanci (2020). Training convolutional relu neural networks in polynomial time: Exact convex optimization formulations. 
[b11] Tolga Ergen; Mert Pilanci (2021). Global optimality beyond two layers: Training deep relu networks via convex programs. PMLR
[b12] Tolga Ergen; Mert Pilanci (2023). The convex landscape of neural networks: Characterizing global optima and stationary points via lasso models. 
[b13] C Daniel Freeman; Joan Bruna (2016). Topology and geometry of half-rectified network optimization. 
[b14] Timur Garipov; Pavel Izmailov; Dmitrii Podoprikhin; Dmitry P Vetrov; Andrew G Wilson (2018). Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in neural information processing systems
[b15] Rong Ge; Jason D Lee; Tengyu Ma (2017). Learning one-hidden-layer neural networks with landscape design. 
[b16] Philipp Grohs; Gitta Kutyniok (2022). Mathematical aspects of deep learning. Cambridge University Press
[b17] Benjamin D Haeffele; René Vidal (2017). Global optimality in neural network training. 
[b18] Boris Hanin (2021). Ridgeless interpolation with shallow relu networks in 1d is nearest neighbor curvature extrapolation and provably generalizes on lipschitz functions. 
[b19] Nirmit Joshi; Gal Vardi; Nathan Srebro (2023). Noisy interpolation learning with shallow univariate relu networks. 
[b20] Kenji Kawaguchi (2016). Deep learning without poor local minima. Advances in neural information processing systems
[b21] Rohith Kuditipudi; Xiang Wang; Holden Lee; Yi Zhang; Zhiyuan Li; Wei Hu; Rong Ge; Sanjeev Arora (2019). Explaining landscape connectivity of low-cost solutions for multilayer nets. Advances in neural information processing systems
[b22] Daniel Kunin; Jonathan Bloom; Aleksandrina Goeva; Cotton Seed (2019). Loss landscapes of regularized linear autoencoders. PMLR
[b23] Dawei Li; Tian Ding; Ruoyu Sun (2022). On the benefit of width for neural networks: Disappearance of basins. SIAM Journal on Optimization
[b24] Hao Li; Zheng Xu; Gavin Taylor; Christoph Studer; Tom Goldstein (2018). Visualizing the loss landscape of neural nets. Advances in neural information processing systems
[b25] Shiyu Liang; Ruoyu Sun; Jason D Lee; Rayadurgam Srikant (2018). Adding one neuron can eliminate all bad local minima. Advances in Neural Information Processing Systems
[b26] Shiyu Liang; Ruoyu Sun; Rayadurgam Srikant (2022). Revisiting landscape analysis in deep neural networks: Eliminating decreasing paths to infinity. SIAM Journal on Optimization
[b27] Aaron Mishkin; Mert Pilanci (2023). Optimal sets and solution paths of relu networks. PMLR
[b28] Julia Nakhleh; Joseph Shenouda; Robert D Nowak (2024). The effects of multi-task learning on relu neural network functions. 
[b29] Quynh Nguyen (2019). On connected sublevel sets in deep learning. PMLR
[b30] Quynh Nguyen (2021). A note on connectivity of sublevel sets in deep learning. 
[b31] Quynh N Nguyen; Pierre Bréchet; Marco Mondelli (2021). When are solutions connected in deep networks. Advances in Neural Information Processing Systems
[b32] Rahul Parhi; Robert D Nowak (2023). Deep learning meets sparse regularization: A signal processing perspective. IEEE Signal Processing Magazine
[b33] Mert Pilanci; Tolga Ergen (2020). Neural networks are convex regularizers: Exact polynomial-time convex optimization formulations for two-layer networks. PMLR
[b34] Arda Sahiner; Tolga Ergen; John Pauly; Mert Pilanci (2020). Vector-output relu neural network problems are copositive programs: Convex analysis of two layer networks and polynomial-time algorithms. 
[b35] Pedro Savarese; Itay Evron; Daniel Soudry; Nathan Srebro (2019). How do infinite width bounded norm networks look in function space. PMLR
[b36] Ekansh Sharma; Devin Kwok; Tom Denton; Daniel M Roy; David Rolnick; Gintare Karolina Dziugaite (2024). Simultaneous linear connectivity of neural networks modulo permutation. 
[b37] Berfin Simsek; François Ged; Arthur Jacot; Francesco Spadaro; Clément Hongler; Wulfram Gerstner; Johanni Brea (2021). Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances. PMLR
[b38] Ruoyu Sun; Dawei Li; Shiyu Liang; Tian Ding; Rayadurgam Srikant (2020). The global landscape of neural networks: An overview. IEEE Signal Processing Magazine
[b39] Luca Venturi; Afonso S Bandeira; Joan Bruna (2019). Spurious valleys in one-hidden-layer neural network optimization landscapes. Journal of Machine Learning Research
[b40] René Vidal; Zhihui Zhu; Benjamin D Haeffele (2022). Optimization landscape of neural networks. 
[b41] Yifei Wang; Tolga Ergen; Mert Pilanci (2021). Parallel deep neural networks have zero duality gap. 
[b42] Yifei Wang; Jonathan Lacotte; Mert Pilanci (2021). The hidden convex optimization landscape of regularized two-layer relu networks: an exact characterization of optimal solutions. 
[b43] Yaoqing Yang; Liam Hodgkinson; Ryan Theisen; Joe Zou; Joseph E Gonzalez; Kannan Ramchandran; Michael W Mahoney (2021). Taxonomizing local versus global structure in neural network loss landscapes. Advances in Neural Information Processing Systems
[b44] Emi Zeger; Yifei Wang; Aaron Mishkin; Tolga Ergen; Emmanuel Candès; Mert Pilanci (2024). A library of mirrors: Deep neural nets in low dimensions are convex lasso models with reflection features. 
[b45] Bo Zhao; Nima Dehmamy; Robin Walters; Rose Yu (2023). Understanding mode connectivity via parameter space symmetry. 

Figures:
Figure fig_0: 
Type: figure
Caption: of neurons min{m * + M * , n + 1}
Data: 

Figure fig_1: 2
Type: figure
Caption: Figure 2: Staircase of connectivity for a toy example. The figures above the horizontal line show the toy problem's loss landscape as the width $m$ changes. The red star denotes a single optimal solution while the blue line denotes a continuum of optimal solutions. The figures below the horizontal line show the corresponding optimal functions. The red/blue functions correspond to the functions parametrized by the red/blue sets in the loss landscape. Note that when $m = 3 = \min\{m^* + M^*, n + 1\}$, there exists a continuous deformation from one solution to another.
Data: 

Figure fig_2: 
Type: figure
Caption: Example 2. (Example of a class of non-unique optimal interpolators) Consider the figure in Figure 3a. Following the arrow, we can find s n , s n-1 , • • • s 1 defined in Proposition E.4, and can find that each s i has norm 1, s n = [0, 1] T and v n = [ √ 3/2, 1/2] T . For the example in Figure 3a, the data x and y can be chosen as x = y ≃[94, 29, 24, 20, 20]  T .
Data: 

Figure fig_3: 
Type: figure
Caption: Figure 3: A demonstration of non-unique interpolators for $n = 5$. Figure 3a shows the geometric construction behind finding the $v$'s proposed in Proposition 3. Figure 3b shows the continuum of optimal interpolators, and Figure 3c shows the learned interpolators trained by gradient descent.
Data: 

Figure fig_4: 4
Type: figure
Caption: Figure 4: A contour plot of the loss landscape. The three figures show the contour plot of the loss landscape shown in Figure 2. We can see the staircase of connectivity more clearly.
Data: 

Figure fig_5: 5
Type: figure
Caption: Figure 5: Learned functions found by gradient descent. The two figures show what functions gradient descent learns for the toy problem in Example 1. For both cases $m = 3$ and $m = 5$, either gradient descent gets stuck at a local minimum or finds one of the optimal networks in the continuum of optimal solutions.
Data: 

Figure fig_6: 6
Type: figure
Caption: Figure 6: Learned interpolators found by gradient descent. The two figures show what functions gradient descent learns for the toy problem in Example 2. We set $\beta = 0.1$ to approximately solve the minimum-norm interpolation problem. For both cases $m = 6$ and $m = 10$, either gradient descent gets stuck at a local minimum or finds one of the optimal networks in the continuum of optimal solutions.
Data: 

Figure fig_7: 
Type: figure
Caption: Corollary C.1. (Corollary 1 of the paper) Consider the optimization problem in equation 3. Denote the set of Clarke stationary points of equation 3 as Θ C . The set m j=1
Data: 

Figure fig_8: 5
Type: figure
Caption: Proposition D.5. The set $P^*(m^* + M^*)$ is connected. Proof. We first prove that two points $A, B \in P^*(m^* + M^*) \cap P^*_{irr}$ are connected with a continuous path in $P^*(m^* + M^*)$. Take two points $A \ne B \in P^*(m^* + M^*) \cap P^*_{irr}$. One observation that we can make is that $A$ has a continuous path in $P^*(m^* + M^*)$ to a certain solution $A_m \in P^*_{irr}$ that satisfies $\mathrm{card}(A_m) = m^*$. The construction of such a path is simple: interpolate $A$ and $A_m$, i.e. $f(t) = (1 - t)A + tA_m$. As we know $\mathrm{card}((1 - t)A + tA_m) \le \mathrm{card}(A) + \mathrm{card}(A_m) \le M^* + m^*$, the path has cardinality $\le m^* + M^*$. The last inequality follows from the fact that $A \in P^*_{irr}$ and $\mathrm{card}(A) \le M^*$. Due to the polytope characterization, $P^*$ is convex, and the path is in $P^*$. Combining these two, we know the existence of a path from $A$ to $A_m$ in $P^*(m^* + M^*)$. Similarly, we know the existence of a path from $B$ to $B_m$ in $P^*(m^* + M^*)$. At last, we know the existence of a path from $A_m$ to $B_m$, again by interpolating these two. Concluding, for any two $A, B \in P^*(m^* + M^*) \cap P^*_{irr}$, there exists a continuous path in $P^*(m^* + M^*)$
Data: 

Figure fig_9: 
Type: figure
Caption: fiw (T )a iw + s w=1 g jw (T )b jw = y * . • (Applying Lemma D.1) We also know that k w=1 µ w b w = y * and µ w > 0 ∀w ∈ [k]. Now we check the conditions to apply Lemma D.1 with A := C, B := B, given subset {a i1 , a i2 , • • • , a ir } ⊆ A. Then we know that C, B are linearly independent, w = y * , and r w=1 f iw (T ) > 0, µ w > 0 for all w ∈ [k]. Thus all conditions for Lemma D.1 are met, and we can find λiw for w ∈ [r], μjw for w ∈ [s], µ * that satisfies ∥µ * ∥ 0 ≤ n + 1 -r -s, µ * ≥ 0,
Data: 

Figure fig_10: 
Type: figure
Caption: = 0, choose η ′ w to find at least one η ′ w > 0 for w ∈ [y]. Then, write
Data: 

Figure fig_11: 
Type: figure
Caption: gjw (T )b jw = y * , and if identical C appeared twice we will have the same value of r w=1 f iw (T ) = m i=1 f i (t), which is contradicting the fact that m i=1 f i (t) strictly decreases.
Data: 

Figure fig_12: 
Type: figure
Caption: (a) Six linear regions of $(Xu)_+$. (b) The set $\mathrm{Conv}(Q_X \cup -Q_X)$. $y$ is denoted with the black point.
Data: 

Figure fig_13: 7
Type: figure
Caption: Figure 7: The shape of $Q_X$ for a certain 2d data. The input space is split into six regions, and each region becomes either a 1d or a 2d object in a 3-dimensional space. Observe that $Q_X$ meets $x + y + z = 1$ at six points.
Data: 

Figure fig_14: 8
Type: figure
Caption: Figure 8: Two different minimum-norm interpolators. We can see that the V shape is the minimum-norm interpolator, and one is the rotation of the other.
Data: 

Figure fig_15: 
Type: figure
Caption: ūi , vi are fixed directions found by solving the optimization problem ūi = arg min u∈Si ν * T D i Xu if min u∈Si ν * T D i Xu = -1, 0 otherwise, vi = arg min v∈Si -ν * T D i Xv if min v∈Si -ν * T D i Xv = -1, 0 otherwise. and ν * is any dual optimum that satisfies ν * = arg min⟨ν, y⟩ subject to |ν T
Data: 

Figure fig_16: 
Type: figure
Caption: (Proposition F.2. (The staircase of connectivity for minimum-norm interpolation problem) Write the solution of the optimization problem min Xw i ) + α i = y, as Θ * (m). Suppose y ̸ = 0. As m changes, we have that (i) m = m * , Θ * (m) is a finite set. Hence, for any two optimal points A ̸ = A ′ ∈ Θ * (m), there is no path from A to A ′ inside Θ * (m).
Data: 

Figure fig_17: 
Type: figure
Caption: (iv) $m \ge M^* + 1$, permutations of the solution are connected. Hence, for all $A \in \Theta^*(m)$, there exists $A' \ne A$ in $\Theta^*(m)$ and a path in $\Theta^*(m)$ that connects $A$ and $A'$. (v) $m \ge \min\{m^* + M^*, n + 1\}$, the set $\Theta^*(m)$ is connected, i.e. for any two optimal points $A \ne A' \in \Theta^*(m)$, there exists a continuous path from $A$ to $A'$. Proof. The proof follows from observing that Proposition D.2, Proposition D.3, Proposition D.4, Proposition D.5, Theorem D.1, Proposition D.7, Proposition D.12, Proposition D.13, Proposition D.14 hold for interpolation problems too. We can apply the same proof strategy for Proposition D.2, Proposition D.3, Proposition D.4, Proposition D.5, Theorem D.1
Data: 

Figure fig_18: 
Type: figure
Caption: f θ (X) = ((((XW 1 ) + W 2 ) + • • • )W L-1 ) + W L where W i ∈ R di-1×di , d 0 = d and θ = (W )i=1 . Consider the training problem
Data: 

Figure fig_20: 
Type: figure
Caption: m1 i=1 c i (X wai ) + zT ai + m2 i=1 d i (X w′ bi ) + z′ T bi = Y * ,
Data: 

Figure fig_21: 
Type: figure
Caption: tp j=1 ∥w aj ∥ 2 ∥z aj ∥ 2 , and Vp is a convex combination of waj zT aj λ j s sum up to 1. We know that ⟨N * , D p X Vp ⟩ = -β and N * satisfy min A∈Kp ⟨N * , D p XA⟩ ≥ -β.
Data: 

Figure fig_22: 
Type: figure
Caption: Theorem F.4. (Theorem 3 of the paper) Consider the training problem min m,{W1i,w2i,αi} m i=1 1 3 m i=1∥W 1i ∥ 3 F + ∥w 2i ∥ 3 2 + |α i | 3 subject to m i=1 ((XW 1i ) + w 2i ) + α i = y.Now we consider the change of variables, where we write(W 1 ) •,k w = v k ∈ R . The norm constraint becomes m1 k=1 ∥v k ∥ 2 ≤ 1.To show this, we show that{((W 1 ) •,k w 2k ) m1 k=1 | ∥W 1 ∥ F ≤ 1, ∥w 2 ∥ 2 ≤ 1} = {(v k ) m1 k=1 | m1 k=1 ∥v k ∥ 2 ≤ 1}. First, for ∥W 1 ∥ F ≤ 1, ∥w 2 ∥ 2 ≤ 1, assume the column weights are a 1 , a 2 , • • • a m1 . Then we have a 2 1 + • • • a 2 m1 ≤ 1, w 2 21 + w 2 22 + • • • + w 2 2m1 ≤ 1,and use Cauchy-Schwartz to see that m1 k=1 a k |w 2k | ≤ 1. To prove the latter, choose
Data: 

Figure fig_23: 
Type: figure
Caption: (y = [1, 1] T . The last column of X denotes the bias term.The problem is equivalent tomin Xw i ) + α i = y, ∥w i ∥ 2 ≤ 1.
Data: 

Figure fig_24: 
Type: figure
Caption: (a) {Xu | ∥u∥2 ≤ 1} (b) {(Xu)+| ∥u∥2 ≤ 1} (c) Conv(QX ∪ -QX )
Data: 

Figure fig_25: 
Type: figure
Caption: (Xw i ) + α i = y. Here, X = [x | 1] ∈ R n×2 . Now choose x = [-√ 3, √ 3] as before, but choose y = [1/2, 3/2].
Data: 


Formulas:
Formula formula_0: $\min_{\theta \in \mathbb{R}^p} L(f_\theta(X), y) + \beta R(\theta). \quad (1)$

Formula formula_1: $\Theta^* := \arg\min_{\theta \in \mathbb{R}^p} L(f_\theta(X), y) + \beta R(\theta) \subseteq \mathbb{R}^p, \quad F^* := \{f_\theta \mid \theta \in \Theta^*\} \subseteq F, \quad (2)$

Formula formula_3: K i = {u | (2D i -I)Xu ≥ 0} for i ∈ [P ]

Formula formula_4: if a ∈ R m and b ∈ R n , a ⊕ b ∈ R m+n , (a i ) p i=1 denotes a 1 ⊕ a 2 ⊕ • • • a p .

Formula formula_5: $p^* := \min_{\{w_j, \alpha_j\}_{j=1}^m} L\Big(\sum_{j=1}^m (Xw_j)_+ \alpha_j, y\Big) + \frac{\beta}{2} \sum_{j=1}^m \big(\|w_j\|_2^2 + \alpha_j^2\big). \quad (3)$

Formula formula_6: w j ∈ R d , α j ∈ R for j ∈ [m].

Formula formula_7: $p^*_{cvx} := \min_{\{u_i, v_i\}_{i=1}^P,\ u_i, v_i \in K_i} L\Big(\sum_{i=1}^P D_i X(u_i - v_i), y\Big) + \beta \sum_{i=1}^P \big(\|u_i\|_2 + \|v_i\|_2\big). \quad (4)$

Formula formula_8: (w i , α i ) = (u i / ∥u i ∥ 2 , ∥u i ∥ 2 ) for i ∈ [a], (w i+a , α i+a ) = (v i / ∥v i ∥ 2 , -∥v i ∥ 2 ) for i ∈ [m -a], without loss of generality assuming u i ̸ = 0 for i ∈ [a] and v i ̸ = 0 for i ∈ [m -a].

Formula formula_9: $d^* := \max_{|\nu^T (Xu)_+| \le \beta,\ \forall \|u\|_2 \le 1} -L^*(\nu), \quad (5)$

Formula formula_10: C y = P i=1 D i X(u * i -v * i ) | (u * i , v * i ) P i=1 ∈ Θ * = {y * } for some y * ∈ R n .

Formula formula_11: $P^*_{\nu^*} := \Big\{ (c_i \bar{u}_i, d_i \bar{v}_i)_{i=1}^P \ \Big|\ c_i, d_i \ge 0\ \forall i \in [P],\ \sum_{i=1}^P D_i X \bar{u}_i c_i - D_i X \bar{v}_i d_i = y^* \Big\} \subseteq \mathbb{R}^{2dP}, \quad (6)$

Formula formula_12: [P ], ((u i + u ′ i )/2, (v i + v ′ i )/2) P i=1

Formula formula_13: m j=1 w j /∥w j ∥ 2 | (w i , α i ) m i=1 ∈ Θ C , w j ̸ = 0

Formula formula_14: $P^*(m) := \Big\{ (u_i, v_i)_{i=1}^P \ \Big|\ (u_i, v_i)_{i=1}^P \in P^*,\ \sum_{i=1}^P 1(u_i \ne 0) + 1(v_i \ne 0) \le m \Big\} \subseteq \mathbb{R}^{2dP}, \quad (7)$

Formula formula_15: P * = {(0, 0) P i=1 } by assuming (w i , α i ) m i=1 ̸ = (0, 0) m i=1 ∈ Θ * (m)

Formula formula_16: (v) m ≥ min{m * + M * , n + 1}, the set Θ * (m) is connected.

Formula formula_17: m m = 1 m = 2 m = 3

Formula formula_18: (a i x j + b i ) + θ i -y j 2 + β 2 m i=1 (θ 2 i + a 2 i + b 2 i ).

Formula formula_19: min m,{ai,biθi} m i=0 m i=1 ∥a i ∥ 2 2 + b 2 i + θ 2 i , subject to Xa 0 + b 0 1 + m i=1 (Xa i + b i 1) + θ i = y. (8)

Formula formula_20: Q X = {(Xu) + | ∥u∥ 2 ≤ 1}, the convex set Conv(Q X ∪ -Q X )

Formula formula_21: min θi∈Ci∩Vi,si∈Di L( P i=1 A i θ i + Q i=1 B i s i , y) + β P i=1 R i (θ i ).(9)

Formula formula_22: A i , B i ∈ R n×d , θ i , s i ∈ R d , y ∈ R n , C i , D i are proper cones, R i

Formula formula_23: P * gen = (c i θi ) P i=1 ⊕(s i ) Q i=1 | c i ≥ 0, P i=1 c i A i θi + Q i=1 B i s i ∈ C y , θi ∈ Θi , ⟨B T i ν * , s i ⟩ = 0, s i ∈ D i .

Formula formula_24: (wi,zi) m i=1 1 2 ∥ m i=1 (Xw i ) + z T i -Y ∥ 2 F + β 2 m i=1 ∥w i ∥ 2 2 + ∥z i ∥ 2 2 ,(11)

Formula formula_25: min Vi 1 2 ∥ P i=1 D i XV i -Y ∥ 2 2 + β P i=1 ∥V i ∥ Ki, * .

Formula formula_26: ∥V ∥ Ki, * := min t ≥ 0 s.t. V ∈ tK i , for K i = conv{ug T |(2D i - I)Xu ≥ 0, ∥ug T ∥ * ≤ 1}, and V i = span{ug T |(2D i -I)Xu ≥ 0, g ∈ R c }.

Formula formula_27: min m,{W1i,w2i,αi} m i=1 1 3 m i=1 ∥W 1i ∥ 3 F + ∥w 2i ∥ 3 2 + |α i | 3 s.t. m i=1 ((XW 1i ) + w 2i ) + α i = y. (12

Formula formula_28: )

Formula formula_29: min {(θi,ai,bi)} m i=1 1 2 2 j=1 m i=1 (a i x j + b i ) + θ i -y j 2 + β 2 m i=1 (θ 2 i + a 2 i + b 2 i ),

Formula formula_30: {(x i , y i )} 2 i=1 = {(- √ 3, 1), ( √ 3, 1)} and β = 0.1. When we write X = - √ 3, 1 √ 3, 1 ∈ R 2×2 , y = [1, 1]

Formula formula_31: min U ∈R 2×m ,v∈R m 1 2 ∥(XU ) + v -y∥ 2 2 + β 2 (∥U ∥ 2 F + ∥v∥ 2 2 ).

Formula formula_32: F (t, s) = L( t s , [r]), for (t, s) ∈ [-1, 1] × [-0.5, 2]. t = 0, s = r is the only optimum.

Formula formula_33: U 0 = 0 0 r 0 , U 1 = √ 3r/(2 √ 2) - √ 3r/(2 √ 2) r/(2 √ 2) r/(2 √ 2) , U 2 = 0 0 0 r , v 0 = r 0 , v 1 = r/ √ 2 r/ √ 2 , v 2 = 0 r .

Formula formula_34: F (t, s) = L(cos(t)U 0 + 2s(U 1 -U 0 ) + sin(t)U 2 , cos(t)v 0 + 2s(v 1 -v 0 ) + sin(t)v 2 ). for (t, s) ∈ [-0.25, 0.6] × [-0.5, 0.3].

Formula formula_35: U 0 = 0, 0, 0 r, 0, 0 , U 1 = 0 √ 3r/(2 √ 2) - √ 3r/(2 √ 2) 0 r/(2 √ 2) r/(2 √ 2) , U 2 = 0, 0, 0 0, r, 0 , v 0 = r 0 0 , v 1 =   0 r/ √ 2 r/ √ 2   , v 2 = 0 r 0 .

Formula formula_36: F (t, s) = L(cos(t) cos(s)U 0 +cos(t) sin(s)U 1 +sin(t)U 2 , cos(t) cos(s)v 0 +cos(t) sin(s)v 1 +sin(t)v 2 ). for (t, s) ∈ [-0.5, 1] × [-0.5, 1].

Formula formula_37: f (x) = √ κt( √ 3κt 2 x + √ κt 2 ) + + √ κt(- √ 3κt 2 x + √ κt 2 ) + + κ(1 -2t) κ(1 -2t) + , where κ = 1 -β/2 and t ∈ [0, 1/2]. For ν T = [1/2, 1/2], we know that max ∥u∥2≤1 |ν T (Xu) + | = 1.

Formula formula_38: y * = m i=1 (Xu i ) + α i . Then, ⟨ν, y * ⟩ ≤ m i=1 |ν T (X u i ∥u i ∥ 2 )|∥u i ∥ 2 |α i | ≤ 1 2 m i=1 ∥u i ∥ 2 2 + |α i | 2 .

Formula formula_39: √ 3/2 1/2 , √ 2/2 √ 2/2 , 1/2 √ 3/2 , √ 6 - √ 2/4 √ 6 + √ 2/4 , 0 1 , - √ 3/2 1/2 .

Formula formula_40: y = 20((X ū1 ) + + (X ū3 ) + + (X ū5 ) + ),

Formula formula_41: f (x) = (20 -7.076t)([x, 1] • ū1 ) + + (13.1592t)([x, 1] • ū2 ) + + (20 -13.1623t)([x, 1] • ū3 ) + + (13.159t)([x, 1] • ū4 ) + + (20 -7.081t)([x, 1] • ū5 ) + + t([x, 1] • ū6 ) + ,

Formula formula_42: C y = P i=1 D i X(u * i -v * i ) | (u * i , v * i ) P i=1 ∈ Θ * .

Formula formula_43: C y = P i=1 D i X(u * i -v * i ) | (u * i , v * i ) P i=1 ∈ Θ * , C y = {y * } for some y * ∈ R n . Proof. Assume y 1 , y 2 ∈ C y and y 1 ̸ = y 2 . Let P i=1 D i X(u i -v i ) = y 1 and P i=1 D i X(u ′ i -v ′ i ) = y 2 for (u i , v i ) P i=1 , (u ′ i , v ′ i ) P i=1 ∈ Θ * . Think of ( ui+u ′ i 2 , vi+v ′ i 2 ) P i=1 = θ avg . The objective value of θ avg is L( y 1 + y 2 2 , y) + β P i=1 ∥ u i + u ′ i 2 ∥ 2 + ∥ v i + v ′ i 2 ∥ 2

Formula formula_44: 1 2 L(y 1 , y) + β P i=1 ∥u i ∥ 2 + ∥v i ∥ 2 + L(y 2 , y) + β P i=1 ∥u ′ i ∥ 2 + ∥v ′ i ∥ 2 .

Formula formula_45: min u∈Si ν T D i Xu, min u∈Si -ν T D i Xu,

Formula formula_46: S i = K i ∩ {u | ∥u∥ 2 ≤ 1}.

Formula formula_47: min u∈Si (w * ) T u,

Formula formula_48: (w * ) T u * 1 = (w * ) T u * 2 = p * = (w * ) T ( u * 1 + u * 2 2 ),

Formula formula_49: ∥u * 1 + u * 2 ∥ 2 < 2 because u * 1 ̸ = u * 2 . Scale (u * 1 + u * 2 )/2 to obtain contradiction that u * 1 is the minimizer.

Formula formula_50: ūi = arg min u∈Si ν * T D i Xu if min u∈Si ν * T D i Xu = -β, 0 otherwise, vi = arg min v∈Si -ν * T D i Xv if min v∈Si -ν * T D i Xv = -β, 0 otherwise.

Formula formula_51: ν * = arg max -L * (ν) subject to |ν T D i Xu| ≤ β∥u∥ 2 ∀u ∈ K i , i ∈ [P ].

Formula formula_52: (Xh ≥ 0)) for i ∈ [P ], S i = K i ∩ {u | ∥u∥ 2 ≤ 1}.

Formula formula_53: P * ν * := (c i ūi , d i vi ) P i=1 | c i , d i ≥ 0 ∀i ∈ [P ], P i=1 D i X ūi c i -D i X vi d i = y * ⊆ R 2dP ,(13

Formula formula_54: Θ * ⊆ P * ν * . Take a point (u * i , v * i ) P i=1 ∈ Θ * . We first know that P i=1 D i X(u * i - v * i ) = y * from Proposition 1. What we would like to do is showing the existence of c i , d i that satisfies c i ≥ 0, u * i = c i ūi , d i ≥ 0, v * i = d i vi , where ūi , vi are ūi = arg min u∈Si ν * T D i Xu if min u∈Si ν * T D i Xu = -β, 0 otherwise, vi = arg min v∈Si -ν * T D i Xv if min v∈Si -ν * T D i Xv = -β, 0 otherwise. Consider the Lagrangian L((u i , v i ) P i=1 , z, ν) = L(z, y) -ν T z + P i=1 (β∥u i ∥ 2 + ν T D i Xu i ) + P i=1 (β∥v i ∥ 2 -ν T D i Xv i ),

Formula formula_55: min ui,vi∈Ki,z max ν L((u i , v i ) P i=1 , z, ν) = max ν min ui,vi∈Ki,z L((u i , v i ) P i=1 , z, ν),

Formula formula_56: A = {(w - P i=1 D i X(u i -v i ), t) | u i , v i ∈ K i , L(w, y) + β P i=1 ∥u i ∥ 2 + ∥v i ∥ 2 ≤ t},

Formula formula_57: B = {(0, s) | s < p * }, it is clear that A ∩ B = ∅.

Formula formula_58: (ν, μ) ∈ R n × R which is nonzero, α such that (z, t) ∈ A ⇒ νT z + μt ≥ α ≥ μp * ,

Formula formula_59: P i=1 D i X(u i -v i ))

Formula formula_60: max ν min (ui,vi) P i=1 ,z L((u i , v i ) P i=1 , z, ν) writes maximize -L * (ν) subject to β∥u∥ 2 ≥ |ν T D i Xu| ∀u ∈ K i , i ∈ [P ].

Formula formula_61: u ′ i that satisfies u ′ i ∈ K i and ν ′T D i Xu ′ i + β∥u ′ i ∥ 2 < 0. As we can scale t → ∞ for tu ′ i to see that for that ν ′ , min (ui,vi) P i=1 ,z L((u i , v i ) P i=1 , z, ν ′ ) = -∞.

Formula formula_62: T D i Xu + β∥u∥ 2 ≥ 0 for all u ∈ K i , i ∈ [P ]. Similarly, we only need to see ν that satisfies -ν T D i Xu + β∥u∥ 2 ≥ 0 for all u ∈ K i , i ∈ [P ]. Hence, ν * is the maximizer of max ν min z L(z, y) -ν T z subject to β∥u∥ 2 ≥ |ν T D i Xu| ∀u ∈ K i , i ∈ [P ],

Formula formula_63: ((u * i , v * i ) P i=1 , y * ), the function L((u i , v i ) P i=1 , z, ν * ) attains minimum at ((u * i , v * i ) P i=1 , y * )

Formula formula_64: L((u i , v i ) P i=1 , z, ν * ) = L(z, y) -ν * T z + P i=1 (β∥u i ∥ 2 + ν * T D i Xu i ) + P i=1 (β∥v i ∥ 2 -ν * T D i Xv i ),

Formula formula_65: 2 -ν * T D i Xu subject to u ∈ K i is 0 for all i ∈ [P ]. As ((u * i , v * i ) P i=1 , y * ) minimizes L((u i , v i ) P i=1 , z, ν * ), β∥u * i ∥ 2 + ν * T D i Xu * i = 0, β∥v * i ∥ 2 -ν * T D i Xv * i = 0.

Formula formula_66: i) When u * i = 0, let c i = 0 to find c i ≥ 0 that satisfies u * i = c i ūi . ii) When u * i ̸ = 0, notice that min u∈Si ν * T D i Xu = -β ̸ = 0,

Formula formula_67: u * i /∥u * i ∥ 2 . To see this, recall that (ν * ) T D i Xu + β∥u∥ 2 ≥ 0 and (ν * ) T D i Xu/∥u∥ 2 ≥ -β for all nonzero u ∈ K i , which implies that min u∈Si (ν * ) T D i Xu = -β.

Formula formula_68: u * i /∥u * i ∥ 2 = ūi . Hence choosing c i = ∥u * i ∥ 2 gives c i ≥ 0 that satisfies u * i = c i ūi . Hence, we have found c i ≥ 0, d i ≥ 0 that satisfies u * i = c i ūi , v * i = d i vi and P i=1 D i X(u * i -v * i ) = P i=1 D i X(c i ūi -d i vi ) = y * , meaning (u * i , v * i ) P i=1 ∈ P * . Now, we show that P * ν * ⊆ Θ * . Take an element (c i ūi , d i vi ) P i=1 ∈ P * ν * . It is clear that c i ūi ∈ C i , d i vi ∈ D i . If ūi ̸ = 0, we know that (ν * ) T D i X ūi = -β. Similarly, if vi ̸ = 0, we know that -(ν * ) T D i X vi = -β. Also, if ūi , vi ̸ = 0, ∥ū i ∥ 2 = 1, ∥v i ∥ 2 = 1,

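Formula formula_69: $\sum_{i=1}^P D_i X(c_i \bar{u}_i - d_i \bar{v}_i) = y^*$, hence the objective becomes $L(y^*, y) + \beta \sum_{\bar{u}_i \neq 0} c_i + \beta \sum_{\bar{v}_i \neq 0} d_i$, using $\|\bar{u}_i\|_2 = 1$, $\|\bar{v}_i\|_2 = 1$. Now, as $\sum_{i=1}^P D_i X(c_i \bar{u}_i - d_i \bar{v}_i) = y^*$, multiplying both sides by $(\nu^*)^T$ gives $\sum_{\bar{u}_i \neq 0} c_i + \sum_{\bar{v}_i \neq 0} d_i = -\langle \nu^*, y^* \rangle / \beta$.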

Formula formula_70: w j ∥w j ∥ 2 | (w i , α i ) m i=1 ∈ Θ C , w j ̸ = 0 ,

Formula formula_71: (w i , α i ) m i=1 ∈ Θ C , we have a convex program with subsampled arrangement patterns D1 , D2 , • • • Dm ∈ {D i } P i=1 , min ui,vi∈ Ki L m i=1 Di X(u i -v i ), y + β m i=1 ∥u i ∥ 2 + ∥v i ∥ 2 ,

Formula formula_72: $(w_i, \alpha_i) = \left(u_i / \sqrt{\|u_i\|_2}, \ \sqrt{\|u_i\|_2}\right)$ if $u_i \neq 0$, $\left(v_i / \sqrt{\|v_i\|_2}, \ -\sqrt{\|v_i\|_2}\right)$ if $v_i \neq 0$.

Formula formula_73: m j=1 w j ∥w j ∥ 2 | (w i , α i ) m i=1 ∈ Θ C , w j ̸ = 0 ,

Formula formula_74: $\mathrm{card}((u_i, v_i)_{i=1}^P) = \sum_{i=1}^P \mathbb{1}(u_i \neq 0) + \mathbb{1}(v_i \neq 0)$.
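The cardinality above simply counts the nonzero blocks $u_i$ and $v_i$. A minimal sketch in plain Python, assuming a solution is stored as a list of `(u_i, v_i)` pairs of coordinate lists; the helper name `card` and the tolerance argument `tol` are illustrative choices, not part of the paper:

```python
def card(solution, tol=0.0):
    """Count nonzero blocks among the (u_i, v_i) pairs of a solution.

    `solution` is a list of (u_i, v_i) pairs, each a list of floats.
    `tol` is a hypothetical numerical threshold (0.0 means exact zero test).
    """
    def nonzero(vec):
        return any(abs(x) > tol for x in vec)

    return sum(int(nonzero(u)) + int(nonzero(v)) for u, v in solution)


# Three sign patterns (P = 3); only u_1, u_3 and v_3 are nonzero.
example = [
    ([1.0, 0.0], [0.0, 0.0]),
    ([0.0, 0.0], [0.0, 0.0]),
    ([0.5, -2.0], [3.0, 0.0]),
]
print(card(example))  # 3
```

With this helper, membership in the set of solutions of cardinality at most $m$ reduces to checking optimality together with `card(solution) <= m`.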

Formula formula_75: $\mathcal{P}^*(m) := \left\{ (u_i, v_i)_{i=1}^P \mid (u_i, v_i)_{i=1}^P \in \mathcal{P}^*, \ \mathrm{card}((u_i, v_i)_{i=1}^P) \le m \right\} \subseteq \mathbb{R}^{2dP}$. (14)

Formula formula_77: $\mathcal{P}^*_{\mathrm{irr}} = \left\{ (u_i, v_i)_{i=1}^P \mid (u_i, v_i)_{i=1}^P \in \mathcal{P}^*, \ \{D_i X u_i\}_{u_i \neq 0} \cup \{D_i X v_i\}_{v_i \neq 0} \text{ linearly independent} \right\}$. (15)

Formula formula_79: ": if the set {D i Xu i } ui̸ =0 ∪ {D i Xv i } vi̸ =0

Formula formula_80: {D i Xu i } ui̸ =0 ∪ {-D i Xv i } vi̸ =0 .

Formula formula_81: (w i , α i ) m i=1 ̸ = 0 ∈ Θ * (m) ii) There exists (u i , v i ) P i=1 ̸ = 0 ∈ P * iii) P * irr ̸ = ∅ Proof. i) ⇒ ii): First assume m ≥ 2P . Consider Φ((w i , α i ) m i=1 ) = (u i , v i ) P i=1

Formula formula_82: (w i , α i ) m i=1 ∈ Θ * (m)

Formula formula_83: 1 2 ∥y∥ 2 2 ≤ p * m .

Formula formula_84: ui̸ =0 c i D i Xu i + vi̸ =0 d i D i Xv i = 0,

Formula formula_85: $(u'_i, v'_i)_{i=1}^P = ((1 + c_i t) u_i, (1 - d_i t) v_i)_{i=1}^P$, where $t = \min\left\{ \min_{c_i < 0} -\frac{1}{c_i}, \ \min_{d_i > 0} \frac{1}{d_i} \right\}$,
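The rescaling step above can be checked numerically. The sketch below assumes the linear dependence takes the form $\sum_i c_i D_i X u_i + \sum_i d_i D_i X v_i = 0$ and represents each $D_i X u_i$ and $D_i X v_i$ by a plain list; the function names `shrink_step` and `output` and the toy vectors are hypothetical.

```python
def shrink_step(c, d):
    """Return (t, u_scale, v_scale) for u_i -> (1 + c_i t) u_i, v_i -> (1 - d_i t) v_i."""
    candidates = [-1.0 / ci for ci in c if ci < 0] + [1.0 / di for di in d if di > 0]
    if not candidates:
        # Assumption: negate all dependence coefficients first if this happens.
        raise ValueError("need some c_i < 0 or d_i > 0")
    t = min(candidates)
    return t, [1.0 + ci * t for ci in c], [1.0 - di * t for di in d]


def output(p, q, u_scale, v_scale):
    """sum_i u_scale[i] * p[i] - sum_i v_scale[i] * q[i], with p_i = D_i X u_i, q_i = D_i X v_i."""
    out = [0.0] * len(p[0])
    for s, vec in zip(u_scale, p):
        out = [o + s * x for o, x in zip(out, vec)]
    for s, vec in zip(v_scale, q):
        out = [o - s * x for o, x in zip(out, vec)]
    return out


# Toy data: -0.5 * p_1 - 1.0 * p_2 + 1.0 * q_1 = 0 is a linear dependence.
p = [[2.0, 0.0], [0.0, 1.0]]
q = [[1.0, 1.0]]
c = [-0.5, -1.0]
d = [1.0]

t, u_scale, v_scale = shrink_step(c, d)
before = output(p, q, [1.0, 1.0], [1.0])
after = output(p, q, u_scale, v_scale)
print(t, u_scale, v_scale, before, after)
```

In this toy instance the model output is unchanged while two of the three blocks are scaled to zero, which is the mechanism that lowers the cardinality.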

Formula formula_86: {D i Xu i } ui̸ =0 ∪ {D i Xv i } vi̸ =0 is linearly independent.

Formula formula_87: $m^* := \min_{(u_i, v_i)_{i=1}^P \in \mathcal{P}^*_{\mathrm{irr}}} \mathrm{card}((u_i, v_i)_{i=1}^P)$, $M^* := \max_{(u_i, v_i)_{i=1}^P \in \mathcal{P}^*_{\mathrm{irr}}} \mathrm{card}((u_i, v_i)_{i=1}^P)$.

Formula formula_88: u i ̸ = 0 ⇔ u ′ i ̸ = 0 and v i ̸ = 0 ⇔ v ′ i ̸ = 0 for i ∈ [P ].

Formula formula_89: P * (m * ) ⊆ P * , P i=1 D i X(u i -v i ) = y * = P i=1 D i X(u ′ i -v ′ i ).

Formula formula_90: i ̸ = 0} = {a 1 , a 2 , • • • a t }, {i|v i ̸ = 0} = {b 1 , b 2 , • • • b s }. We have that t + s ≤ m * as (u i , v i ) P i=1 ∈ P * (m * ). From Theorem 1, we know the existence of c ai , c ′ ai ≥ 0 for i ∈ [t] and d bi , d ′ bi ≥ 0 for i ∈ [s] that satisfies u ai = c ai ūai , u ′ ai = c ′ ai ūai , ∀i ∈ [t], v bi = d bi vbi , v ′ bi = d ′ bi vbi , ∀i ∈ [s]. This means that t i=1 c ai D ai X ūai - s i=1 d bi D bi X vbi = t i=1 c ′ ai D ai X ūai - s i=1 d ′ bi D bi X vbi = y * ,

Formula formula_91: {D ai X ūai } t i=1 ∪ {D bi X vbi } s i=1

Formula formula_92: $|\mathcal{P}^*(m^*)| \le \sum_{j=1}^{m^*} \binom{2P}{j}$.
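Reading the bound above as a sum of binomial coefficients (one term per possible support size $j$ among the $2P$ blocks), it can be evaluated directly. This is a small illustrative sketch; the function name `support_count_bound` is hypothetical.

```python
from math import comb


def support_count_bound(P, m_star):
    """Sum over j = 1..m_star of C(2P, j): the number of nonempty supports of size <= m_star."""
    return sum(comb(2 * P, j) for j in range(1, m_star + 1))


print(support_count_bound(2, 2))  # 10
print(support_count_bound(3, 2))  # 21
```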

Formula formula_93: $(u^\bullet_i, v^\bullet_i)_{i=1}^P \in \mathcal{P}^*_{\mathrm{irr}}$, namely a solution $(u^\bullet_i, v^\bullet_i)_{i=1}^P \in \mathcal{P}^*$ for which $\{D_i X u^\bullet_i\}_{u^\bullet_i \neq 0} \cup \{D_i X v^\bullet_i\}_{v^\bullet_i \neq 0}$ is linearly independent and $\mathrm{card}((u^\bullet_i, v^\bullet_i)_{i=1}^P) = M^*$. Assume the existence of a continuous function $f : [0, 1] \to \mathcal{P}^*(M^*)$ satisfying $f(0) = (u^\bullet_i, v^\bullet_i)_{i=1}^P$, $f(1) = (u'_i, v'_i)_{i=1}^P$, $f(0) \neq f(1)$. Now, write $f(t) = (u_i(t), v_i(t))_{i=1}^P$ and define $c_i(t) = \begin{cases} 0 & \text{if } \bar{u}_i = 0 \\ \|u_i(t)\|_2 & \text{otherwise,} \end{cases}$ $\quad d_i(t) = \begin{cases} 0 & \text{if } \bar{v}_i = 0 \\ \|v_i(t)\|_2 & \text{otherwise.} \end{cases}$ For the definition of $\bar{u}_i, \bar{v}_i$, see Theorem 1. Some things to notice are: i) The functions $c_i(t), d_i(t) : [0, 1] \to \mathbb{R}$ are continuous. ii) $f(t) = (c_i(t)\bar{u}_i, d_i(t)\bar{v}_i)_{i=1}^P$

Formula formula_94: $\sum_{i=1}^P \mathbb{1}(c_i(t) \neq 0) + \mathbb{1}(d_i(t) \neq 0) \le M^*$, and $\sum_{i=1}^P \mathbb{1}(c_i(0) \neq 0) + \mathbb{1}(d_i(0) \neq 0) = M^*$.

Formula formula_95: • i , v • i ) P i=1 has cardinality M * . iv) We know that there exists t ′ ∈ [0, 1] that satisfies (c i (t ′ ), d i (t ′ )) P i=1 ̸ = (c i (0), d i (0)) P i=1 . It is because f (0) ̸ = f (1).

Formula formula_96: {D i Xu • i } u • i ̸ =0 ∪ {D i Xv • i } v • i ̸ =0

Formula formula_97: $(u^\bullet_i, v^\bullet_i)_{i=1}^P$ is isolated. Let's define $t_1$ as $t_1 = \inf_{t \ge 0} \left\{ t \mid \sum_{i=1}^P (c_i(0) - c_i(t))^2 + (d_i(0) - d_i(t))^2 > 0 \right\}$.

Formula formula_98: $\sum_{i=1}^P (c_i(0) - c_i(t_1))^2 + (d_i(0) - d_i(t_1))^2 = 0$. (16)

Formula formula_100: P i=1 (c i (0) -c i (t 1 )) 2 + (d i (0) -d i (t 1

Formula formula_101: P i=1 (c i (0) -c i (t 1 -ϵ)) 2 + (d i (0) -d i (t 1 -ϵ)) 2 > 0 because of continuity,

Formula formula_102: P i=1 (c i (0) -c i (t)) 2 + (d i (0) -d i (t)) 2 > 0). Hence, for t ∈ [0, t 1 ], P i=1 (c i (0) -c i (t)) 2 + (d i (0) -d i (t)) 2 = 0.

Formula formula_103: $\sum_{i=1}^P (c_i(0) - c_i(t_\epsilon))^2 + (d_i(0) - d_i(t_\epsilon))^2 > 0$. (17)

Formula formula_105: P i=1 (c i (0) -c i (t)) 2 + (d i (0) -d i (t)) 2 > 0 for some ϵ > 0, it means that for all t ∈ [0, t 1 + ϵ 2 ], P i=1 (c i (0) -c i (t)) 2 + (d i (0) -d i (t)

Formula formula_106: {D i Xu • i } u • i ̸ =0 ∪ {D i Xv • i } v • i ̸ =0 is linearly dependent. From equation 16, we know that c i (0) = c i (t 1 ), d i (0) = d i (t 1 ) ∀i ∈ [P ].

Formula formula_107: M * = P i=1 1(c i (0) ̸ = 0) + 1(d i (0) ̸ = 0) ≤ P i=1 1(c i (t) ̸ = 0) + 1(d i (t) ̸ = 0) ≤ M * ,

Formula formula_108: P i=1 1(c i (t) ̸ = 0) + 1(d i (t) ̸ = 0) = M * , ∀t ∈ [t 1 -ϵ 0 , t 1 + ϵ 0 ].

Formula formula_109: t ∈ [t 1 -ϵ 0 , t 1 + ϵ 0 ], we know that c i (0) > 0 ⇔ c i (t) > 0 and d i (0) > 0 ⇔ d i (t) > 0.

Formula formula_110: $\sum_{i=1}^P (c_i(0) - c_i(t_{\epsilon_0}))^2 + (d_i(0) - d_i(t_{\epsilon_0}))^2 > 0$, $(c_i(0), d_i(0))_{i=1}^P \neq (c_i(t_{\epsilon_0}), d_i(t_{\epsilon_0}))_{i=1}^P$. Also, $c_i(0) > 0 \Leftrightarrow c_i(t_{\epsilon_0}) > 0$ and $d_i(0) > 0 \Leftrightarrow d_i(t_{\epsilon_0}) > 0$. Now we have found two different solutions $(c_i(0)\bar{u}_i, d_i(0)\bar{v}_i)_{i=1}^P$, $(c_i(t_{\epsilon_0})\bar{u}_i, d_i(t_{\epsilon_0})\bar{v}_i)_{i=1}^P \in \mathcal{P}^*(M^*)$, which means that $\sum_{i=1}^P c_i(0) D_i X \bar{u}_i - d_i(0) D_i X \bar{v}_i = y^* = \sum_{i=1}^P c_i(t_{\epsilon_0}) D_i X \bar{u}_i - d_i(t_{\epsilon_0}) D_i X \bar{v}_i$. (18)

Formula formula_111: d i (0)) P i=1 ̸ = (c i (t ϵ0 ), d i (t ϵ0 )) P i=1 and c i (0) > 0 ⇔ c i (t ϵ0 ) > 0, d i (0) > 0 ⇔ d i (t ϵ0

Formula formula_112: {D i Xu • i } u • i ̸ =0 ∪ {D i Xv • i } v • i ̸ =0 -hence the set {D i Xu • i } u • i ̸ =0 ∪ {D i Xv • i } v • i ̸ =0

Formula formula_113: P * (m) is connected. Proposition D.4. Consider (u i , v i ) P i=1 ∈ P * -P * irr . Let m = card((u i , v i ) P i=1 ).

Formula formula_114: (u ′ i , v ′ i ) P i=1 ∈ P * (m) ∩ P * irr .

Formula formula_115: A ′ to B ′ in P * (m * + M * ).

Formula formula_116: A = {a 1 , a 2 , • • • a m }, B = {b 1 , b 2 , • • • b k } ⊆ R n and a given subset I = {a i1 , a i2 , • • • a it } ⊂ A. Also, m i=1 λ i a i = k i=1 µ i b i ,

Formula formula_117: 1) ∥µ * ∥ 0 ≤ n -m + 1. 2) µ * ≥ 0. 3) k i=1 µ * i b i ∈ span({a 1 , a 2 , • • • a m }

Formula formula_118: k i=1 µ * i b i = m i=1 δ i a i , t j=1 δ ij > 0.

Formula formula_119: k i=1 μi b i ∈ span({a 1 , a 2 , • • • a m }), k i=1 μi b i = m i=1 δ i a i and t j=1 δ ij > 0,

Formula formula_120: A = {a 1 , a 2 , • • • , a m } and B = {b i |i ∈ [k], μi ̸ = 0}

Formula formula_121: i∈[k],μi̸ =0 μi b i = m i=1 δ i a i , t j=1 δ ij > 0,

Formula formula_122: |B • | ≤ n -m + 1, bi∈B • µ • i b i = m i=1 δ • i a i , t j=1 δ • ij > 0, µ • i > 0 if b i ∈ B • , 0 otherwise.

Formula formula_123: • | ≤ n -m + 1 is that if |B • | > n -m + 1,

Formula formula_124: {a 1 , a 2 , • • • a n }. Express each b i s as b i = n j=1 γ ij a j .

Formula formula_125: i ∈ [k]. Now, write Γ ∈ R n×k as Γ ij = γ ji .

Formula formula_126: $\Gamma \mu = \begin{bmatrix} \lambda \\ 0_{n-m} \end{bmatrix}$,

Formula formula_127: k i=1 µ i b i . Now we know the set {µ ∈ R k | 1 T Γ[i 1 , i 2 , • • • i t ]µ = 0, Γ[m + 1] T µ = 0, • • • Γ[n] T µ = 0} has dimension at least k -n + m -1, as each linear constraint decreases the dimension at most 1. Here Γ[p 1 , p 2 , • • • p r ] ∈ R r×k denotes the concatenation of r rows of Γ, Γ[p 1 ] to Γ[p r ]. As k -n + m -1 > 0, there exists a nonzero µ ′ that satisfies Γµ ′ = λ ′ 0 n-m , t j=1 λ ′ ij = 0.

Formula formula_128: $\Gamma(\mu + \epsilon \mu') = \begin{bmatrix} \lambda + \epsilon \lambda' \\ 0_{n-m} \end{bmatrix}$,

Formula formula_129: $\sum_{j=1}^t \lambda'_{i_j} = 0$, $\sum_{j=1}^t (\lambda_{i_j} + \epsilon \lambda'_{i_j}) > 0$.

Formula formula_130: A = (u i , v i ) P i=1 , B = (u ′ i , v ′ i ) P i=1 . Also, let's write A = {D i X ūi } ui̸ =0 ∪ {-D i X vi } vi̸ =0 , B = {D i X ūi } u ′ i ̸ =0 ∪ {-D i X vi } v ′ i ̸ =0

Formula formula_131: A = {a 1 , a 2 , • • • a m } ⊆ R n , B = {b 1 , b 2 , • • • b k } ⊆ R n . At last, λ 1 , λ 2 , • • • λ m , µ 1 , µ 2 , • • • µ k are unique nonnegative numbers that satisfy m i=1 λ i a i = k i=1 µ i b i = y * .

Formula formula_132: $F_1, F_2, \dots, F_m, G_1, G_2, \dots, G_k : [0, 1] \to \mathbb{R}$ that satisfy: Property 1) $F_i(0) = \lambda_i$, $F_i(1) = 0$, $F_i(t) \ge 0$ $\forall i \in [m], t \in [0, 1]$. Property 2) $G_j(0) = 0$, $G_j(1) = \mu_j$, $G_j(t) \ge 0$ $\forall j \in [k], t \in [0, 1]$. Property 3) $\sum_{i=1}^m F_i(t) a_i + \sum_{j=1}^k G_j(t) b_j = y^*$ $\forall t \in [0, 1]$. Property 4) $\sum_{i=1}^m \mathbb{1}(F_i(t) > 0) + \sum_{j=1}^k \mathbb{1}(G_j(t) > 0) \le n + 1$ $\forall t \in [0, 1]$.

Formula formula_133: (u i (t), v i (t)) P i=1

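Formula formula_134: $u_i(t) = \begin{cases} (F_p(t) + G_q(t))\bar{u}_i & \text{if } a_p = b_q = D_i X \bar{u}_i \\ F_p(t)\bar{u}_i & \text{if } a_p = D_i X \bar{u}_i, \ \nexists\, q \in [k] \text{ such that } b_q = D_i X \bar{u}_i \\ G_q(t)\bar{u}_i & \text{if } b_q = D_i X \bar{u}_i, \ \nexists\, p \in [m] \text{ such that } a_p = D_i X \bar{u}_i \\ 0 & \text{otherwise,} \end{cases}$ $\quad v_i(t) = \begin{cases} (F_p(t) + G_q(t))\bar{v}_i & \text{if } a_p = b_q = -D_i X \bar{v}_i \\ F_p(t)\bar{v}_i & \text{if } a_p = -D_i X \bar{v}_i, \ \nexists\, q \in [k] \text{ such that } b_q = -D_i X \bar{v}_i \\ G_q(t)\bar{v}_i & \text{if } b_q = -D_i X \bar{v}_i, \ \nexists\, p \in [m] \text{ such that } a_p = -D_i X \bar{v}_i \\ 0 & \text{otherwise.} \end{cases}$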

Formula formula_135: $\sum_{i=1}^P D_i X(u_i(t) - v_i(t)) = \sum_{i=1}^m F_i(t) a_i + \sum_{j=1}^k G_j(t) b_j = y^*$.

Formula formula_136: $\mathrm{card}((u_i(t), v_i(t))_{i=1}^P) \le \sum_{i=1}^m \mathbb{1}(F_i(t) > 0) + \sum_{j=1}^k \mathbb{1}(G_j(t) > 0) \le n + 1$

Formula formula_137: Initialize C = A, f i (0) = λ i , g i (0) = 0.

Formula formula_138: C = {a i1 , a i2 , • • • a ir } ∪ {b j1 , b j2 , • • • b js }. We inductively have: 1) C is a linearly independent set. 2) f i (T ) ≥ 0 ∀i ∈ [m], g j (T ) ≥ 0 ∀j ∈ [k]. 3) f i (T ) > 0 ⇔ i ∈ {i 1 , i 2 , • • • i r }, g j (T ) > 0 ⇔ j ∈ {j 1 , j 2 , • • • j s }. 4)

Formula formula_139: f i (t) = f i (T ) = 0 if i / ∈ {i 1 , i 2 , • • • i r } f i (T ) -α λi (t -T ) if i ∈ {i 1 , i 2 , • • • i r }, t ∈ [T, T + 1/2], g i (t) = g i (T ) + αµ * i (t -T ) = αµ * i (t -T ) if i / ∈ {j 1 , j 2 , • • • j s } g i (T ) + αµ * i (t -T ) -αμ i (t -T ) if i ∈ {j 1 , j 2 , • • • j s }, t ∈ [T, T +1/2].

Formula formula_140: * jw )} > 0. Update C so that f i (T + 1/2) > 0 ⇔ a i ∈ C, g i (T + 1/2) > 0 ⇔ b i ∈ C.

Formula formula_141: s i (0) = f i (T + 1/2), i ∈ [m], z j (0) = g j (T + 1/2), j ∈ [k] and repeat: (Check) If C is linearly independent, break (Update) Say C = {a r1 , a r2 , • • • a rx } ∪ {b s1 , b s2 , • • • b sy }.

Formula formula_142: $s_{r_w}(r + t) = s_{r_w}(r) - \alpha \eta_w t$, $z_{s_w}(r + t) = z_{s_w}(r) - \alpha \eta'_w t$ for $t \in [0, 1]$. Here $\alpha = \min\left\{ \min_{\eta_w > 0} s_{r_w}(r)/\eta_w, \ \min_{\eta'_w > 0} z_{s_w}(r)/\eta'_w \right\}$. At last, update $C$ so that $s_i(r + 1) > 0 \Leftrightarrow a_i \in C$, $z_i(r + 1) > 0 \Leftrightarrow b_i \in C$. Increase $r$ by 1. • (Construct $f_i, g_j$ for $t \in [T + 1/2, T + 1]$) Concatenate $f_i$ and $s_i$, $g_j$ and $z_j$ for all $i \in [m]$, $j \in [k]$.

Formula formula_143: F i , G j : [0, 1] → R, simply write F i (t) = f i (T /T * ), G j (t) = g j (T /T * ).

Formula formula_144: m i=1 f i (t)a i + k j=1 g j (t)b j ,

Formula formula_145: m i=1 f i (t)

Formula formula_146: F i (1) = 0 for all i ∈ [m]. Also, as k j=1 G j (1)b j = y * ,

Formula formula_147: m i=1 1(F i (t) > 0) + k j=1 1(G j (t) > 0)

Formula formula_148: Θ * (m) := (w i , α i ) m i=1 | min (wi,αi) m i=1 L m i=1 (Xw i ) + α i , y + β 2 m i=1 (∥w i ∥ 2 2 + |α i | 2 ) ⊆ R (d+1)m ,

Formula formula_149: A min ∈ Θ * min (m) in Θ * (m).

Formula formula_150: C(t) = ( w 1 α 1 + tw 2 α 2 ∥w 1 α 1 + tw 2 α 2 ∥ 2 , ∥w 1 α 1 + tw 2 α 2 ∥ 2 s) ⊕ ( √ 1 -t w 2 α 2 ∥w 2 α 2 ∥ 2 , (1 -t)∥w 2 α 2 ∥ 2 s) ⊕ (w j , α j ) m j=3 ,

Formula formula_151: Θ * (m). i) C(t) is well-defined. First, we know ∥w 2 α 2 ∥ 2 ̸ = 0, because α 2 ̸ = 0. Also, say w 1 α 1 + (1 - t)w 2 α 2 = 0 for some t ∈ [0, 1]. Then we have DXw 1 α 1 = -(1 -t)DXw 2 α 2 , where D = diag(1(Xw 1 ≥ 0)).

Formula formula_152: DXw 1 α 1 = 0 must hold. This means w 1 = α 1 = 0 because A ∈ Θ * (m) -which is again contradiction because α 1 ̸ = 0. The well-definedness of C(t) implies that it is continuous, because it is a composition of continuous functions. ii) C(0) = A, C(1) = ( w1α1+w2α2 √ ∥w1α1+w2α2∥2 , ∥w 1 α 1 + w 2 α 2 ∥ 2 s) ⊕ (0, 0) ⊕ (w j , α j ) m j=3 from direct substitution. Note that the value m i=1 1(α i ̸ = 0) decreased by 1. iii) C(t) is a curve in Θ * (m)

Formula formula_153: P * can = (u i , v i ) P i=1 | (u i , v i ) P i=1 ∈ P * , diag(1(Xu i ≥ 0)) =D i if u i ̸ = 0, diag(1(Xv i ≥ 0)) = D i if v i ̸ = 0 . Remark D.2. diag(1(Xu ≥ 0)) = D i implies (2D i -I)Xu ≥ 0,

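Formula formula_154: $\mathcal{P}^*(m) \to \Theta^*(m)$ as $\Psi((u_i, v_i)_{i=1}^P) := \left( \frac{u_i}{\sqrt{\|u_i\|_2}}, \sqrt{\|u_i\|_2} \right)_{u_i \neq 0} \oplus \left( \frac{v_i}{\sqrt{\|v_i\|_2}}, -\sqrt{\|v_i\|_2} \right)_{v_i \neq 0} \oplus (0, 0)^{m - \mathrm{card}((u_i, v_i)_{i=1}^P)}$. Definition D.8. Suppose $m \ge m^*$. We define $\Phi : \Theta^*(m) \to \mathcal{P}^*(m)$ as $\Phi((w_i, \alpha_i)_{i=1}^m) = (u_i, v_i)_{i=1}^P$, where $u_p = \sum_{i \in I} w_i |\alpha_i|$ with $I = \{ i \mid \alpha_i > 0, \ D_p = \mathrm{diag}(\mathbb{1}(X w_i \ge 0)) \}$, and $v_q = \sum_{i \in I} w_i |\alpha_i|$ with $I = \{ i \mid \alpha_i < 0, \ D_q = \mathrm{diag}(\mathbb{1}(X w_i \ge 0)) \}$.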

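Formula formula_155: $L\left( \sum_{j=1}^m (X w_j)_+ \alpha_j, y \right) + \frac{\beta}{2} \sum_{j=1}^m \left( \|w_j\|_2^2 + \alpha_j^2 \right) = L\left( \sum_{i=1}^P D_i X(u_i - v_i), y \right) + \beta \sum_{i=1}^P \left( \|u_i\|_2 + \|v_i\|_2 \right)$,

Formula formula_156: $D_p = \mathrm{diag}(\mathbb{1}(X w_i \ge 0))$, $\|u_p\|_2 = \sum_{i \in I} \|w_i\|_2 |\alpha_i| = \frac{1}{2} \sum_{i \in I} \left( \|w_i\|_2^2 + \alpha_i^2 \right)$,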

Formula formula_157: $\Psi((u_i, v_i)_{i=1}^P) := \left( \frac{u_i}{\sqrt{\|u_i\|_2}}, \sqrt{\|u_i\|_2} \right)_{u_i \neq 0} \oplus \left( \frac{v_i}{\sqrt{\|v_i\|_2}}, -\sqrt{\|v_i\|_2} \right)_{v_i \neq 0} \oplus (0, 0)^{m - \mathrm{card}((u_i, v_i)_{i=1}^P)}$. Write $\Phi(\Psi((u_i, v_i)_{i=1}^P)) = (u'_i, v'_i)_{i=1}^P$. Let's see that $u'_i = u_i$ for all $i \in [P]$.

Formula formula_158: (Xu j ≥ 0)) = D i . As (u i , v i ) P i=1 ∈ P * can , diag(1(Xu j ≥ 0)) = D j = D i , meaning i = j.

Formula formula_159: u j ̸ = 0 that is diag(1(Xu j / ∥u j ∥ 2 ≥ 0)) = D i , meaning u ′ i = 0. The next case is when u i ̸ = 0. For u j ̸ = 0 such that diag(1(Xu j ≥ 0)) = D i , the only possible j = i. For that j, we know that diag(1(Xu i / ∥u i ∥ 2 ≥ 0)) = D i , and u ′ i = u i / ∥u i ∥ 2 × ∥u i ∥ 2 = u i . This means u ′ i = u i for all i ∈ [P ], same for v, meaning Φ(Ψ((u i , v i ) P i=1 )) = (u i , v i ) P i=1 . Let's see Ψ • Φ. We know Φ((w i , α i ) m i=1 ) = (u i , v i ) P i=1 := u p = w i |α i | if α i > 0 and D p = diag(1(Xw i ≥ 0)), 0 otherwise v q = w i |α i | if α i < 0 and D q = diag(1(Xw i ≥ 0)), 0 otherwise, because (w i , α i ) m i=1 is minimal. Let's say m i=1 1(α i > 0) = m p , m i=1 1(α i = 0) = m z , m i=1 1(α i < 0) = m n .

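Formula formula_160: $j_{i_1} \neq j_{i_2}$ if $i_1 \neq i_2$, because $j_{i_1} = j_{i_2}$ means $D_{a_{i_1}} = D_{a_{i_2}}$ and $a_{i_1} = a_{i_2}$, $i_1 = i_2$. Similarly, define $v_{b_1}, v_{b_2}, \dots, v_{b_{m_n}}$ and $v_{b_i} = w_{k_i} |\alpha_{k_i}|$. Then, $\Psi(\Phi((w_i, \alpha_i)_{i=1}^m)) = \left( \frac{w_{j_i} |\alpha_{j_i}|}{\sqrt{\| w_{j_i} |\alpha_{j_i}| \|_2}}, \sqrt{\| w_{j_i} |\alpha_{j_i}| \|_2} \right)_{i=1}^{m_p} \oplus \left( \frac{w_{k_i} |\alpha_{k_i}|}{\sqrt{\| w_{k_i} |\alpha_{k_i}| \|_2}}, -\sqrt{\| w_{k_i} |\alpha_{k_i}| \|_2} \right)_{i=1}^{m_n} \oplus (0, 0)^{m_z}$.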

Formula formula_161: Ψ(Φ((w i , α i ) m i=1 )) = (w ji , |α ji |) mp i=1 ⊕ (w ki , -|α ki |) mn i=1 ⊕ (0, 0) mz . As j i1 ̸ = j i2 if i 1 ̸ = i 2 , the result is a permutation of (w i , α i ) m i=1 .

Formula formula_162: (w k j , α k j ) m j=1 in Θ * (m) that converges to (w ∞ j , α ∞ j ) m j=1 ∈ Θ * (m). Let's write Φ((w k j , α k j ) m j=1 ) = (u k i , v k i ) P i=1 , Φ((w ∞ j , α ∞ j ) m j=1 ) = (u ∞ i , v ∞ i ) P i=1 . We will show that u k i → u ∞ i .

Formula formula_163: if k ≥ M j , 1(Xw ∞ j ≥ 0) = 1(Xw k j ≥ 0) and α k j α ∞ j > 0.

Formula formula_164: w k j . Also w ∞ j ̸ = 0 implies α ∞ j ̸ = 0, meaning for sufficiently large k, α k j α ∞ j > 0 holds. For j ∈ [m] that has w ∞ j = 0, define N j (ϵ) to be the number that satisfies k ≥ N j (ϵ) implies ∥w k j α k j ∥ 2 ≤ ϵ.

Formula formula_165: ∥u k i -u ∞ i ∥ 2 ≤ ϵ for all i ∈ [P ]. For a certain i ∈ [P ], suppose there exists {j 1 , j 2 , • • • j t } ⊆ [m] that satisfies D i = diag(1(Xw ∞ j1 ≥ 0)) = • • • = diag(1(Xw ∞ jt ≥ 0)) and α ∞ j1 , • • • , α ∞ jt > 0 (hence w ∞ j1 , • • • , w ∞ jt ̸ = 0). It is clear that u ∞ i = t i=1 w ∞ ji α ∞ ji . When k ≥ max{max w ∞ j =0 N j (ϵ/m), max w ∞ j ̸ =0 M j }, we know that 1(Xw k ji ≥ 0) = 1(Xw ∞ ji ≥ 0) and α k ji > 0 for i ∈ [t]. Also, for some j ∈ [m] which is not in {j 1 , j 2 , • • • , j t } and D i = diag(1(Xw k j ≥ 0)), w ∞ j = 0. Hence, u k i = t i=1 w k ji α k ji + w ∞ j =0,Di=diag(1(Xw k j ≥0)),α k j >0 w k j α k j . u k i → u ∞ i , as w k ji → w ∞ ji , α k ji → α ∞ ji for i ∈ [t]

Formula formula_166: D i = diag(1(Xw ∞ j ≥ 0)) and α ∞ j > 0. Here, u ∞ i = 0. Now take k ≥ max{max w ∞ j =0 N j (ϵ/m), max w ∞ j ̸ =0 M j }. One thing to notice is for this k, if D i = diag(1(Xw k j ≥ 0)) and α k j > 0 for some j ∈ [m], w ∞ j = 0. Suppose w ∞ j ̸ = 0. As k ≥ M j , we know that D i = diag(1(Xw ∞ j ≥ 0)

Formula formula_167: u k i = w ∞ j =0,Di=diag(1(Xw k j ≥0)),α k j >0 w k j α k j , as k ≥ N j (ϵ/m), ∥u k i ∥ 2 ≤ ϵ. As u ∞ i = 0, we have that u k i → u ∞ i .

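Formula formula_168: $\phi(t) = \left( \frac{u_i(t)}{\sqrt{\|u_i(t)\|_2}}, \sqrt{\|u_i(t)\|_2} \right)_{i \in I} \oplus \left( \frac{v_i(t)}{\sqrt{\|v_i(t)\|_2}}, -\sqrt{\|v_i(t)\|_2} \right)_{i \in J} \oplus (0, 0)^{m - p - q}$.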

Formula formula_169: u i (t) ̸ = 0 if u i (0) ̸ = 0, v i (t) ̸ = 0 if v i (0) ̸ = 0, ϕ is continuous for [0,1

Formula formula_170: A ′ = (w ′ j , α ′ j ) m j=1 ∈ Θ * (m) that satisfies m j=1 1(α ′ j ̸ = 0) < m. First, use Proposition D.7 to find a continuous path from A to some A min = (w • j , α • j ) ∈ Θ * min (m). If m j=1 1(α • j ̸ = 0) < m, we have found such path. If not, let's show that {(Xw • j ) + } m j=1 is linearly dependent. As all α • j ̸ = 0, all w • j ̸ = 0. Now think of Φ(A min ) = (u i , v i ) P i=1 . We can easily see that {(Xw • j ) + α • j } m j=1 = {D i Xu i } ui̸ =0 ∪ {-D i Xv i } vi̸ =0 .

Formula formula_171: • j ) + α • j } m j=1

Formula formula_172: $\sum_{i=1}^m c_i (X w^\bullet_i)_+ = 0$.

Formula formula_173: • 1 c 1 < 0. Define t m = min α • i ci<0 - α • i c i ,

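Formula formula_174: and for $t \in [0, t_m]$ define $\tilde{w}_i(t) = w^\bullet_i \sqrt{\frac{|\alpha^\bullet_i + t c_i|}{\|w^\bullet_i\|_2}}$, $\tilde{\alpha}_i(t) = \sqrt{\|w^\bullet_i\|_2 \, |\alpha^\bullet_i + t c_i|} \ \mathrm{sign}(\alpha^\bullet_i)$.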

Formula formula_175: • i + tc i ) = sign(α • i ) for t ∈ [0, t m ]. Also, m i=1 (X wi (t)) + αi (t) = m i=1 (Xw • i ) + (α • i + tc i ) = m i=1 (Xw • i ) + α • i ,and

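Formula formula_176: $\frac{1}{2} \sum_{i=1}^m \left( \|\tilde{w}_i(t)\|_2^2 + |\tilde{\alpha}_i(t)|^2 \right) = \sum_{i=1}^m \|\tilde{w}_i(t)\|_2 |\tilde{\alpha}_i(t)| = \sum_{i=1}^m \left( \|w^\bullet_i\|_2 |\alpha^\bullet_i| + \|w^\bullet_i\|_2 \, t c_i \, \mathrm{sign}(\alpha^\bullet_i) \right)$.

Formula formula_177: $(\nu^*)^T (X w^\bullet_j)_+ = -\beta \|w^\bullet_j\|_2 \, \mathrm{sign}(\alpha^\bullet_j)$, for all $j \in [m]$

Formula formula_178: $\sum_{i=1}^m c_i (X w^\bullet_i)_+ = 0$, multiplying both sides by $(\nu^*)^T$ leads to $\sum_{i=1}^m \|w^\bullet_i\|_2 \, t c_i \, \mathrm{sign}(\alpha^\bullet_i) = 0$, and

Formula formula_179: $\frac{1}{2} \sum_{i=1}^m \left( \|\tilde{w}_i(t)\|_2^2 + |\tilde{\alpha}_i(t)|^2 \right) = \sum_{i=1}^m \|w^\bullet_i\|_2 |\alpha^\bullet_i| = \frac{1}{2} \sum_{i=1}^m \left( \|w^\bullet_i\|_2^2 + |\alpha^\bullet_i|^2 \right)$.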
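The path construction in this part of the proof can be sanity-checked on a toy network. The sketch below assumes the rescaling $\tilde{w}_i(t) = w^\bullet_i \sqrt{|\alpha^\bullet_i + t c_i| / \|w^\bullet_i\|_2}$, $\tilde{\alpha}_i(t) = \sqrt{\|w^\bullet_i\|_2 |\alpha^\bullet_i + t c_i|}\,\mathrm{sign}(\alpha^\bullet_i)$ (the square roots are what the balanced-norm identity requires). The data, neurons, and coefficients `c` are hypothetical: the first two neurons are collinear so that $\sum_i c_i (X w_i)_+ = 0$ and $\sum_i \|w_i\|_2 c_i \,\mathrm{sign}(\alpha_i) = 0$ hold by construction rather than through a dual certificate.

```python
import math


def norm(w):
    return math.sqrt(sum(x * x for x in w))


def relu_feature(X, w):
    """(Xw)_+ as a list."""
    return [max(0.0, sum(a * b for a, b in zip(row, w))) for row in X]


def network_output(X, neurons):
    """sum_i (X w_i)_+ alpha_i."""
    total = [0.0] * len(X)
    for w, alpha in neurons:
        total = [s + f * alpha for s, f in zip(total, relu_feature(X, w))]
    return total


def half_squared_norm(neurons):
    """(1/2) sum_i (||w_i||^2 + alpha_i^2)."""
    return 0.5 * sum(norm(w) ** 2 + alpha ** 2 for w, alpha in neurons)


def rescale(neurons, c, t):
    """Neurons at time t along the path (assumed square-root form)."""
    out = []
    for (w, alpha), ci in zip(neurons, c):
        mag, n = abs(alpha + t * ci), norm(w)
        out.append(([x * math.sqrt(mag / n) for x in w],
                    math.sqrt(n * mag) * math.copysign(1.0, alpha)))
    return out


X = [[1.0, 0.0], [0.0, 1.0]]
neurons = [([1.0, 0.0], 1.0), ([2.0, 0.0], 2.0), ([0.0, 1.0], -1.0)]  # balanced: ||w_i|| = |alpha_i|
c = [2.0, -1.0, 0.0]  # 2 (Xw_1)_+ - (Xw_2)_+ = 0
t_m = min(-alpha / ci for (_, alpha), ci in zip(neurons, c) if alpha * ci < 0)

for t in [0.0, 0.5 * t_m, t_m]:
    moved = rescale(neurons, c, t)
    print(t, network_output(X, moved), half_squared_norm(moved))
```

In this toy instance the output and the regularization term stay fixed along the path, and the second neuron vanishes at $t = t_m$, so the number of active neurons drops by one.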

Formula formula_180: A to A ′ = (w ′ j , α ′ j ) m j=1 ∈ Θ * (m) where m j=1 1(α ′ j ̸ = 0) < m, we will find a path from A ′ to any permutation of A ′ , namely (w ′ σ(j) , α ′ σ(j) ) m j=1 for some permutation σ : [m] → [m].

Formula formula_181: $w'_{i_0} = w'_{\sigma(i_0)}$, we do nothing. If $w'_{i_0} \neq w'_{\sigma(i_0)}$, we first write $w'_{i_0}(t) = w'_{i_0}\sqrt{1 - t}$, $\alpha'_{i_0}(t) = \alpha'_{i_0}\sqrt{1 - t}$, $w'_m(t) = w'_{i_0}\sqrt{t}$, $\alpha'_m(t) = \alpha'_{i_0}\sqrt{t}$,

Formula formula_182: $w'_{i_0} = 0$. Next we move $w'_{\sigma(i_0)}$ to $w'_{i_0}$ with $w'_{i_0}(t) = w'_{\sigma(i_0)}\sqrt{t}$, $\alpha'_{i_0}(t) = \alpha'_{\sigma(i_0)}\sqrt{t}$, $w'_{\sigma(i_0)}(t) = w'_{\sigma(i_0)}\sqrt{1 - t}$, $\alpha'_{\sigma(i_0)}(t) = \alpha'_{\sigma(i_0)}\sqrt{1 - t}$,

Formula formula_183: $w'_{\sigma(i_0)}(t) = w'_{i_0}\sqrt{t}$, $\alpha'_{\sigma(i_0)}(t) = \alpha'_{i_0}\sqrt{t}$, $w'_m(t) = w'_{i_0}\sqrt{1 - t}$, $\alpha'_m(t) = \alpha'_{i_0}\sqrt{1 - t}$.
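Each of the three moves above splits one neuron between two slots with weights $\sqrt{1 - t}$ and $\sqrt{t}$. A minimal sketch of why such a move leaves both the output and the regularizer unchanged is given below; the helper names `split_neuron`, `contribution`, and `squared_norm`, and the sample data, are hypothetical.

```python
import math


def relu_feature(X, w):
    """(Xw)_+ as a list."""
    return [max(0.0, sum(a * b for a, b in zip(row, w))) for row in X]


def contribution(X, w, alpha):
    """(Xw)_+ alpha, the contribution of one neuron to the output."""
    return [f * alpha for f in relu_feature(X, w)]


def squared_norm(w, alpha):
    return sum(x * x for x in w) + alpha * alpha


def split_neuron(w, alpha, t):
    """Share (w, alpha) between its own slot and an empty slot at time t in [0, 1]."""
    a, b = math.sqrt(1.0 - t), math.sqrt(t)
    stay = ([x * a for x in w], alpha * a)
    moved = ([x * b for x in w], alpha * b)
    return stay, moved


X = [[1.0, 2.0], [-1.0, 0.5], [0.0, -3.0]]
w, alpha = [0.7, -0.2], -1.3

for t in [0.0, 0.25, 0.5, 1.0]:
    (w_a, al_a), (w_b, al_b) = split_neuron(w, alpha, t)
    total = [p + q for p, q in zip(contribution(X, w_a, al_a), contribution(X, w_b, al_b))]
    print(t, total, squared_norm(w_a, al_a) + squared_norm(w_b, al_b))
```

The invariance follows from positive homogeneity of the ReLU: the two contributions carry factors $(1 - t)$ and $t$, which sum to one, and the squared norms split in the same proportions.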

Formula formula_184: Φ(A min ) to Φ(B min ), namely f : [0, 1] → P * (m) satisfying f (0) = Φ(A min ), f (1) = Φ(B min ). Write f (t) = (u i (t), v i (t)) P i=1 . Divide [0, 1] to times (t 0 = 0, t 1 ), (t 1 , t 2 ) • • • (t k-1 , t k = 1)

Formula formula_185: Ψ • f (t i ), lim t→t - i Ψ • f (t), lim t→t + i Ψ • f (t) are all permutations of each other. We construct a path from Ψ • f (0) to Ψ • f (1) as following: First, for each p = 0, 1, • • • , k -1, construct a path from lim t→t + p Ψ • f (t) to lim t→t - p+1 Ψ • f (t) by defining g(t) =      lim t→t + p Ψ • f (t) if t = t p Ψ • f (t) if t ∈ (t p , t p+1 ) lim t→t - p+1 Ψ • f (t) if t = t p+1 , for t ∈ [t p , t p+1 ]. It is clear that g is continuous. Moreover, we can connect each Ψ • f (t p ) with lim t→t + p Ψ • f (t) and lim t→t - p Ψ • f (t)

Formula formula_186: $\mathrm{card}(A) = m$, $\Psi(A)$ is an isolated point in $\Theta^*(m)$. Proof. Assume the existence of a continuous function $f : [0, 1] \to \Theta^*(m)$ that satisfies $\Psi(A) = f(0)$, $\Phi \circ f(1) \neq A$. Consider the path $\Phi \circ f(t)$ in $\mathcal{P}^*(m)$. As $A \in \mathcal{P}^*(m) \cap \mathcal{P}^*_{\mathrm{can}}$, $\Phi(\Psi(A)) = A \neq \Phi \circ f(1)$, which contradicts the fact that $A$ is an isolated point in $\mathcal{P}^*(m)$. Hence, $\Psi(A)$ does not have a path into $\Theta^*(m) - \Phi^{-1}(A)$.

Formula formula_187: Ψ(A) to Θ * (m) -{Ψ(A)}, proving our claim. Suppose Φ((w i , α i ) m i=1 ) = A. If diag(1(Xw i ≥ 0)) = diag(1(Xw j ≥ 0)) = D p

Formula formula_188: w i , α i ) m i=1 )) < m, which is a contradiction. Hence, (w i , α i ) m i=1 ∈ Θ * min (m).

Formula formula_189: Ψ(Φ((w i , α i ) m i=1 )) = Ψ(A) is a permutation of (w i , α i ) m i=1 . This means Φ -1 (A) is contained in a set of permutation of Ψ(A), which is finite.

Formula formula_190: A ′ ̸ = A ∈ Θ * (m).

Formula formula_191: A ∈ Θ * min (m * ) if A ∈ Θ * (m * ).

Formula formula_192: $\{(x_{1i}, x_{2i}, y_i)\}_{i=1}^3 = \{(1, 0, 1/6), \ (-1/2, \sqrt{3}/2, 2/3), \ (-1/2, -\sqrt{3}/2, 1/6)\}$, $X = [x_1 \ x_2] \in \mathbb{R}^{3 \times 2}$.

Formula formula_193: $Q_X = \{ (Xu)_+ \mid \|u\|_2 \le 1 \}$.

Formula formula_194: (Xw i ) + α i = y, ∥w i ∥ 2 ≤ 1 ∀i ∈ [m].

Formula formula_195: $\nu^T X w_0 + \sum_{i=1}^m \nu^T (X w_i)_+ \alpha_i = \sum_{i=1}^m \nu^T (X w_i)_+ \alpha_i = \langle \nu, y \rangle$, and

Formula formula_196: $\langle \nu, y \rangle \le \sum_{i=1}^m |\nu^T (X w_i)_+| \, |\alpha_i| \le \sum_{i=1}^m |\alpha_i|$,

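Formula formula_197: $w_0 = -\frac{1}{3} \begin{bmatrix} 1 \\ 0 \end{bmatrix}$, $w_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 0 \end{bmatrix}$, $w_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} -1/2 \\ \sqrt{3}/2 \end{bmatrix}$, $\alpha_1 = \frac{1}{\sqrt{2}}$, $\alpha_2 = \frac{1}{\sqrt{2}}$,

Formula formula_198: $w_0 = -\frac{1}{3} \begin{bmatrix} -1/2 \\ -\sqrt{3}/2 \end{bmatrix}$, $w_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} -1/2 \\ -\sqrt{3}/2 \end{bmatrix}$, $w_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} -1/2 \\ \sqrt{3}/2 \end{bmatrix}$, $\alpha_1 = \frac{1}{\sqrt{2}}$, $\alpha_2 = \frac{1}{\sqrt{2}}$.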
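The two parameter sets above can be checked against the three data points $(1, 0, 1/6)$, $(-1/2, \sqrt{3}/2, 2/3)$, $(-1/2, -\sqrt{3}/2, 1/6)$. The sketch below assumes the model form $X w_0 + \sum_i (X w_i)_+ \alpha_i$ (a linear skip term plus two ReLU neurons) and measures the regularizer as $\sum_i \|w_i\|_2^2 + \alpha_i^2$ over the ReLU neurons only; the helper names are hypothetical.

```python
import math

S3 = math.sqrt(3.0)
R = 1.0 / math.sqrt(2.0)

X = [[1.0, 0.0], [-0.5, S3 / 2.0], [-0.5, -S3 / 2.0]]
y = [1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def predict(X, w0, neurons):
    """X w0 + sum_i (X w_i)_+ alpha_i, evaluated row by row."""
    return [dot(row, w0) + sum(max(0.0, dot(row, w)) * alpha for w, alpha in neurons)
            for row in X]


def reg_cost(neurons):
    """sum_i ||w_i||^2 + alpha_i^2 (skip weight w0 assumed unregularized)."""
    return sum(dot(w, w) + alpha * alpha for w, alpha in neurons)


# First parameter set.
sol_a = ([-1.0 / 3.0, 0.0],
         [([R, 0.0], R), ([-R / 2.0, R * S3 / 2.0], R)])
# Second parameter set.
sol_b = ([1.0 / 6.0, S3 / 6.0],
         [([-R / 2.0, -R * S3 / 2.0], R), ([-R / 2.0, R * S3 / 2.0], R)])

for name, (w0, neurons) in [("a", sol_a), ("b", sol_b)]:
    print(name, predict(X, w0, neurons), reg_cost(neurons))
```

Both sets reproduce $y$ and carry the same regularization cost under these assumptions, which is consistent with the text's claim that they are two distinct interpolators of equal cost.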

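Formula formula_199: $\frac{1}{2}\left( |a + c| + |a - c| - |b + c| - |b - c| \right)$,

Formula formula_200: $w_0 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$, $w_1 = \begin{bmatrix} 0 \\ \sqrt{2} \\ 0 \end{bmatrix}$, $w_2 = \begin{bmatrix} 0 \\ -\sqrt{2} \\ 0 \end{bmatrix}$, $\alpha_1 = -\sqrt{2}$, $\alpha_2 = -\sqrt{2}$,

Formula formula_201: $w_0 = \begin{bmatrix} 0 \\ 0 \\ -1 \end{bmatrix}$, $w_1 = \begin{bmatrix} \sqrt{2} \\ 0 \\ 0 \end{bmatrix}$, $w_2 = \begin{bmatrix} -\sqrt{2} \\ 0 \\ 0 \end{bmatrix}$, $\alpha_1 = \sqrt{2}$, $\alpha_2 = \sqrt{2}$.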
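As a sanity check on the two weight sets above, the sketch below evaluates both networks on inputs $(x, y, 1)$, assuming the model form $w_0^T [x, y, 1] + \sum_i (w_i^T [x, y, 1])_+ \alpha_i$. The training points are not shown in this excerpt, so the four test points on $|x| + |y| = 1$ are an assumption chosen only because the closed forms $1 - 2|y|$ and $-1 + 2|x|$ coincide there; the helper names are hypothetical.

```python
import math

R2 = math.sqrt(2.0)


def model(w0, neurons, x, y):
    """Skip term plus ReLU neurons on the augmented input [x, y, 1]."""
    feat = [x, y, 1.0]
    dot = lambda w: sum(a * b for a, b in zip(w, feat))
    return dot(w0) + sum(max(0.0, dot(w)) * alpha for w, alpha in neurons)


def reg_cost(neurons):
    """sum_i ||w_i||^2 + alpha_i^2 over the ReLU neurons."""
    return sum(sum(a * a for a in w) + alpha * alpha for w, alpha in neurons)


net_a = ([0.0, 0.0, 1.0], [([0.0, R2, 0.0], -R2), ([0.0, -R2, 0.0], -R2)])
net_b = ([0.0, 0.0, -1.0], [([R2, 0.0, 0.0], R2), ([-R2, 0.0, 0.0], R2)])

# Hypothetical evaluation points (not taken from the paper).
points = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
for x, y in points:
    print((x, y), model(*net_a, x, y), model(*net_b, x, y))
print(reg_cost(net_a[1]), reg_cost(net_b[1]))
```

Under these assumptions the two networks agree on the chosen points and have equal regularization cost, while differing elsewhere (for example at the origin).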

Formula formula_202: [Figure: two surface plots over $[-1, 1]^2$. (a) Interpolator $f(x, y) = 1 - 2(y)_+ - 2(-y)_+$. (b) Interpolator $f(x, y) = -1 + 2(x)_+ + 2(-x)_+$.]

Formula formula_203: ∥w i ∥ 2 2 + α 2 i , subject to Xu + m i=1 (Xw i ) + α i = y,

Formula formula_204: $v_n = \left[ \frac{\sqrt{3}}{2}, \frac{1}{2} \right]^T$, $\|s_k\|_2 = 1$, $s_k > 0$ $\forall k \in [n - 1]$, $s_n = [0, 1]^T$, $v_{i,2} > 0$ $\forall i \in [n]$.

Formula formula_205: $\sum_{i=1}^k v_{n-i+1} = s_k$. Now, choose $x_i = v_{i,1} / v_{i,2}$

Formula formula_206: (s i,1 x + s i,2 1) + ∀i ∈ [n], ((s n -v n ) 1 x + (s n -v n ) 2 1) + ,

Formula formula_207: ν i = v i,2 . Write X = [x|1] ∈ R n×2 . Then the NSB problem is written as min m,{wi,αi} m i=1 m i=1 ∥w i ∥ 2 2 + |α i | 2 subject to m i=1 ( Xw i ) + α i = y.

Formula formula_208: ∥u∥2≤1 |ν T ( Xu) + | = 1,(19)

Formula formula_209: s 1 , s 2 , • • • s n , s n -v n .

Formula formula_210: x 1 < x 2 < • • • < x n .

Formula formula_211: • • • n -1, x i-1 < - s n-i+1,2 s n-i+1,1 < x i .

Formula formula_212: ∥s n-i+1 ∥ 2 = 1 we know s n-i+1 • v i-1 = -1/2 • ∥v i-1 ∥ 2 2 < 0. Hence, s n-i+1,1 v i-1,1 + s n-i+1,2 v i-1,2 < 0, and as s n-i+1,1 , s n-i+1,2 , v i-1,2 > 0, we have s n-i+1,2 /s n-i+1,1 < -v i-1,1 /v i-1,2 = -x i-1 . Similarly, ∥s n-i+1 -v i ∥ 2 = 1, and as s n-i+1 • v i > 0, we have s n-i+1,2 /s n-i+1,1 > -v i,1 /v i,2 = -x i . This means for i = 2, 3, • • • n -1, x i-1 < x i , and x 1 < x 2 < • • • < x n-1 . At last, we have v n-1 • v n < 0 because ∥v n ∥ 2 = ∥v n + v n-1 ∥ 2 = 1, meaning x n-1 < 0, whereas x n = √ 3 > 0, meaning x 1 < x 2 < • • • < x n .

Formula formula_213: • • • , 0]), diag([0, 0, • • • , 0, 1]), diag([0, 0, • • • , 1, 1]), • • • diag([0, 1, • • • , 1, 1]), diag([1, 1, • • • , 1, 1]), • • • diag([1, 0, • • • , 0, 0]).

Formula formula_214: D 1 , D 2 , • • • D 2n . Solving equation 19 is equivalent to solving max (2Di-I) Xu≥0, ∥u∥2≤1 ν T D i Xu.

Formula formula_215: T D 2 X∥ 2 = ∥ν n [x n , 1]∥ 2 = ∥v n ∥ 2 = 1, ∥ν T D 3 X∥ 2 = ∥ν n [x n , 1] + ν n-1 [x n-1 , 1]∥ 2 = ∥v n + v n-1 ∥ 2 = 1, • • • , ∥ν T D n+1 X∥ 2 = ∥ n i=1 v i ∥ 2 = 1, ∥ν T D n+2 X∥ 2 = ∥ n-1 i=1 v i ∥ 2 = ∥[- √ 3 2 , 1 2 ]∥ 2 = 1. For D n+3 to D 2n , we can also see that ∥ν T D i X∥ 2 < 1. That is because ν T D n+k X = s n -s k-1 for k ≥ 2. ∥s n -s k-1 ∥ 2 = 2 -2s n • s k-1 , and as we know 1/2 = s n • s 1 < s n • s 2 < • • • s n • s n-1 (

Formula formula_216: ∥s n -s k-1 ∥ 2 < 1 for k = 3, 4, • • • n.

Formula formula_217: T D i Xu ≤ max i∈[2n] ∥ν T D i X∥ 2 = 1. The last thing to check is that s 1 , s 2 , • • • s n , s n -v n are actual solutions. For i = 2, 3, • • • n -1, we know that x 1 < x 2 < • • • < x i-1 < - s n-i+1,2 s n-i+1,1 < x i < x i+1 < • • • < x n ,

Formula formula_218: k i = [x i , 1] T , we have that k n • s n-i+1 > 0, • • • k i • s n-i+1 > 0, k i-1 • s n-i+1 < 0, • • • k 1 • • • s n-i+1 < 0 for all i = 2, 3, • • • n -1. Hence, for s 2 , s 3 , • • • s n-1 , (2D i+1 -I) Xs i ≥ 0.

Formula formula_219: ∥s 1 + v n-1 ∥ 2 = 1, meaning x 1 < x 2 < • • • < x n-1 < -s 1,2 /s 1,1 . Hence, k n-1 • s 1 < 0, • • • k 1 • s 1 < 0.

Formula formula_220: ∥s 1 ∥ 2 = 1, s 1 is a solution. At last, let's check that (2D n+2 -I) X(s n -v n ) ≥ 0. As v n = [ √ 3/2, 1/2] T , s n -v n = [- √ 3/2, 1/2] T , v n • (s n -v n ) < 0 and k n • (s n -v n ) < 0. For i ∈ [n -1], we know x i < 0. Hence, - √ 3x i + 1 > 0.

Formula formula_221: y = n+1 i=1 c i ( Xw i ) + ,(20)

Formula formula_222: min t subject to y ∈ tConv(Q X ∪ -Q X ),

Formula formula_223: Q X = {( Xu) + | ∥u∥ 2 ≤ 1} Pilanci &

Formula formula_224: min m,(zi,di) m i=1 m i=1 |d i |, y = m i=1 d i ( Xz i ) + , for some ∥z i ∥ 2 ≤ 1, i ∈ [m]. For any d i , z i that satisfies ∥z i ∥ 2 ≤ 1 and y = m i=1 d i ( Xz i ) + ,

Formula formula_225: ⟨ν, y⟩ = n+1 i=1 c i = m i=1 d i ν T ( Xz i ) + ≤ m i=1 |d i ν T ( Xz i ) + | ≤ m i=1 |d i |,

Formula formula_226: f ( X) = n+1 i=1 c i (a i x + b i ) + ,

Formula formula_227: {s 1 , s 2 , • • • s n , s n -v n } = {w 1 , w 2 , • • • w n+1 }. Note that s 1,1 , s 2,1 , • • • s n,1 ≥ 0 and s n,1 -v n,1 < 0.

Formula formula_228: $\sum_{i=1}^P A_i \theta_i + \sum_{i=1}^Q B_i s_i, y) + \beta \sum_{i=1}^P R_i(\theta_i)$. (21)

Formula formula_229: Assume $\theta_i, s_j \in \mathbb{R}^d$, $A_i, B_j \in \mathbb{R}^{n \times d}$, $R_i : V_i \to \mathbb{R}$ are norms, $C_i, D_j \subseteq \mathbb{R}^d$ are proper cones for $i \in [P]$, $j \in [Q]$, $C_i \cap V_i \neq \emptyset$ for $i \in [P]$

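Formula formula_230: $\mathcal{P}^*_{\mathrm{gen}} = \left\{ (c_i \bar{\theta}_i)_{i=1}^P \oplus (s_i)_{i=1}^Q \mid c_i \ge 0, \ \sum_{i=1}^P c_i A_i \bar{\theta}_i + \sum_{i=1}^Q B_i s_i \in C_y, \ \bar{\theta}_i \in \mathrm{Zer}(\mathcal{F}(S_i, A_i^T \nu^*, -\beta, \langle \cdot, \cdot \rangle)), \ \langle B_i^T \nu^*, s_i \rangle = 0, \ s_i \in D_i \right\}$, (22)

Formula formula_231: $\min_{u \in C_i \cap V_i} \langle A_i^T \nu, u \rangle + \beta R_i(u) = 0$, $\min_{s \in D_j} \langle B_j^T \nu, s \rangle = 0$, for all $i \in [P]$, $j \in [Q]$, $\mathcal{F}(S, v, -\beta, \langle \cdot, \cdot \rangle) = \{ u \mid u \in S, \ \langle v, u \rangle = -\beta \}$, $S_i = C_i \cap \{ u \mid R_i(u) \le 1 \}$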

Formula formula_232: * ) = (θ * i ) P i=1 ⊕ (s * i ) Q i=1 ∈ Θ * gen . We know that P i=1 A i θ * i + Q i=1 B i s * i ∈ C y , hence it satisfies the second condition for w * = P i=1 A i θ * i + Q i=1 B i s * i . Also, consider the convex optimization problem min w,θi∈Ci∩Vi,si∈Di L(w, y) + β P i=1 R(θ i ) subject to P i=1 A i θ i + Q i=1 B i s i = w, and its Lagrangian L(w, θ, s, ν) = L(w, y) -ν T w + P i=1 (⟨A T i ν, θ i ⟩ + βR i (θ i )) + Q i=1 ⟨B T i ν, s i ⟩.(23)

Formula formula_233: min u∈C i ∩V i ⟨A T i ν,u⟩+βRi(u)=0 min u∈D i ⟨B T i ν,u⟩=0 min w L(w, y) -ν T w = max min u∈C i ∩V i ⟨A T i ν,u⟩+βRi(u)=0 min u∈D i ⟨B T i ν,u⟩=0 -f * (ν),

Formula formula_234: If θ * i = 0, we can choose c i = 0 to find c i , θi ∈ Zer(F (S i , A T i ν * , -β)). If θ * i ̸ = 0, we know that R i (θ * i ) ̸ = 0, and the vector θ * i /R i (θ * i ) satisfies θ * i /R i (θ * i ) ∈ S i and ⟨A T i ν * , θ * i /R i (θ * i )⟩ = -β. Choose c i = R i (θ * i ), θi = θ * i /R i (θ * i ) to find c i , θi ∈ Zer(F (S i , A T i ν * , -β))

Formula formula_235: P i=1 A i θ * i + Q j=1 B j s * j ∈ C y and s * i ∈ D i , ⟨B T i ν * , s * i ⟩ = 0, choose c i = 0 when θ * i = 0, c i = R(θ * i ), θi = θ * i /R(θ * i )

Formula formula_236: ∈ C i ∩ V i and s ∈ D i . If θi ̸ = 0, we know that ⟨ν * , A i θi ⟩ = -β as θi ∈ F (S i , A T i ν * , -β). Moreover, θi is the solution to min u∈Ci∩Vi,Ri(u)≤1 ⟨A T i ν * , u⟩,

Formula formula_237: i (u) = 1. Using ⟨ν * , A i θi ⟩ = -β and ⟨B T i ν * , s i ⟩ = 0, we get ⟨ν * , w ′ ⟩ = ⟨ν * , P i=1 c i A i θi + Q i=1 B i s i ⟩ = -β θi̸ =0 c i ,

Formula formula_238: $\sum_{i=1}^P R_i(c_i \bar{\theta}_i) = \sum_{\bar{\theta}_i \neq 0} c_i$.

Formula formula_239: P i=1 A i θ i + Q i=1 B i s i , y) + β P i=1 R i (θ i ) = L(w ′ , y) + β θi̸ =0 c i = L(w ′ , y) -⟨ν * , w ′ ⟩.

Formula formula_240: $L(w', y) - \langle \nu^*, w' \rangle = \min_{\theta_i \in C_i \cap V_i, \, s_i \in D_i} L\left( \sum_{i=1}^P A_i \theta_i + \sum_{i=1}^Q B_i s_i, y \right) + \beta \sum_{i=1}^P R_i(\theta_i)$.

Formula formula_241: ′ = P i=1 A i θ ′ i + Q i=1 B i s ′ i , the point (w ′ , θ ′ , s ′ ) becomes a minimizer of L(w, θ, s, ν * ). Hence, each minimizer θ ′ i is a minimizer of the problem min⟨A T i ν * , u⟩ + βR i (u) subject to u ∈ C i ∩ V i , which means that βR i (θ ′ i ) = -⟨ν * , A i θ ′ i ⟩ for all i ∈ [P ], as ν * satisfies min u∈Ci∩Vi ⟨A T i ν * , u⟩ + βR i (u) = 0.

Formula formula_242: $\beta \sum_{i=1}^P R_i(\theta'_i) = -\langle \nu^*, w' \rangle$, and

Formula formula_243: min θi∈Ci∩Vi,si∈Di L( P i=1 A i θ i + Q i=1 B i s i , y) + β P i=1 R(θ i ) = L( P i=1 A i θ ′ i + Q i=1 B i s ′ i , y) + β P i=1 R(θ ′ i ) = L(w ′ , y) -⟨ν * , w ′ ⟩, meaning (θ, s) ∈ Θ * gen because L( P i=1 A i θ i + Q i=1 B i s i , y) + β P i=1 R(θ i ) = L(w ′ , y) -⟨ν * , w ′ ⟩.

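Formula formula_244: $\min_{u_i, v_i \in K_i} \sum_{i=1}^P \|u_i\|_2 + \|v_i\|_2$ subject to $\sum_{i=1}^P D_i X(u_i - v_i) = y$,

Formula formula_245: $\mathcal{P}^* := \left\{ (c_i \bar{u}_i, d_i \bar{v}_i)_{i=1}^P \mid c_i, d_i \ge 0 \ \forall i \in [P], \ \sum_{i=1}^P D_i X \bar{u}_i c_i - D_i X \bar{v}_i d_i = y \right\} \subseteq \mathbb{R}^{2dP}$, where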

Formula formula_246: D i Xu| ≤ ∥u∥ 2 ∀u ∈ K i , i ∈ [P ].

Formula formula_247: $S_i = K_i \cap \{ u \mid \|u\|_2 \le 1 \}$.

Formula formula_248: P * gen = (c i ūi , d i vi ) m i=1 | c i , d i ≥ 0, P i=1 D i X ūi c i -D i X vi d i = y, ūi ∈ Zer(F (S i , X T D i ν * , -1)), vi ∈ Zer(F (S i , -X T D i ν * , -1)) ,

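Formula formula_249: $\min_{u \in K_i} \langle X^T D_i \nu, u \rangle + \|u\|_2 = 0$, $\min_{u \in K_i} \langle -X^T D_i \nu, u \rangle + \|u\|_2 = 0$, (24)

Formula formula_250: $\min_{u_i, v_i \in K_i} \sum_{i=1}^P \|u_i\|_2 + \|v_i\|_2$ subject to $X u_0 + \sum_{i=1}^P D_i X(u_i - v_i) = y$,

Formula formula_251: $\mathcal{P}^* := \left\{ u_0 \oplus (c_i \bar{u}_i, d_i \bar{v}_i)_{i=1}^P \mid c_i, d_i \ge 0 \ \forall i \in [P], \ X u_0 + \sum_{i=1}^P D_i X \bar{u}_i c_i - D_i X \bar{v}_i d_i = y \right\} \subseteq \mathbb{R}^{2dP}$,

Formula formula_252: $\bar{u}_i = \begin{cases} \arg\min_{u \in S_i} \nu^{*T} D_i X u & \text{if } \min_{u \in S_i} \nu^{*T} D_i X u = -1 \\ 0 & \text{otherwise,} \end{cases}$ $\quad \bar{v}_i = \begin{cases} \arg\min_{v \in S_i} -\nu^{*T} D_i X v & \text{if } \min_{v \in S_i} -\nu^{*T} D_i X v = -1 \\ 0 & \text{otherwise.} \end{cases}$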

Formula formula_253: T D i Xu| ≤ ∥u∥ 2 ∀u ∈ K i , i ∈ [P ], X T ν = 0. Here, S i = K i ∩ {u | ∥u∥ 2 ≤ 1}. When we use block notation (ν * ) T = [(ν * 1 ) T (ν * 2 ) T • • • (ν * c ) T ], u T = [(u 1 ) T , (u 2 ) T , • • • , (u c ) T ] for ν * ∈ R nc , u ∈ R dc , we can see that (ν * ) T A i u = c j=1 (ν * j ) T D i Xu j = ⟨F l -1 nc (ν * ), D i XF l -1 dc (u)⟩ M ,

Formula formula_254: F l -1 dc (F(F l dc (K i ), A T i ν * , -β, ⟨, ⟩ M )) = K i ∩ {U | ⟨F l -1 nc (ν * ), D i XU ⟩ = -β} = F(K i , X T D i N * , -β, ⟨, ⟩ M ),

Formula formula_255: F l -1 dc ( θi ) ∈F(K i , X T D i N * , -β, ⟨, ⟩ M ).

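Formula formula_256: $\min_{\{w_i, z_i\}_{i=1}^m} \frac{1}{2} \left\| \sum_{i=1}^m (X w_i)_+ z_i^T - Y \right\|_2^2 + \frac{\beta}{2} \sum_{i=1}^m \left( \|w_i\|_2^2 + \|z_i\|_2^2 \right)$,

Formula formula_257: $w_i \in \mathbb{R}^{d \times 1}$, $z_i \in \mathbb{R}^{c \times 1}$ is given as $S = \left\{ (w_i, z_i)_{i=1}^m \mid \phi((w_i, z_i)_{i=1}^m) \in \mathcal{P}^*_{\mathrm{vec}}, \ R((w_i, z_i)_{i=1}^m) = \|\phi((w_i, z_i)_{i=1}^m)\|_{K_i, *}, \ \|w_i\|_2 = \|z_i\|_2 \ \forall i \in [m] \right\}$, where $\phi((w_i, z_i)_{i=1}^m) = (V_i)_{i=1}^P$ with $V_p = \begin{cases} 0 & \text{if } \nexists\, w_i \text{ s.t. } D_p = \mathrm{diag}(\mathbb{1}(X w_i \ge 0)) \\ \sum_{j=1}^{t_p} w_{a_j} z_{a_j}^T & \text{if } D_p = \mathrm{diag}(\mathbb{1}(X w_{a_j} \ge 0)) \text{ for } j \in [t_p], \end{cases}$ and $R((w_i, z_i)_{i=1}^m) = (R_i)_{i=1}^P$ with $R_p = \begin{cases} 0 & \text{if } \nexists\, w_i \text{ s.t. } D_p = \mathrm{diag}(\mathbb{1}(X w_i \ge 0)) \\ \sum_{j=1}^{t_p} \|w_{a_j}\|_2 \|z_{a_j}\|_2 & \text{if } D_p = \mathrm{diag}(\mathbb{1}(X w_{a_j} \ge 0)) \text{ for } j \in [t_p]. \end{cases}$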

Formula formula_258: * i , z * i ) m i=1 in Θ * . When ϕ((w * i , z * i ) m i=1 ) = (V * i ) P i=1 , we know that m i=1 (Xw * i ) + (z * i ) T = P i=1 D i XV * i ,

Formula formula_259: $\sum_{i=1}^P \|V^*_i\|_{K_i, *} \le \sum_{i=1}^m \|w^*_i\|_2 \|z^*_i\|_2 = \frac{1}{2} \sum_{i=1}^m \left( \|w^*_i\|_2^2 + \|z^*_i\|_2^2 \right)$

Formula formula_260: $L_{\mathrm{noncvx}}((w^*_i, z^*_i)_{i=1}^m) \ge L_{\mathrm{cvx}}(\phi((w^*_i, z^*_i)_{i=1}^m))$, (27)

Formula formula_261: Θ * k-1,k (Y ′ ,W ′ 1 , W ′ 2 , • • • , W ′ k-2 , W ′ k+1 , • • • W ′ L ) := θ = (W ′ i ) k-2 i=1 ⊕ (W k-1 , W k ) ⊕ (W ′ i ) L i=k+1 | θ ∈ Θ * , ( XW k-1 ) + W k = Y ′ .

Formula formula_262: + W ′ 2 ) + ) • • • W ′ k-2 ) + . The expression of Θ * k-1,k (Y ′ , W ′ 1 , W ′ 2 , • • • , W ′ k-2 , W ′ k+1 , • • • W ′ L ) is given as θ =(W ′ i ) k-2 i=1 ⊕ (W k-1 , W k ) ⊕ (W ′ i ) L i=k+1 | θ ∈ Θ * , ϕ d k (W k-1 , W k ) ∈ P * vec,intp , R d k (W k-1 , W k ) = ∥ϕ d k (W k-1 , W k )∥ Ki, * , ∥(W k-1 ) •,i ∥ 2 = ∥(W k ) i,• ∥ 2 ∀i ∈ [d k ] ,

Formula formula_263: P * vec,intp = (c i Vi ) P i=1 | c i ≥ 0, P i=1 c i D i X Vi = Y ′ , Vi ∈ Zer(F(K i , X T D i N * , -1, ⟨, ⟩ M )) ,

Formula formula_264: d k i=1 ∥u i ∥ 2 2 + ∥v i ∥ 2 2 ,

Formula formula_265: d k i=1 (Xu i ) + v T i = Y ′ ,

Formula formula_266: u i ∈ R d k-1 ×1 , v i ∈ R d k+1 ×1

Formula formula_267: $(X w_i)_+ z_i^T - Y \|_2^2 + \frac{\beta}{2} \sum_{i=1}^m \left( \|w_i\|_2^2 + \|z_i\|_2^2 \right)$, (28)

Formula formula_269: $\beta \left( \sum_{i=1}^{m_1} c_i + \sum_{i=1}^{m_2} d_i \right) = -\langle N^*, Y^* \rangle$.

Formula formula_270: W 1 = [v 1 / ∥v 1 ∥ 2 | • • • |v m1 / ∥v m1 ∥ 2 ], w 2 = [ ∥v 1 ∥ 2 , • • • , ∥v m1 ∥ 2 ] T .

Formula formula_271: K v (D i (m 1 ), s, D ′ j ) = (v k ) m1 k=1 | (2D ik -I)s k Xv k ≥ 0, ∀k ∈ [m 1 ], (2D ′ j -I) m1 k=1 D ik Xv k ≥ 0 ,

Formula formula_272: Q X = ( P 1 m 1 ) i=1 s∈{-1,1} m 1 P2(i) j=1 m1 k=1 D ′ j D ik Xv k | (v k ) m1 k=1 ∈K v (D i (m 1 ), s, D ′ j ), m1 k=1 ∥v k ∥ 2 ≤ 1 .

Formula formula_273: (v k ) m1 k=1 ∈ K v (D i (m 1 ), s, D ′ j ), m1 k=1 ∥v k ∥ 2 ≤ 1,

