Here we provide complete proofs for the results in the main paper, as well as additional empirical results.

\section{Gradient Flow - Assumptions}
When gradient flow is implemented on non-differentiable functions (e.g., ReLU) the implementation can choose from among a set of possible sub-differentials. Here we define which of these our analysis will use. This choice corresponds to the common way gradient  methods are implemented in practice for the ReLU function.

Recall that the gradient flow step is  $\frac{d\mtht{t}}{dt} \in -\partial^{\circ}L(\mtht{t})$ for a.e. $t$, where:
\begin{equation}
\label{eq:clarke_supp}
\partial^{\circ} f(\vx) = \text{conv}\group{\lim_{k \rightarrow \infty} \nabla f(\vx_k)}{\vx_k \rightarrow \vx \text{ and $f$ is differentiable at } \vx_k }
\end{equation}
is Clarke's subdifferential.

As discussed in the main text, we will assume that the gradient flow step selects a specific vector in the subdifferential. This is done by setting the subgradient of ReLU at $0$ in advance to a constant value $a \in [0,1]$. Namely, this value of the subgradient is used for all neurons and in all iterations. Usually $a$ is set to be either $0$ or $1$.

Formally, for each $i \in [r]$ we denote $\frac{d \wvec{t}_i}{d t} = \frac{1}{m}\sum_{\vx \in \sS} \frac{d \wvec{t}_i(\vx)}{d t}$. Here, $\frac{d \wvec{t}_i(\vx)}{d t}$ is the gradient update of $\wvec{t}_i$ restricted to the summand that depends on $\vx$ in $L$ (\eqref{eq:loss} of the main paper). Similarly, we denote $\frac{d \bscal{t}_i}{d t} = \frac{1}{m}\sum_{\vx \in \sS} \frac{d \bscal{t}_i(\vx)}{d t}$. For our result, we need the following technical assumption.


\begin{ass}
\label{ass:flow_ass}
There exists an $a \in [0,1]$ such that for every time $t >0$, every neuron $i \in [r]$, and every sample $\vx \in \sS$, if $\wvec{t}_i \cdot \vx + \bscal{t}_i = 0$ then $\frac{d \wvec{t}_i(\vx)}{d t} = a y \ell'\left(yN(\vx;\mtht{t})\right) \vx$ and  $\frac{d \bscal{t}_i(\vx)}{d t} = a y \ell'\left(yN(\vx;\mtht{t})\right)$.
\end{ass}

\section{Proof of Theorem \ref{thm:non_converge_to_memorization}}

We first introduce some notation and recall the KKT conditions in our context. Let $\partial^{\circ}\sigma(\vw_i \cdot \vx_l +b_i) \subseteq \mathbb{R}^{D+1}$ be the subdifferential of neuron $i$ given input $\vx_l$. It holds that:

\[\partial^{\circ}\sigma(\vw_i \cdot \vx_l +b_i)=
\begin{cases} 
\{(\vx_l, 1)\}  & \mbox{if } \vw_i \cdot \vx_l +b_i > 0 \\
\{(\mathbf{0}, 0)\}  & \mbox{if } \vw_i \cdot \vx_l +b_i < 0 \\
\group{(\alpha \vx_l, \alpha)}{\alpha \in [0, 1]} & \mbox{if } \vw_i \cdot \vx_l +b_i = 0
\end{cases}
\]

{\bf KKT conditions: } A feasible point $\mth = (\mW, \vb, c)$ of the min norm problem (\eqref{eq:maxmargin} of the main paper) is a KKT point if there exist $\lambda_1, \ldots, \lambda_{m} \ge 0$ such that: 
\begin{enumerate}
    \item \textbf{Stationarity}: 
\be
\label{eq:KKT_conditions}
 \forall i \in [r], \: \: \vw_i = \sum\limits_{l \in [m]} \lambda_l y_l \vh_{il} \text{ and } b_i = \sum_{l \in [m]} \lambda_l y_l g_{il}
\ee
and
\be
 c = \sum_{j \in [m]} \lambda_j y_j
\ee

where $(\vh_{il},g_{il}) \in \partial^{\circ}\sigma(\vw_i \cdot \vx_l +b_i)$.
\item \textbf{Complementary slackness}: for every $l \in [m]$, if $y_l N(\vx_l;\mth)> 1$, then $\lambda_l=0$. 
\end{enumerate}

We now proceed to prove the theorem. In the first part, we show that by Assumption \ref{ass:flow_ass}, we can restrict the possible values of $\vh_{il},g_{il}$ for a KKT point that GF converges to. In the second part, we prove properties of neurons that memorize samples. In the third part, we use the previous parts to show that memorizing solutions cannot be KKT points.

{\bf \underline{Part 1:}} In this part we prove the following lemma.

\begin{lem}
Assume that Assumption \ref{ass:flow_ass} holds and GF converges to a KKT point with parameters $\vh_{il}$ and $g_{il}$ for $1 \le i \le r$ and $1 \le l \le m$, as defined in \eqref{eq:KKT_conditions}. Then, $(\vh_{il}, g_{il}) \in \{(\vx_l, 1),  (a \vx_l, a), (\mathbf{0}, 0)\}$.
\end{lem}
\begin{proof}

For each $1 \le l \le m$, $1 \le i \le r$ and $t > 0$ let $\left(\vh^{(t)}_{il}, g^{(t)}_{il}\right) \in \partial^{\circ}\sigma\left(\vw_i^{(t)} \cdot \vx_l +b_i^{(t)}\right)$ be the element of the subdifferential $\partial^{\circ}\sigma\left(\vw_i^{(t)} \cdot \vx_l +b_i^{(t)}\right)$ that the GF step uses at time $t$.
By inspecting the proofs of \citet{lyu2019gradient_sup} and \citet{dutta2013approximate_sup}, we see that $(\vh_{il}, g_{il})$ is equal to the limit of a convergent subsequence of $\left\{\left(\vh^{(t_j)}_{il}, g^{(t_j)}_{il}\right)\right\}_{j=0}^\infty$. 

By Assumption \ref{ass:flow_ass}, we know that for each $j$, $\left(\vh^{(t_j)}_{il}, g^{(t_j)}_{il}\right) \in \{(\vx_l, 1),  (a \vx_l, a), (\mathbf{0}, 0)\}$. Since this set is finite, the limit also satisfies $(\vh_{il}, g_{il}) \in \{(\vx_l, 1),  (a \vx_l, a), (\mathbf{0}, 0)\}$.

\end{proof}

{\bf \underline{Part 2:}}
We will first need the following definition.

\begin{defn}
\label{def:one_Hamming}
Given a sample $\hat{\vx} \in \gX$ and index $j \in [D]$, the sample with Hamming distance one from $\hat{\vx}$ at index $j$ is defined as $\gH(\hat{\vx}, j) \in \gX$ and satisfies the following:
\be 
\gH(\hat{\vx}, j)_j = -\hat{x}_j \text{ and } \forall j' \in [D] \backslash \{j\} \: \: \gH(\hat{\vx}, j)_{j'} = \hat{x}_{j'}
\ee
The set of all samples with  Hamming distance one from $\hat{\vx}$ is defined as: $\Psi(\hat{\vx}) = \group{\vx' \in \gX}{\exists j \in [D] \: \: \gH(\hat{\vx}, j) = \vx'}$. Note that $|\Psi(\hat{\vx})| = D$.
\end{defn}
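
As a small illustration (our own choice of numbers, not taken from the main paper), assume $\gX = \{\pm 1\}^D$ with $D = 3$ and take $\hat{\vx} = (1, -1, 1)$. Flipping one coordinate at a time gives
\[
\gH(\hat{\vx}, 1) = (-1, -1, 1), \qquad \gH(\hat{\vx}, 2) = (1, 1, 1), \qquad \gH(\hat{\vx}, 3) = (1, -1, -1),
\]
so $\Psi(\hat{\vx})$ consists of exactly these three points and $|\Psi(\hat{\vx})| = 3 = D$.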

Using this definition, we rephrase Lemma \ref{lem:main_paper_memorization_properties} of the main text and prove these properties of memorizing neurons (Definition \ref{def:memorizing_neuron} of the main paper):

\begin{lem}
\label{lem:memorization_properties}
Let $D > 2$. If a neuron $i \in [r]$ memorizes a sample $\hat{\vx} \in \sS_x$, then it satisfies the following properties:
\begin{enumerate}
    \item $\hat{x}_j = \sign(w_{ij})$ for all $1 \le j \le D$.
    \item For $\vx \in \gX$ if $\vw_i \cdot \vx + b_i = 0$ then $\vx \in \Psi(\hat{\vx})$.
    \item $b_i < 0$
\end{enumerate}
\end{lem}

\begin{proof}
{\bf Property 1:} Assume by contradiction that there exists $j \in [D]$ such that $\sign(w_{ij}) \neq \hat{x}_j$. Then $\vw_i \cdot  \gH(\hat{\vx}, j) + b_i \ge \vw_i \cdot \hat{\vx} + b_i > 0$, in contradiction to the memorization assumption in \eqref{eq:memorization_condition} of the main paper.

{\bf Property 2:} Assume by contradiction that there exists $\vx \in \gX \backslash \left( \Psi(\hat{\vx}) \cup \{\hat{\vx}\}\right)$ such that $\vw_i \cdot \vx + b_i = 0$. We define $J = \group{j \in [D]}{x_j = - \hat{x}_j}$ and let $j'$ be an arbitrary index in $J$. By Property 1,  the sample $\widetilde{\vx} = \gH(\hat{\vx}, j')$ satisfies the following:
\begin{align*}
    \vw_i \cdot \vx - \vw_i \cdot \widetilde{\vx} &  = \sum\limits_{j \in [D] \backslash J} w_{ij}  x_j + \sum\limits_{j \in J} w_{ij}  x_j - \sum\limits_{j \in [D] \backslash \{ j'\}} w_{ij}  \widetilde{x}_j -  w_{ij'}  \widetilde{x}_{j'} \\
    & = \sum\limits_{j \in [D] \backslash J} w_{ij}  \hat{x}_j - \sum\limits_{j \in J} w_{ij}  \hat{x}_j - \sum\limits_{j \in [D] \backslash \{ j'\}} w_{ij}  \hat{x}_j +  w_{ij'}  \hat{x}_{j'} \numberthis \\
    & = \sum\limits_{j \in [D] \backslash J} |w_{ij}| - \sum\limits_{j \in J} |w_{ij}|  - \sum\limits_{j \in [D] \backslash \{ j'\}} |w_{ij}|  + |w_{ij'}| = - 2 \sum\limits_{j \in J \backslash \{ j'\}} |w_{ij}|
\end{align*} 
Since $\vx \notin \Psi(\hat{\vx})$, it holds that $J \backslash \{ j'\} \neq \emptyset$. Furthermore, by Property 1, $\forall j \in J \backslash \{ j'\} \: \: w_{ij} \neq 0$. Thus, $\sum\limits_{j \in J \backslash \{ j'\}} |w_{ij}| > 0$ which implies that $\vw_i \cdot \vx < \vw_i \cdot \widetilde{\vx}$. We know that $\vw_i \cdot \vx + b_i = 0$, therefore $\vw_i \cdot \widetilde{\vx} + b_i > 0$. This contradicts the memorization assumption in \eqref{eq:memorization_condition} of the main paper.

{\bf Property 3:} Assume by contradiction that $b_i \ge 0$. We define $j' = \argmin_{j \in [D]} \{|w_{ij}|\}$. For the sample $\widetilde{\vx} = \gH(\hat{\vx}, j')$, the following holds by Property 1:
\begin{equation}
    \vw_i \cdot \widetilde{\vx} = \sum\limits_{j \in [D] \backslash \{j'\}} w_{ij} \hat{x}_j - w_{ij'} \hat{x}_{j'} = \sum\limits_{j \in [D] \backslash \{j'\}} |w_{ij}| - |w_{ij'}| > 0 
\end{equation}
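Since $D > 2$ and $j'$ minimizes $|w_{ij}|$, the sum $\sum_{j \in [D] \backslash \{j'\}} |w_{ij}|$ contains at least two terms, each of which is at least $|w_{ij'}|$ and nonzero by Property 1. Hence $\sum\limits_{j \in [D] \backslash \{j'\}} |w_{ij}| - |w_{ij'}| > 0$. Together with $b_i \ge 0$, this gives $\vw_i \cdot \widetilde{\vx} + b_i > 0$, which contradicts the memorization assumption in \eqref{eq:memorization_condition} of the main paper.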
\end{proof}
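
To make the three properties concrete, here is a toy neuron of our own choosing (not from the main paper). We assume $\gX = \{\pm 1\}^3$ and read the memorization condition as it is used in the proof above: the neuron is active on $\hat{\vx}$ and has nonpositive pre-activation on every other point of $\gX$. Take $\vw_i = (1, 1, 1)$, $b_i = -1$ and $\hat{\vx} = (1, 1, 1)$. Then
\[
\vw_i \cdot \hat{\vx} + b_i = 2 > 0, \qquad
\vw_i \cdot \vx + b_i =
\begin{cases}
0 & \mbox{if } \vx \in \Psi(\hat{\vx}) \\
-2 & \mbox{if } \vx \mbox{ differs from } \hat{\vx} \mbox{ in two coordinates} \\
-4 & \mbox{if } \vx = -\hat{\vx}
\end{cases}
\]
so the neuron memorizes $\hat{\vx}$ under this reading. All three properties can be read off directly: $\hat{x}_j = \sign(w_{ij}) = 1$ for every $j$, the pre-activation vanishes only on $\Psi(\hat{\vx})$, and $b_i = -1 < 0$.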

\textbf{\underline{Part 3:}} Now, we can proceed to prove Theorem \ref{thm:non_converge_to_memorization}.

Consider a network with parameters $\bar{\mth} = (\bar{\mW}, \bar{\vb}, \bar{c})$, neuron $i \in [r]$ and a sample $(\hat{\vx}, \hat{y}) \in \sS$ such that \eqref{eq:memorization_condition} of the main paper holds (i.e., the neuron $i$  memorizes the sample $\hat{\vx}$). We assume by contradiction that there exists an initialization $\mtht{0}$ such that if we run gradient flow from $\mtht{0}$ using $\mu$ then the weights $\mtht{t}$ will converge to $\bar{\mth}$. According to the results of \citet{lyu2019gradient_sup,ji2020directional_sup}, we know that there exists $\alpha >0$ such that $\mth = (\mW, \vb, c) = \alpha \bar{\mth}$ is a KKT point of \eqref{eq:maxmargin} of the main paper and \eqref{eq:KKT_conditions} holds for $\mth$. Note that for $\mth$, neuron $i$ memorizes the sample $(\hat{\vx}, \hat{y})$ as well. 

Given a sample $\vx_l \in \sS_x$ we can see that the following holds:
\begin{enumerate}
    \item If $\vx_l = \hat{\vx}$ then $(\vh_{il}, g_{il}) = (\hat{\vx}, 1)$.
    \item If $\vx_l \in \Psi(\hat{\vx})$ then $(\vh_{il}, g_{il}) \in \{(a \vx_l, a), (\mathbf{0}, 0)\}$, where $a \in [0,1]$ is the constant from Assumption \ref{ass:flow_ass}.
    \item If $\vx_l \notin \Psi(\hat{\vx}) \cup \{\hat{\vx}\}$ then $(\vh_{il}, g_{il}) = (\mathbf{0}, 0)$ by Property 2 of Lemma \ref{lem:memorization_properties}.
\end{enumerate}

We will show that for every $\lambda_1, \ldots, \lambda_{|\sS|} \ge 0$,  \eqref{eq:KKT_conditions} does not hold. We can assume without loss of generality that all samples $\vx_l \in \Psi(\hat{\vx})\cap \sS_x$ are support vectors with $(\vh_{il}, g_{il}) = (\vx_l, 1)$. Indeed, if one of these samples is not a support vector then we can take $\lambda_l = 0$. Furthermore, if $(\vh_{il}, g_{il}) = (\mathbf{0}, 0)$ then we can take $\lambda_l = 0$, and if $(\vh_{il}, g_{il}) = (a\vx_l, a)$ for $a > 0$ we can use $a \lambda_l$ in place of $\lambda_l$. Under this assumption we can write \eqref{eq:KKT_conditions} for $\mth$ using only $\lambda_1, \ldots, \lambda_{D}$ and $\hat{\lambda}$ that correspond to the samples of $\Psi(\hat{\vx})$ and $\hat{\vx}$, respectively:
\be
\label{eq:KKT_conditions_restrict}
\vw_i = \sum\limits_{\vx_l \in \Psi(\hat{\vx}) \cap \sS_x} \lambda_l y_l \vx_l  +  \hat{\lambda} \hat{y} \hat{\vx}\: \: \text{ and } \: \: b_i = \sum_{ \vx_l \in \Psi(\hat{\vx}) \cap \sS_x} \lambda_l y_l +  \hat{\lambda} \hat{y}
\ee

Assume by contradiction that there exists $\vx_{l'} =  \gH(\hat{\vx}, j')$ such that $\vx_{l'} \in \Psi(\hat{\vx}) \backslash \sS_x$. The following holds by \eqref{eq:KKT_conditions_restrict} and Definition \ref{def:one_Hamming}:
\begin{align*}
    & (1) \: \:  w_{ij'}  = \hat{\lambda} \hat{y} \hat{x}_{j'} + \sum\limits_{\vx_l \in \Psi(\hat{\vx}) \cap \sS_x } \lambda_l y_l \hat{x}_{j'} \\
    & (2) \: \:  \: b_i = \hat{\lambda} \hat{y} + \sum\limits_{\vx_l \in \Psi(\hat{\vx}) \cap \sS_x} \lambda_l y_l \numberthis
\end{align*} 
Using the first property in Lemma \ref{lem:memorization_properties}:
\begin{align*}
    & (1) \: \:  w_{ij'}  = \hat{\lambda} \hat{y} \sign(w_{ij'}) + \sum\limits_{\vx_l \in \Psi(\hat{\vx}) \cap \sS_x} \lambda_l y_l \sign(w_{ij'}) \\
    & (2) \: \:  \: b_i = \hat{\lambda} \hat{y} + \sum\limits_{\vx_l \in \Psi(\hat{\vx}) \cap \sS_x} \lambda_l y_l \numberthis
\end{align*} 
Therefore, 
\begin{align*}
    & (1) \: \:  |w_{ij'}|  = \hat{\lambda} \hat{y} + \sum\limits_{\vx_l \in \Psi(\hat{\vx}) \cap \sS_x} \lambda_l y_l  \\
    & (2) \: \:  \: b_i = \hat{\lambda} \hat{y} + \sum\limits_{\vx_l \in \Psi(\hat{\vx}) \cap \sS_x} \lambda_l y_l \numberthis
\end{align*}
This means that $0 \le |w_{ij'}| = b_i $, which is in contradiction to the third property of Lemma \ref{lem:memorization_properties}. Therefore, we can assume from now on that $\Psi(\hat{\vx}) \subseteq \sS_x$, and we can write the KKT conditions as follows:
\be
\label{eq:KKT_conditions_restrict_2}
\vw_i = \sum\limits_{\vx_l \in \Psi(\hat{\vx})} \lambda_l y_l \vx_l  +  \hat{\lambda} \hat{y} \hat{\vx}\: \: \text{ and } \: \: b_i = \sum_{ \vx_l \in \Psi(\hat{\vx})} \lambda_l y_l +  \hat{\lambda} \hat{y}
\ee

Given $\vx_{l'} = \gH(\hat{\vx}, j')$, the following holds by \eqref{eq:KKT_conditions_restrict_2} and Definition \ref{def:one_Hamming}:
\begin{align*}
    & (1) \: \:  w_{ij'}  = \hat{\lambda} \hat{y} \hat{x}_{j'} + \sum\limits_{\vx_l \in \Psi(\hat{\vx}) \backslash \{ \vx_{l'}\}} \lambda_l y_l \hat{x}_{j'} - \lambda_{l'} y_{l'} \hat{x}_{j'} \\
    & (2) \: \:  \: b_i = \hat{\lambda} \hat{y} + \sum\limits_{\vx_l \in \Psi(\hat{\vx})} \lambda_l y_l \numberthis
\end{align*} 
Using the first property in Lemma \ref{lem:memorization_properties}:
\begin{align*}
    & (1) \: \:  w_{ij'}  = \hat{\lambda} \hat{y} \sign(w_{ij'}) + \sum\limits_{\vx_l \in \Psi(\hat{\vx}) \backslash \{ \vx_{l'}\}} \lambda_l y_l \sign(w_{ij'}) - \lambda_{l'} y_{l'} \sign(w_{ij'}) \\
    & (2) \: \:  \: b_i = \hat{\lambda} \hat{y} + \sum\limits_{\vx_l \in \Psi(\hat{\vx})} \lambda_l y_l \numberthis
\end{align*} 
Therefore, 
\begin{align*}
\label{eq:index_KKT_cond}
    & (1) \: \:  |w_{ij'}|  = \hat{\lambda} \hat{y}  + \sum\limits_{\vx_l \in \Psi(\hat{\vx}) \backslash \{ \vx_{l'}\}} \lambda_l y_l - \lambda_{l'} y_{l'}  \\
    & (2) \: \:  \: b_i = \hat{\lambda} \hat{y} + \sum\limits_{\vx_l \in \Psi(\hat{\vx})} \lambda_l y_l \numberthis
\end{align*}
The result of subtracting $(2) - (1)$ is:
\begin{equation}
    \label{eq:lambda_form}
    b_i - |w_{ij'}| = 2 \lambda_{l'} y_{l'}
\end{equation}
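
Spelled out, the subtraction reads as follows, where the sum in $(2)$ is split into the terms with $\vx_l \neq \vx_{l'}$ and the single term for $\vx_{l'}$:
\begin{align*}
b_i - |w_{ij'}| & = \Biggl( \hat{\lambda} \hat{y} + \sum\limits_{\vx_l \in \Psi(\hat{\vx}) \backslash \{ \vx_{l'}\}} \lambda_l y_l + \lambda_{l'} y_{l'} \Biggr) - \Biggl( \hat{\lambda} \hat{y} + \sum\limits_{\vx_l \in \Psi(\hat{\vx}) \backslash \{ \vx_{l'}\}} \lambda_l y_l - \lambda_{l'} y_{l'} \Biggr) \\
& = 2 \lambda_{l'} y_{l'}
\end{align*}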

Next we show that it must hold that $y_{l'} = -1$. To see this, note that the third property of Lemma \ref{lem:memorization_properties} implies $b_i < 0$ and therefore $b_i - |w_{ij'}| < 0$. Assuming by contradiction that $y_{l'} = 1$,  the RHS of \eqref{eq:lambda_form} satisfies $0 \le 2 \lambda_{l'} y_{l'}$, while the LHS satisfies $b_i - |w_{ij'}| < 0$. This is a contradiction, so $y_{l'} = -1$. Since $j'$ was arbitrary, $\Psi(\hat{\vx})$ contains only negative samples.

Next we argue that $\hat{y} = -1$. To see this, assume by contradiction that $\hat{y} = 1$. Then there exists an $n \in [K]$ such that $\hat{\vx}$ satisfies the term $t^*_n$. Since $K \ge 2$, there exists $j \in [D] \backslash \sA_n$. Then $\gH(\hat{\vx}, j) \in \Psi(\hat{\vx})$ is a positive sample, in contradiction to the fact that $\Psi(\hat{\vx})$ contains only negative samples. Therefore $\hat{y} = -1$.

By \eqref{eq:lambda_form} and $y_l = -1$, for every $j \in [D]$ the sample $\vx_{l} = \gH(\hat{\vx}, j)$ satisfies $\lambda_{l} = \frac{1}{2}(|w_{ij}| - b_i)$. Substituting this into \eqref{eq:index_KKT_cond}, we get: 
\begin{equation}
        |w_{ij'}| = \hat{\lambda} \hat{y} + \sum\limits_{\vx_l \in \Psi(\hat{\vx}) \backslash \{ \vx_{l'}\}} \frac{1}{2}(b_i - |w_{ij}|) - \frac{1}{2}(b_i - |w_{ij'}|)
\end{equation}
Therefore,
\begin{equation}
        0 = \hat{\lambda} \hat{y} + \frac{1}{2}(D-2) b_i - \frac{1}{2}\normone{\vw_i} 
\end{equation}
But this results in a contradiction, because we know that $\hat{\lambda} \hat{y} \le 0$, $ \frac{1}{2}(D-2) b_i < 0$ and $- \frac{1}{2}\normone{\vw_i} < 0$.
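
For completeness, the simplification that leads to the last display can be expanded as follows. Since $|\Psi(\hat{\vx})| = D$ and each $\vx_l \in \Psi(\hat{\vx})$ corresponds to exactly one flipped index $j$, the sum over $\Psi(\hat{\vx}) \backslash \{\vx_{l'}\}$ has $D - 1$ terms, so
\begin{align*}
|w_{ij'}| & = \hat{\lambda} \hat{y} + \frac{1}{2}(D-1) b_i - \frac{1}{2}\sum\limits_{j \in [D] \backslash \{j'\}} |w_{ij}| - \frac{1}{2} b_i + \frac{1}{2}|w_{ij'}| \\
0 & = \hat{\lambda} \hat{y} + \frac{1}{2}(D-2) b_i - \frac{1}{2}\sum\limits_{j \in [D] \backslash \{j'\}} |w_{ij}| - \frac{1}{2}|w_{ij'}| = \hat{\lambda} \hat{y} + \frac{1}{2}(D-2) b_i - \frac{1}{2}\normone{\vw_i}
\end{align*}
The three terms on the right are nonpositive, negative and negative, respectively: $\hat{\lambda} \ge 0$ and $\hat{y} = -1$, $D > 2$ and $b_i < 0$, and $\normone{\vw_i} > 0$ because all weights are nonzero by Property 1 of Lemma \ref{lem:memorization_properties}.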

Thus, we conclude that gradient flow cannot converge to memorizing solutions.

\section{Proof of Theorem \ref{thm:min_norm}}
\label{sec:proof_nim_norm}

We prove the theorem in several parts. We first prove properties of a perfect solution (Section \ref{sec:simple_property_global_min}). In Section \ref{sec:bias_th} we prove several results regarding the bias threshold. In Section \ref{sec:auxil} we prove auxiliary lemmas and in Section \ref{sec:alignment} we prove the alignment of the neurons of the optimal solutions to the terms of the DNF. We conclude the proof in Section \ref{sec:finish_norm_proof}. 

\subsection{A Simple Property of Perfect Solutions}
\label{sec:simple_property_global_min}

Recall the definitions, $\sS_+ = \left\{\vx \mid (\vx, 1) \in \sS\right\}$ and $\sS_- = \left\{\vx \mid (\vx, -1) \in \sS\right\}$. We first need the following definitions.
We say that a solution $\left(\mW,\vb\right)$ satisfies the $\minplus$ property if for any positive point $\vx \in \sS_+$ there exists $\sI \subseteq [r]$ such that $\sum\limits_{i \in \sI}\vw_i \cdot\vx + b_i \ge 2$. We say that a solution satisfies the $\minminus$ property if for any negative point $\vx \in \sS_-$ and for all $i \in [r]$, $\vw_i \cdot \vx + b_i \le 0$.

\begin{lem}
\label{lem:simple_property}
$(\mW, \vb)$ is a perfect solution if and only if $(\mW, \vb)$ satisfies $\minplus$ and $\minminus$.
\end{lem}
\begin{proof}
 If $\mth = (\mW, \vb)$ is a perfect solution, then for all $(\vx, y) \in \sS$, $y N(\vx; \mW, \vb)\ge 1$. Therefore, if $y=1$,  $\sum\limits_{i \in [r]} \sigma(\vw_i \cdot \vx + b_i) \ge 2$. Thus, there exists $\sI \subseteq [r]$ such that $\sum\limits_{i \in \sI}\vw_i \cdot\vx + b_i \ge 2$ and the $\minplus$ property holds. If $y=-1$, then $\sum\limits_{i \in [r]} \sigma(\vw_i \cdot \vx + b_i) \le 0$ and therefore for all $i \in [r]$, $\vw_i \cdot \vx + b_i \le 0$, i.e., the $\minminus$ property holds. The other direction follows similarly.
\end{proof}
We note that one direct consequence of Lemma \ref{lem:simple_property} is that given $\mth$ if a negative $\vx \in \sS_-$ is activated by a neuron $i \in [r]$, i.e., $\vw_i \cdot \vx + b_i > 0$, then the $\minminus$ property doesn't hold, and $\mth$ is not a perfect solution.

\subsection{Proof of Lemma \ref{lem:main_paper_bias_th_lemma}}
\label{sec:bias_th}

In this section we show that when $\sS_x= \gX$, the bias of any neuron in a perfect solution is upper bounded by a certain value which we call the \textit{bias threshold}.
To simplify the formulation of this section we define the following:

\begin{defn}
\label{defn:noisy_indices}
We define the set of indices which are not active in any term as the noisy indices and denote them by $\sA_{K+1} =[D] \backslash \cup_{n \in [K]}\sA_n$.
\end{defn}


\begin{defn}
\label{defn:vn}
For each term $n \in [K]$ of $f^*$ and $i \in [r]$ define $V_n(\vw_i) = \max\left\{\min\limits_{j \in \sA_{n}} \left\{w_{ij}\right\}, 0\right\}$.
\end{defn}

\begin{defn}
\label{defn:bais_threshold}
The bias threshold for a weight $\vw$ is $BT(\vw) = - \normone{\vw} + 2 \sum\limits_{n \in [K]} V_n(\vw)$.
\end{defn}
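
As a sanity check of Definitions \ref{defn:vn} and \ref{defn:bais_threshold}, consider a hypothetical instance (our own numbers, not from the main paper) with $D = 4$, $K = 2$, disjoint terms $\sA_1 = \{1, 2\}$ and $\sA_2 = \{3, 4\}$, and the weight vector $\vw = (2, 1, -1, 3)$. Then
\[
V_1(\vw) = \max\left\{\min\{2, 1\}, 0\right\} = 1, \qquad V_2(\vw) = \max\left\{\min\{-1, 3\}, 0\right\} = 0,
\]
\[
BT(\vw) = - \normone{\vw} + 2 \left( V_1(\vw) + V_2(\vw) \right) = -7 + 2 = -5.
\]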

Note that $BT(\vw) \le 0$ because every term includes at least 2 literals. Using these definitions we rephrase Lemma \ref{lem:main_paper_bias_th_lemma} and prove it:

\begin{lem}
\label{lem:bias_th_lemma}
Assume that $\sS_x= \gX$. $(\mW, \vb)$ satisfies that $\forall i \in [r] \ds b_i \le BT(\vw_i)$ and satisfies the $\minplus$ property if and only if the network is a perfect solution.
\end{lem}

\begin{proof}
We will show that given a neuron $(\vw, b)$, there is a negative sample $\vx \in \gX$ for which $\vw \cdot \vx + b > 0$ if and only if $b > BT(\vw)$. From this and the assumption $\sS_x= \gX$, we can conclude that $\forall i \in [r] \ds b_i \le BT(\vw_i)$ if and only if the $\minminus$ property holds. Combining this with Lemma \ref{lem:simple_property} proves our claim.

Given neuron $(\vw, b)$, we define the minimum index of a term $n \in [K]$ as $J_n = \argmin\limits_{j \in \sA_n} \{ w_j\}$. Consider a sample $\hat{\vx} \in \gX$ that is defined by:
\be
\hat{x}_j = \begin{cases}
        - \sign(w_j) & \exists n \in [K]: \ds V_n(\vw) > 0  \ds \land \ds j = J_n  \\
        \sign(w_j) & otherwise
        \end{cases}
\ee
For every term $n \in [K]$, if $V_n(\vw) > 0$ then  $ \hat{x}_{J_n} =  -  \sign(w_{J_n}) = -1$. Otherwise, $V_n(\vw) = 0$ and there exists $j \in \sA_n$ such that $w_j < 0$, i.e., $\hat{x}_j = \sign(w_j) = - 1$. In any case, $\hat{\vx} \cdot \vt^*_n < |\sA_n|$. Therefore, the label of this sample is negative, and we denote it by $\hat{y} = -1$.

We show that $\vw \cdot \hat{\vx} = - BT(\vw) $ by,

\begin{align*}
 \vw \cdot \hat{\vx} & = \sum_{j \in [D]} w_j \cdot \hat{x}_j  = \sum_{n \in [K + 1]} \Biggl[ \sum_{j \in \sA_n} w_j \cdot \hat{x}_j \Biggr] \\
& = \sum\limits_{n \in [K] \textbf{ and } V_n(\vw) > 0} \Biggl[ \sum_{j \in \sA_n \backslash \{ J_n\}} w_j \cdot \sign(w_j) - w_{J_n} \cdot \sign(w_{J_n})\Biggr]  \\
& + \sum\limits_{n \in [K] \textbf{ and } V_n(\vw) = 0} \Biggl[ \sum_{j \in \sA_n} w_j \cdot \sign(w_j)\Biggr] + \sum_{j \in \sA_{K + 1}} w_j \cdot \sign(w_j) \numberthis \\
& = \sum\limits_{n \in [K] \textbf{ and } V_n(\vw) > 0} \Biggl[ \sum_{j \in \sA_n \backslash \{ J_n\}} |w_j| - |w_{J_n}| \Biggr] + \sum\limits_{n \in [K] \textbf{ and } V_n(\vw) = 0} \Biggl[ \sum_{j \in \sA_n} |w_j|\Biggr] + \sum_{j \in \sA_{K + 1}} |w_j|   \\
& = \sum\limits_{n \in [K] \textbf{ and } V_n(\vw) > 0} \Biggl[ \sum_{j \in \sA_n } |w_j| - 2V_n(\vw) \Biggr] + \sum\limits_{n \in [K] \textbf{ and } V_n(\vw) = 0} \Biggl[ \sum_{j \in \sA_n} |w_j| - 2V_n(\vw) \Biggr]   + \sum_{j \in \sA_{K + 1}} |w_j|  \\
& = \sum\limits_{n \in [K]} \Biggl[ \sum_{j \in \sA_n} |w_j| - 2V_n(\vw) \Biggr]  + \sum_{j \in \sA_{K + 1}} |w_j|  = \sum_{j \in [D]} |w_j| - 2\sum\limits_{n \in [K]} V_n(\vw)   \\
& = \normone{\vw} - 2\sum\limits_{n \in [K]} V_n(\vw)  = - BT(\vw) 
\end{align*}

For the first direction, if $b > BT(\vw)$ then $\vw \cdot \hat{\vx} + b = - BT(\vw) + b > 0$, as desired.

In the second direction, assume that there is a negative sample $\vx \in \gX$ such that $\vw \cdot \vx + b > 0$. We will show that $\vx \cdot \vw \le \hat{\vx} \cdot \vw$. Every term $n \in [K]$ satisfies for all $j \in \sA_n \backslash \{J_n \}$:
\be
\label{eq:highest_dot_product}
x_j w_j \le |w_j| = \sign(w_j) w_j = \hat{x}_jw_j
\ee
If $V_n(\vw) = 0$, then the index $j=J_n$ also satisfies \eqref{eq:highest_dot_product}, by the definition of $\hat{\vx}$. Otherwise $V_n(\vw) > 0$ and we know that there exists $j' \in \sA_n$ such that $x_{j'} = -1$ (since $\vx$ is negative), and $w_{J_n}, w_{j'} > 0$. If $J_n = j'$, then $x_{j'} w_{j'}= \hat{x}_{J_n}w_{J_n}$. Otherwise, the following holds:
\be
x_{j'} w_{j'} + x_{J_n} w_{J_n} \le - w_{j'} + w_{J_n} \le 0 \le  w_{j'} - w_{J_n} \le \hat{x}_{j'} w_{j'} + \hat{x}_{J_n} w_{J_n}
\ee
Note that every index $j \in \sA_{K + 1}$ satisfies \eqref{eq:highest_dot_product} as well. Therefore,  $\vx \cdot \vw \le \hat{\vx} \cdot \vw$. We can conclude that: 
\begin{equation}
\label{eq:bias_thresh_sample}
0 < \vx \cdot \vw + b \le \hat{\vx} \cdot \vw + b = - BT(\vw) + b
\end{equation}
which implies that $b > BT(\vw)$ as desired.
\end{proof}
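
The construction of $\hat{\vx}$ in the proof can be traced on a hypothetical instance (our own numbers, not from the main paper). We assume $D = 4$, $K = 2$, disjoint terms $\sA_1 = \{1, 2\}$ and $\sA_2 = \{3, 4\}$ that are satisfied when all of their coordinates equal $1$, and $\vw = (2, 1, -1, 3)$. Here $V_1(\vw) = 1 > 0$ with $J_1 = 2$, and $V_2(\vw) = 0$, so only the coordinate $J_1$ is flipped against the sign of its weight:
\[
\hat{\vx} = \left(\sign(w_1), -\sign(w_2), \sign(w_3), \sign(w_4)\right) = (1, -1, -1, 1).
\]
Neither term is satisfied by $\hat{\vx}$, so it is a negative sample, and
\[
\vw \cdot \hat{\vx} = 2 - 1 + 1 + 3 = 5 = \normone{\vw} - 2\left(V_1(\vw) + V_2(\vw)\right) = - BT(\vw).
\]
In this instance the neuron $(\vw, b)$ is therefore active on some negative sample exactly when $b > -5$.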

From this point, we will assume $\sS_x= \gX$ without mentioning it explicitly. 

\subsection{Auxiliary Lemmas}
\label{sec:auxil}

We first define a special positive sample for every term. The special sample is a sample where all indices corresponding to the term have positive values, and all other indices have negative values. The special samples will be used as the hardest positive samples for the $\minplus$ property.

\begin{defn}
\label{defn:positive_only_on_specific_terms}
For a term $n \in [K]$, we define the special sample $\spclvec{n} \in \sS_x$ of this term as follows:
\be
 \forall j \in \sA_n \ds \spclscal{n}_j = 1 \text{ and } \forall j \in [D] \backslash \sA_n \ds \spclscal{n}_j = -1
\ee
We denote the set of all the special samples by $\sO = \group{\vx \in \sS_+}{ \exists n \in [K] \ds \vx = \spclvec{n}}$
\end{defn}
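
For example, in a hypothetical instance of our own choosing with $D = 4$, $K = 2$, $\sA_1 = \{1, 2\}$ and $\sA_2 = \{3, 4\}$, the special samples are
\[
\spclvec{1} = (1, 1, -1, -1), \qquad \spclvec{2} = (-1, -1, 1, 1).
\]
Each satisfies exactly one term and sets every coordinate outside that term to $-1$, which is the least favourable choice for a neuron with nonnegative weights.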

\begin{lem}
\label{lem:positive_weight_property}
Given $\mth = (\mW, \vb)$, assume the following conditions are satisfied:
 \begin{enumerate} 
    \item $\forall i \in [r], \ds \forall j \in [D] \ds \ds w_{ij} \ge 0$.
    \item For every $\vx \in \sO$ there exists $\sI \subseteq [r]$ such that $\sum\limits_{i \in \sI}\vw_i \cdot\vx + b_i \ge 2$.
\end{enumerate}
 Then $\mth$ satisfies the $\minplus$ property.
\end{lem}

\begin{proof}
Let $\vx\in \sS_+$. Then $\exists n \in [K]$ such that $\forall j \in \sA_n \ds x_j = 1$. By the second assumption, $\exists \sI \subseteq [r]$ such that $\sum\limits_{i \in \sI}\vw_i \cdot \spclvec{n} + b_i \ge 2$. 

For every $i \in [r]$ the following holds:
\be
\vw_i \cdot \vx = \sum\limits_{j \in [D]} w_{ij} x_j = \sum\limits_{j \in \sA_n} w_{ij} + \sum\limits_{j \in [D] \backslash  \sA_n} x_j w_{ij}
\ee
From the first condition of the lemma and $x_j \ge -1$ we can deduce that
\be
 \sum\limits_{j \in \sA_n} w_{ij} + \sum\limits_{j \in [D] \backslash  \sA_n} x_j w_{ij} \ge \sum\limits_{j \in \sA_n} w_{ij} - \sum\limits_{j \in [D] \backslash  \sA_n} w_{ij} = \sum\limits_{j \in [D]} w_{ij} \spclscal{n}_j = \vw_i \cdot \spclvec{n}
\ee
Then:
\be
\sum_{i \in \sI} \sigma(\vw_i \cdot \vx + b_i) \ge \sum_{i \in \sI} \sigma(\vw_i \cdot \spclvec{n} + b_i)  \ge 2
\ee
and $\mth$ satisfies the $\minplus$ property for $\vx$ as required. 
\end{proof}

The following definition will be very useful in our analysis.
\begin{defn}
\label{defn:i_modify}
Given a min-norm solution $\mth^* = (\mW^*, \vb^*)$, we say that a solution $\hat{\mth} = (\hat{\mW}, \hat{\vb})$ is an $i$-modified solution if the following holds:
\be
\forall i' \in [r] \backslash \{ i\} \ds \hat{\vw}_{i'} = \vw^*_{i'} \text{ and } \hat{b}_{i'} = b^*_{i'}
\ee
\end{defn}
Thus, given a min-norm solution, to define an $i$-modified solution, we only need to define the neuron $(\vw_i,b_i)$.

\begin{lem}
\label{lem:neuron_form}
Given a min-norm solution $\mth^* = (\mW^*, \vb^*)$, every $i \in [r]$ satisfies:
 \begin{enumerate} 
    \item $b_i^* = BT(\vw_i^*)$.
    \item $\forall j \in [D] \ds w_{ij}^* \ge 0$.
    \item $\exists n \in [K] \text{ such that } \forall j \in \sA_n \ds w_{ij}^* \ge 0 \text{ and } \forall j \in [D] \backslash \sA_n \ds w_{ij}^* = 0$.
\end{enumerate}
\end{lem}
\begin{proof}
\textbf{Property 1:} Assume by contradiction that $b^*_i \neq BT(\vw^*_i)$. By Lemma \ref{lem:bias_th_lemma}, $b^*_i$ has to be smaller than $BT(\vw^*_i)$, because otherwise $\mth^*$ is not a perfect solution. Now consider the $i$-modified solution, $\hat{\mth}$, which is defined by:
\be
\hat{\vw}_i = \vw^*_i, \ds \ds \hat{b}_i = BT(\hat{\vw}_i)
\ee
By the assumption $\hat{b}_i > b^*_i$. Then, every $\vx \in \sS_x$ satisfies the following:
\be
\label{eq:smaller_bais}
\vx \cdot \vw^*_i + b^*_i < \vx \cdot \hat{\vw}_i + \hat{b}_i
\ee
Since $\mth^*$ satisfies the $\minplus$ property, the above implies that $\hat{\mth}$ satisfies it as well.

From Definition \ref{defn:i_modify} every $\vx \in \sS_-$ satisfies:
\be
    \forall i' \in [r] \backslash \{i\} \ds \ds  0 \ge \vx \cdot \vw^*_{i'} + b^*_{i'} = \vx \cdot \hat{\vw}_{i'} + \hat{b}_{i'}
\ee

In addition, we saw in the proof of Lemma \ref{lem:bias_th_lemma} that if $\hat{b}_i = BT(\hat{\vw}_i)$ then $0 \ge \vx \cdot \hat{\vw}_i + \hat{b}_i$. Therefore, $\hat{\mth}$ satisfies the $\minminus$ property. By Lemma \ref{lem:simple_property}, $\hat{\mth}$ is a perfect solution. 

From Definition \ref{defn:bais_threshold} the bias threshold is nonpositive and therefore $ b^*_i< \hat{b}_i \le 0$ implies that $|\hat{b}_i| < |b^*_i|$. We know that $\hat{\vw}_i = \vw^*_i$ and therefore $\normtwo{(\hat{\vw}_i, \hat{b}_i)} < \normtwo{(\vw^*_i, b^*_i)}$ which contradicts the optimality of $\mth^*$.

\textbf{Property 2:} Assume by contradiction that $\exists j' \in [D]$ such that $w^*_{ij'} < 0$. Consider the following $i$-modified solution $\hat{\mth}$:
\be
\label{eq:propery_two_solution_definition}
\forall j \in [D] \backslash \{j'\} \ds \hat{w}_{ij} = w^*_{ij} \ds \land \ds  \hat{w}_{ij'} = 0 \ds \land \ds \hat{b}_i = BT(\hat{\vw}_i)
\ee
We want to show that:
\be
\label{eq:minimum_index_equivalent}
\sum\limits_{n \in [K]} V_n(\vw^*_i) = \sum\limits_{n \in [K]} V_n(\hat{\vw}_i)
\ee
If $\exists n' \in [K]$ such that $j' \in \sA_{n'}$, then it follows that $V_{n'}(\vw^*_i) = V_{n'}(\hat{\vw}_i) = 0$ and \eqref{eq:minimum_index_equivalent} is satisfied. Otherwise, $j' \in \sA_{K + 1}$, by Definition \ref{defn:noisy_indices}, the indices of $\sA_{K + 1}$ don't affect the value of the sums in \eqref{eq:minimum_index_equivalent}. Therefore, this equation is satisfied in this case as well.

We know that $b^*_i = BT(\vw^*_i)$ according to Property 1 above, thus every $\vx \in \sS_x$ satisfies the  following:
\begin{align*}
 \vx \cdot \vw^*_i + b^*_i & = \sum\limits_{j \in [D]} x_j w^*_{ij} + BT(\vw^*_i) \numberthis \\
& = \sum\limits_{j \in [D] \backslash \{ j' \}} x_j w^*_{ij} + x_{j'} w^*_{ij'} - |w^*_{ij'}|  - \sum\limits_{j \in [D] \backslash \{ j' \}} |w^*_{ij}| + \sum\limits_{n \in [K]}2 V_n(\vw^*_i) 
\end{align*}

We can see that  $x_{j'} w^*_{ij'} - |w^*_{ij'}| \le 0$. Then, 
\begin{align*}
& \sum\limits_{j \in [D] \backslash \{ j' \}} x_j w^*_{ij} + x_{j'} w^*_{ij'} - |w^*_{ij'}|  - \sum\limits_{j \in [D] \backslash \{ j' \}} |w^*_{ij}| + \sum\limits_{n \in [K] }2 V_n(\vw^*_i) \\
& \le \sum\limits_{j \in [D] \backslash \{ j' \}} x_j w^*_{ij} - \sum\limits_{j \in [D] \backslash \{ j' \}} |w^*_{ij}| + \sum\limits_{n \in [K]}2 V_n(\vw^*_i)  \numberthis \\
& = \sum\limits_{j \in [D]} x_j \hat{w}_{ij} - \normone{\hat{\vw}_i} + \sum\limits_{n \in [K]}2 V_n(\hat{\vw}_i) = \vx \cdot \hat{\vw}_i + BT(\hat{\vw}_i)  = \vx \cdot \hat{\vw}_i + \hat{b}_i
\end{align*}

Using the fact that $\vx \cdot \vw^*_i + b^*_i \le \vx \cdot \hat{\vw}_i + \hat{b}_i$ with the fact that $\mth^*$ satisfies the $\minplus$ property, we can conclude that $\hat{\mth}$ satisfies this property too. 

According to Property 1 above and Definition \ref{defn:i_modify} we have:
\be
\forall i' \in [r] \backslash \{ i \} \ds BT(\hat{\vw}_{i'}) = BT(\vw^*_{i'}) = b^*_{i'} = \hat{b}_{i'}
\ee
In addition, we know that $\hat{b}_i = BT(\hat{\vw}_i)$ by \eqref{eq:propery_two_solution_definition}. According to Lemma 
\ref{lem:bias_th_lemma}, we know that $\hat{\mth}$ is a perfect solution.

 From \eqref{eq:propery_two_solution_definition}, we know that $|w^*_{ij'}| > |\hat{w}_{ij'}|$, which implies $\normone{\vw^*_i} > \normone{\hat{\vw}_i}$. Combining this with  \eqref{eq:minimum_index_equivalent} and the fact that the bias threshold is nonpositive we can conclude that
 \be
 - \normone{\vw^*_i} + 2\sum\limits_{n \in [K]} V_n(\vw^*_i) < - \normone{\hat{\vw}_i} + 2\sum\limits_{n \in [K]} V_n(\hat{\vw}_i) \Rightarrow \left|BT(\vw^*_i)\right| > \left|BT(\hat{\vw}_i)\right| \Rightarrow \left|b^*_i\right| > \left|\hat{b}_i\right|
 \ee
 Therefore,  $\normtwo{(\hat{\vw}_i, \hat{b}_i)} < \normtwo{(\vw_i^*, b_i^*)}$ in contradiction to the optimality of $\mth^*$.

\textbf{Property 3:} Assume by contradiction that there exists $i \in [r]$ such that: 
\be
\exists n_1 \neq n_2 \in [K + 1] \text{ such that } \exists j \in \sA_{n_1} \ds w_{ij}^* > 0 \text{ and } \exists j \in \sA_{n_2} \ds w_{ij}^* > 0
\ee
Without loss of generality we assume:
\be
\label{eq:samller_term_assumption}
\sum\limits_{j \in \sA_{n_1}} w_{ij}^* \ge \sum\limits_{j \in \sA_{n_2}} w_{ij}^*
\ee
Consider the following $i$-modified solution $\hat{\mth}$, defined by:
\be
\label{eq:i_modify_3_definition}
\forall j \in \sA_{n_2} \ds \hat{w}_{ij} = 0 \text{ and } \forall j \in [D] \backslash \sA_{n_2} \ds \hat{w}_{ij} = w^*_{ij} \text{ and } \hat{b}_i = BT(\hat{\vw}_i)
\ee
First, we will show that $\hat{b}_i \ge b^*_i $. If $n_2 \neq K +1$, from Definition \ref{defn:bais_threshold} and the assumption that $\left|\sA_{n_2}\right| > 1$, the following holds:
\be
\hat{b}_i = BT(\hat{\vw}_i) = BT(\vw^*_i) + \sum\limits_{j \in \sA_{n_2}} \left| w^*_{ij} \right| - 2V_{n_2}(\vw^*_i) = b^*_i + \sum\limits_{j \in \sA_{n_2}} \left|w^*_{ij}\right| - 2V_{n_2}(\vw^*_i) \ge b^*_i 
\ee
Otherwise $n_2 = K +1$ and from Definition \ref{defn:bais_threshold} the following holds:
\be
\hat{b}_i = BT(\hat{\vw}_i) = BT(\vw^*_i) + \sum\limits_{j \in \sA_{n_2}} \left| w^*_{ij} \right| = b^*_i + \sum\limits_{j \in \sA_{n_2}}\left| w^*_{ij}\right|  \ge b^*_i 
\ee
In both cases $\hat{b}_i \ge b^*_i $ as required.
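
To make the first case concrete, consider a small illustrative instance (hypothetical values, not taken from our experiments) with $D = 4$, $\sA_{n_1} = \{1, 2\}$, $\sA_{n_2} = \{3, 4\}$ and $\vw^*_i = (1, 1, 0.5, 0.3)$. We assume here that the bias threshold has the form $BT(\vw) = -\normone{\vw} + 2\sum_{n \in [K]} V_n(\vw)$, where for nonnegative weights $V_n(\vw)$ is the minimal weight over $\sA_n$; this is our reading of Definition \ref{defn:bais_threshold} as it is used in this proof. Under this assumption:
\begin{align*}
BT(\vw^*_i) & = -2.8 + 2(1 + 0.3) = -0.2 \\
BT(\hat{\vw}_i) & = -2 + 2(1 + 0) = 0 = BT(\vw^*_i) + \underbrace{(0.5 + 0.3)}_{\sum_{j \in \sA_{n_2}} |w^*_{ij}|} - \underbrace{2 \cdot 0.3}_{2V_{n_2}(\vw^*_i)}
\end{align*}
so that $\hat{b}_i = 0 \ge -0.2 = b^*_i$, in agreement with the general argument.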

Given $\widetilde{n} \in [K]$, we know that $\mth^*$ is a perfect solution and thus it satisfies the $\minplus$ property. Then, for $\spclvec{\widetilde{n}}$ there exists $\sI \subseteq [r]$ such that:
\be
\sum\limits_{i' \in \sI} \vw^*_{i'} \cdot \spclvec{\widetilde{n}} + b^*_{i'} \ge 2
\ee
We will show that there exists $\sI' \subseteq [r]$ such that:
\be
\label{eq:minplus_cond_1}
\sum\limits_{i' \in \sI'}\hat{\vw}_{i'} \cdot \spclvec{\widetilde{n}} + \hat{b}_{i'} \ge 2
\ee
Recall, by Definition \ref{defn:i_modify},  for any $i' \in [r] \backslash \{ i \}$ we know that $\hat{\vw}_{i'} \cdot \spclvec{\widetilde{n}} + \hat{b}_{i'} = \vw^*_{i'} \cdot \spclvec{\widetilde{n}} + b^*_{i'}$.

If $\widetilde{n} \neq n_2$, due to Property 1 and Property 2 above and the fact that $\hat{b}_i \ge b^*_i $ the following holds:
\begin{align*}
  \vw^*_i \cdot \spclvec{\widetilde{n}} + b^*_i & = \sum\limits_{j \in [D] \backslash \sA_{n_2}} w^*_{ij} \spclscal{\widetilde{n}}_j -  \sum\limits_{j \in \sA_{n_2}} w^*_{ij}  + b^*_i < \sum\limits_{j \in [D] \backslash \sA_{n_2}} w^*_{ij}  \spclscal{\widetilde{n}}_j + b^*_i \le \sum\limits_{j \in [D]} \hat{w}_{ij}  \spclscal{\widetilde{n}}_j + \hat{b}_i  \\
 & = \hat{\vw}_i \cdot \spclvec{\widetilde{n}} + \hat{b}_i \numberthis 
\end{align*}
Therefore, 
\be
\sum\limits_{i' \in \sI} \hat{\vw}_{i'} \cdot \spclvec{\widetilde{n}} + \hat{b}_{i'} \ge \sum\limits_{i' \in \sI}\vw^*_{i'} \cdot \spclvec{\widetilde{n}} + b^*_{i'} \ge 2
\ee
Otherwise, $\widetilde{n} = n_2$. By the fact that $b^*_i = BT(\vw^*_i) \le 0$, Property 2 above and \eqref{eq:samller_term_assumption} the following holds:
\be
 \vw^*_i \cdot \spclvec{\widetilde{n}} + b^*_i =  \sum\limits_{j \in \sA_{n_2}} w^*_{ij} - \sum\limits_{j \in [D] \backslash \sA_{n_2}} w^*_{ij} + b^*_i \le   \sum\limits_{j \in \sA_{n_2}} w^*_{ij} - \sum\limits_{j \in \sA_{n_1}} w^*_{ij} + b^*_i  \le \sum\limits_{j \in \sA_{n_2}} w^*_{ij} - \sum\limits_{j \in \sA_{n_1}} w^*_{ij}  \le 0 
\ee

Therefore, using Definition \ref{defn:i_modify},  $\hat{\mth}$ satisfies the following:
\be
\sum\limits_{i' \in \sI \backslash \{ i\}} \hat{\vw}_{i'} \cdot \spclvec{\widetilde{n}} + \hat{b}_{i'} = \sum\limits_{i' \in \sI \backslash \{ i\}}\vw^*_{i'} \cdot \spclvec{\widetilde{n}} + b^*_{i'} \ge \sum\limits_{i' \in \sI}\vw^*_{i'} \cdot \spclvec{\widetilde{n}} + b^*_{i'} \ge 2
\ee
We can conclude that $\hat{\mth}$ satisfies \eqref{eq:minplus_cond_1}. Combining this with Property 2 above, we can see that $\hat{\mth}$ meets the condition of Lemma \ref{lem:positive_weight_property} and then it satisfies the $\minplus$ property. 

According to Property 1 above and Definition \ref{defn:i_modify}:
\be
\forall i' \in [r] \backslash \{ i \} \ds BT(\hat{\vw}_{i'}) = BT(\vw^*_{i'}) = b^*_{i'} = \hat{b}_{i'}
\ee
In addition, we know that $\hat{b}_i = BT(\hat{\vw}_i)$ by \eqref{eq:i_modify_3_definition}. According to Lemma 
\ref{lem:bias_th_lemma}, the solution $\hat{\mth}$ is a perfect solution.

As we saw $0 \ge \hat{b}_i \ge b^*_i  \rightarrow \left|\hat{b}_i\right| \le \left|b^*_i\right| $, $\forall j \in \sA_{n_2} \ds w^*_{ij} \ge 0 = \hat{w}_{ij} $ and $\exists j \in \sA_{n_2} \ds w^*_{ij} > 0 = \hat{w}_{ij} $ and therefore $\normtwo{(\vw^*_i, b^*_i)} > \normtwo{(\hat{\vw}_i, \hat{b}_i)}$ which contradicts the optimality of $\mth^*$.
\end{proof}

\subsection{Alignment Lemmas}
\label{sec:alignment}

The following three lemmas show the alignment properties of the min-norm solution.

\begin{lem}
\label{lem:neuron_align}
Given a min-norm solution $\mth^* = (\mW^*, \vb^*)$, every neuron $i \in [r]$ either aligns with some term $n \in [K]$ or satisfies $\vw^*_i = \mathbf{0}, b^*_i = 0$.
\end{lem}
\begin{proof}
Given $i \in [r]$, if $\vw^*_i = \mathbf{0}$ the claim is true by Property 1 of Lemma \ref{lem:neuron_form}. Otherwise, by Property 3 of Lemma \ref{lem:neuron_form}, $\exists n \in [K]$ such that:
\be 
\forall j \in \sA_n \ds w^*_{ij} \ge 0 \text{ and } \exists j \in \sA_n \ds w^*_{ij} > 0 \text{ and } \forall j \in [D] \backslash \sA_n \ds w^*_{ij} = 0
\ee
Assume by contradiction that:
\be
\exists j_1, j_2 \in \sA_n \text{ such that } w^*_{ij_1} \neq w^*_{ij_2}
\ee
Without loss of generality, we assume $w^*_{ij_1} > w^*_{ij_2}$ and $w^*_{ij_2} = \min\limits_{j \in \sA_n} \{ w^*_{ij} \}$.

Define the following $i$-modified solution $\hat{\mth}$:
\be
\label{eq:i_modify_4_definition}
\forall j \in \sA_n \ds \hat{w}_{ij} = w^*_{ij_2} \text{ and } \forall j \in [D] \backslash \sA_n \ds \hat{w}_{ij} = w^*_{ij} \text{ and } \hat{b}_i = BT(\hat{\vw}_i)
\ee
Note that $V_n(\vw^*_i) = V_n(\hat{\vw}_i) = w^*_{ij_2}$. 

Given $\vx \in \sS_+$, we know that $\mth^*$ satisfies the $\minplus$ property. Then, $\exists \sI \subseteq [r]$ such that:
\be
\sum\limits_{i' \in \sI}\vw^*_{i'} \cdot\vx + b^*_{i'} \ge 2
\ee
Recall that, by Definition \ref{defn:i_modify}, for any $i' \in [r] \backslash \{ i \}$ we have $\hat{\vw}_{i'} \cdot \vx + \hat{b}_{i'} = \vw^*_{i'} \cdot\vx + b^*_{i'}$.

If $\vx \cdot \vt^*_n = \normone{\vt^*_n}$, due to Property 3 of Lemma \ref{lem:neuron_form} the following holds:
\begin{align*}
 \vw^*_i \cdot \vx + b^*_i & = \sum\limits_{j \in \sA_{n}} w^*_{ij} + BT(\vw^*_i) = \sum\limits_{j \in \sA_{n}} w^*_{ij} - \sum\limits_{j \in \sA_n} \left|w^*_{ij}\right| + 2V_n(\vw^*_i) = 2 w^*_{ij_2} \numberthis  \\
&  = \sum\limits_{j \in \sA_n} \hat{w}_{ij} - \sum\limits_{j \in \sA_n} \hat{w}_{ij} + 2 w^*_{ij_2} = \sum\limits_{j \in \sA_n} \hat{w}_{ij} - \normone{\hat{\vw}_i} + 2 V_n(\hat{\vw}_i) = \hat{\vw}_i \cdot \vx + \hat{b}_i
\end{align*}

Then we can conclude:
\be
\sum\limits_{i' \in \sI} \hat{\vw}_{i'} \cdot \vx + \hat{b}_{i'} = \sum\limits_{i' \in \sI}\vw^*_{i'} \cdot\vx + b^*_{i'} \ge 2
\ee

Otherwise, $\vx \cdot \vt^*_n < \normone{\vt^*_n}$ and by Definition \ref{defn:bais_threshold}:
\be
\vw^*_i \cdot \vx + b^*_i = \sum\limits_{j \in \sA_n} w^*_{ij} x_j  + BT(\vw^*_i) \le \sum\limits_{j \in \sA_n \backslash\{ j_2\}} w^*_{ij} -  w^*_{ij_2} - \sum\limits_{j \in \sA_{n} \backslash\{ j_2\} } w^*_{ij} -  w^*_{ij_2} + 2 w^*_{ij_2} = 0 
\ee

Therefore, 
\be
\sum\limits_{i' \in \sI \backslash \{ i\}} \hat{\vw}_{i'} \cdot \vx + \hat{b}_{i'} = \sum\limits_{i' \in \sI \backslash \{ i\}}\vw^*_{i'} \cdot\vx + b^*_{i'} \ge \sum\limits_{i' \in \sI}\vw^*_{i'} \cdot\vx + b^*_{i'} \ge 2
\ee
We can conclude that $\hat{\mth}$ satisfies the $\minplus$ property.

According to Property 1 of Lemma \ref{lem:neuron_form} and Definition \ref{defn:i_modify}:
\be
\forall i' \in [r] \backslash \{ i \} \ds BT(\hat{\vw}_{i'}) = BT(\vw^*_{i'}) = b^*_{i'} = \hat{b}_{i'}
\ee
In addition, we know that $\hat{b}_i = BT(\hat{\vw}_i)$ by \eqref{eq:i_modify_4_definition}. According to Lemma \ref{lem:bias_th_lemma}, the solution $\hat{\mth}$ is a perfect solution.

Finally, $\normone{\vw^*_i} > \normone{\hat{\vw}_i}$ implies that $\left|BT(\vw_i^*)\right| > \left|BT(\hat{\vw}_i)\right|$, and $\forall j \in [D]:\, w^*_{ij} \ge \hat{w}_{ij}$. Thus, we have $\normtwo{(\vw^*_i, b^*_i)} > \normtwo{(\hat{\vw}_i, \hat{b}_i)}$. This contradicts the optimality of $\mth^*$, as desired.

We can now define $\lambda_i = V_n(\vw^*_i)$ and we know that the neuron $i$ satisfies:
\be 
\forall j \in \sA_n \ds w^*_{ij} = \lambda_i \text{ and } \forall j \in [D] \backslash \sA_n \ds w^*_{ij} = 0
\ee
Therefore, $\vw^*_i = \lambda_i \vt^*_n, b^*_i = \lambda_i(2 - \normone{\vt^*_n})$ and we can say that neuron $i$ aligns with the term $n$.
\end{proof}
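
As an illustration of the aligned form (hypothetical values, chosen only for exposition), take a term with $|\sA_n| = 3$ and a neuron with $\lambda_i = \frac{1}{2}$, and assume that $\vt^*_n$ is the $0/1$ indicator vector of $\sA_n$, which is consistent with $w^*_{ij} = \lambda_i$ on $\sA_n$ and $w^*_{ij} = 0$ elsewhere. Then $\vw^*_i = \frac{1}{2}\vt^*_n$ and $b^*_i = \frac{1}{2}(2 - 3) = -\frac{1}{2}$. For a sample $\vx$ that satisfies the term ($x_j = 1$ for all $j \in \sA_n$), and for a sample $\vx'$ in which exactly one coordinate of $\sA_n$ is flipped to $-1$, we get:
\begin{align*}
\vw^*_i \cdot \vx + b^*_i & = \tfrac{3}{2} - \tfrac{1}{2} = 1 = 2\lambda_i \\
\vw^*_i \cdot \vx' + b^*_i & = \tfrac{1}{2} + \tfrac{1}{2} - \tfrac{1}{2} - \tfrac{1}{2} = 0
\end{align*}
Hence the neuron is active exactly on samples that satisfy the term, matching the two cases analyzed in the proof.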

\begin{lem}
\label{lem:2_neuron_align_equivallent}
Given a min-norm solution $\mth^* = (\mW^*, \vb^*)$, every two neurons $i_1, i_2 \in [r]$ that align with term $n \in [K]$ satisfy $\lambda_{i_1} = \lambda_{i_2}$.\footnote{The $\lambda_i$ were defined in the proof of the previous lemma.}
\end{lem}

\begin{proof}
Given $i_1, i_2 \in [r]$ that align with term $n \in [K]$ we have $\vw^*_{i_1} = \lambda_{i_1} \vt^*_n$ and $\vw^*_{i_2} = \lambda_{i_2} \vt^*_n$. Assume by contradiction that $\lambda_{i_2} \neq \lambda_{i_1}$. Define the following solution $\hat{\mth}$:
\begin{align*}
\label{eq:2_align_solution}
     \forall i \in [r] \backslash \{ i_1, i_2 \} \: \:  & \hat{\vw}_i = \vw^*_i, \: \hat{b}_i = b^*_i \\
    & \hat{\vw}_{i_1} = \frac{\lambda_{i_1} + \lambda_{i_2}}{2} \vt^*_n, \: \:  \hat{b}_{i_1} = BT(\hat{\vw}_{i_1}) \numberthis  \\
    & \hat{\vw}_{i_2} = \frac{\lambda_{i_1} + \lambda_{i_2}}{2} \vt^*_n, \: \:  \hat{b}_{i_2} = BT(\hat{\vw}_{i_2})
\end{align*}

Note that for every $i' \in [r] \backslash \{ i_1, i_2 \}$ and every sample $\vx$ we have $\hat{\vw}_{i'} \cdot \vx + \hat{b}_{i'} = \vw^*_{i'} \cdot\vx + b^*_{i'}$.

Given $\vx \in \sS_+$, we know that $\mth^*$ satisfies the $\minplus$ property. Then, $\exists \sI \subseteq [r]$ such that:
\be
\sum\limits_{i' \in \sI}\vw^*_{i'} \cdot\vx + b^*_{i'} \ge 2
\ee

If $\vx \cdot \vt^*_n = \normone{\vt^*_n}$ we can calculate the following: 
\be
\vw^*_{i_1} \cdot\vx + b^*_{i_1} + \vw^*_{i_2} \cdot\vx + b^*_{i_2} = 2\lambda_{i_1} + 2\lambda_{i_2} = \hat{\vw}_{i_1} \cdot\vx + \hat{b}_{i_1} + \hat{\vw}_{i_2} \cdot\vx + \hat{b}_{i_2}
\ee
Therefore, 
\be
\sum\limits_{i' \in \sI}\hat{\vw}_{i'} \cdot\vx + \hat{b}_{i'} = \sum\limits_{i' \in \sI}\vw^*_{i'} \cdot\vx + b^*_{i'} \ge 2
\ee
Otherwise, $\vx \cdot \vt^*_n < \normone{\vt^*_n}$ thus $\vw^*_{i_1} \cdot\vx + b^*_{i_1} \le 0$, $\vw^*_{i_2} \cdot\vx + b^*_{i_2} \le 0$. Then,
\be
\sum\limits_{i' \in \sI \backslash \{ i_1, i_2\}}\hat{\vw}_{i'} \cdot\vx + \hat{b}_{i'} = \sum\limits_{i' \in \sI \backslash \{ i_1, i_2\}}\vw^*_{i'} \cdot\vx + b^*_{i'} \ge \sum\limits_{i' \in \sI}\vw^*_{i'} \cdot\vx + b^*_{i'} \ge 2
\ee

Combining this with Property 2 of Lemma \ref{lem:neuron_form} we can see that $\hat{\mth}$ meets the condition of Lemma \ref{lem:positive_weight_property} and then it satisfies the $\minplus$ property. 

According to Property 1 of Lemma \ref{lem:neuron_form}:
\be
\forall i' \in [r] \backslash \{ i_1, i_2 \} \ds BT(\hat{\vw}_{i'}) = BT(\vw^*_{i'}) = b^*_{i'} = \hat{b}_{i'}
\ee
In addition, we know that $\hat{b}_{i_1} = BT(\hat{\vw}_{i_1})$ and $\hat{b}_{i_2} = BT(\hat{\vw}_{i_2})$ by \eqref{eq:2_align_solution}. According to Lemma  \ref{lem:bias_th_lemma}, the solution $\hat{\mth}$ is a perfect solution.

We will prove that $\sum_{i \in [r]} \normtwo{(\vw^*_i, b^*_i)} > \sum_{i \in [r]} \normtwo{(\hat{\vw}_i, \hat{b}_i)}$. This will contradict the optimality of $\mth^*$. Note that $\normtwo{(\vw^*_{i_1}, b^*_{i_1})} = \lambda_{i_1}^2 |\sA_n| + \lambda_{i_1}^2 (|\sA_n| - 2)^2$. Then:
 \begin{align*}
      \sum_{i \in [r]} \normtwo{(\vw^*_i, b^*_i)} - \sum_{i \in [r]} \normtwo{(\hat{\vw}_i, \hat{b}_i)} & = \normtwo{(\vw^*_{i_1}, b^*_{i_1})} + \normtwo{(\vw^*_{i_2}, b^*_{i_2})} - \normtwo{(\hat{\vw}_{i_1}, \hat{b}_{i_1})} - \normtwo{(\hat{\vw}_{i_2}, \hat{b}_{i_2})} \\
     & = \left(\lambda_{i_1}^2 + \lambda_{i_2}^2 - 2 \left(\frac{\lambda_{i_1} + \lambda_{i_2}}{2}\right)^2 \right) \left(|\sA_n| + (|\sA_n| - 2)^2\right) \\
     & =  \left(\frac{1}{2}\lambda_{i_1}^2 - \lambda_{i_1}\lambda_{i_2} +\frac{1}{2}\lambda_{i_2}^2 \right) \left(|\sA_n| + (|\sA_n| - 2)^2\right) \\
     & = \left(\frac{1}{\sqrt{2}}\lambda_{i_1} - \frac{1}{\sqrt{2}}\lambda_{i_2}\right)^2 \left(|\sA_n| + (|\sA_n| - 2)^2\right)
 \end{align*}
 
 For $\lambda_{i_1} \neq \lambda_{i_2}$, we get $\sum_{i \in [r]} \normtwo{(\vw^*_i, b^*_i)} - \sum_{i \in [r]} \normtwo{(\hat{\vw}_i, \hat{b}_i)} > 0$, as needed.
\end{proof}
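
As a numerical sanity check of the computation in the proof above (an illustrative choice of values, not taken from our experiments), let $|\sA_n| = 3$, $\lambda_{i_1} = 0.8$ and $\lambda_{i_2} = 0.2$, so that $|\sA_n| + (|\sA_n| - 2)^2 = 4$ and the averaged value is $\frac{\lambda_{i_1} + \lambda_{i_2}}{2} = 0.5$. Then:
\begin{align*}
\left(0.8^2 + 0.2^2\right) \cdot 4 - 2 \cdot 0.5^2 \cdot 4 = 2.72 - 2 = 0.72 = \left(\frac{0.8 - 0.2}{\sqrt{2}}\right)^2 \cdot 4
\end{align*}
The difference is strictly positive, as the closed-form expression predicts whenever $\lambda_{i_1} \neq \lambda_{i_2}$.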

\begin{lem}
\label{lem:sum_of_neuron_align}
Given a min-norm solution $\mth^* = (\mW^*, \vb^*)$, if $\sI \subseteq [r]$ is the set of all neurons that align with term $n \in [K]$, then $\sum\limits_{i \in \sI} \lambda_i = 1$.
\end{lem}


\begin{proof}

Assume by contradiction that $\sum_{i \in \sI} \lambda_i \neq 1$. If $\sum_{i \in \sI} \lambda_i< 1$, then by Lemma \ref{lem:neuron_align} every neuron $i' \in [r] \backslash \sI$ aligns with another term or is equal to 0, so $\spclvec{n} \cdot \vw^*_{i'} \le 0$. By Property 1 of Lemma \ref{lem:neuron_form}, $b^*_{i'} = BT(\vw^*_{i'}) \le 0$. Therefore, $\spclvec{n} \cdot \vw^*_{i'} + b^*_{i'} \le 0$. Then, for every $\sI' \subseteq [r]$ the following holds:
\be
\sum\limits_{i \in \sI'} \vw^*_i \cdot \spclvec{n} + b^*_i \le \sum\limits_{i \in \sI' \cap \sI} \vw^*_i \cdot \spclvec{n} + b^*_i =
\sum\limits_{i \in \sI' \cap \sI} |\sA_n| \lambda_i + (2-|\sA_n|) \lambda_i = 2 \sum\limits_{i \in \sI' \cap \sI}  \lambda_i < 2    \numberthis
\ee
and thus $\mth^*$ does not satisfy the $\minplus$ property. By Lemma \ref{lem:simple_property}, this contradicts the fact that $\mth^*$ is a perfect solution.

If $\sum_{i \in \sI} \lambda_i> 1$, we choose an arbitrary $\hat{i} \in \sI$. Define $\hat{\mth}$ as follows:
\begin{align*}
\label{eq:better_solution_definition}
\forall i \in [r] \backslash \sI \ds & \hat{\vw}_i = \vw^*_i \text{, } \ds \hat{b}_i = b^*_i \\
 \forall i \in \sI \backslash \{\hat{i}\}  \ds &\hat{\vw}_i = 0 \text{, } \ds \hat{b}_i = 0 \numberthis \\
& \hat{\vw}_{\hat{i}} = \vt^*_n \text{, } \ds \hat{b}_{\hat{i}} = BT(\hat{\vw}_{\hat{i}})   
\end{align*}

Given $\spclvec{\widetilde{n}} \in \sO$, we know that $\mth^*$ satisfies the $\minplus$ property. Thus, $\exists \sI' \subseteq [r]$ such that:
\be
\label{eq:minplus_cond}
\sum\limits_{i \in \sI'}\vw^*_i \cdot\spclvec{\widetilde{n}} + b^*_i \ge 2
\ee

We will show that there exists $\hat{\sI} \subseteq [r]$ such that:
\be
\label{eq:minplus_cond_2}
\sum\limits_{i' \in \hat{\sI}}\hat{\vw}_{i'} \cdot \spclvec{\widetilde{n}} + \hat{b}_{i'} \ge 2
\ee

If $\widetilde{n} = n$, then:
\be
\hat{\vw}_{\hat{i}} \cdot \spclvec{\widetilde{n}} + \hat{b}_{\hat{i}}  = \vt^*_n \cdot \spclvec{\widetilde{n}} + BT(\vt^*_n) = \normone{\vt^*_n} - \normone{\vt^*_n} + 2 = 2
\ee
Therefore, by choosing $\hat{\sI} = \{\hat{i}\}$, we can show that $\hat{\mth}$ satisfies \eqref{eq:minplus_cond_2}.

Otherwise, $\widetilde{n} \neq n$ and then every $i \in \sI$ satisfies:
\be
\vw^*_i \cdot \spclvec{\widetilde{n}} + b^*_i =  - \lambda_i |\sA_n| - \lambda_i (|\sA_n| - 2) < 0
\ee 
From \eqref{eq:better_solution_definition} we can conclude:
\be
\sum\limits_{i \in \sI' \backslash \sI}\hat{\vw}_i \cdot\spclvec{\widetilde{n}} + \hat{b}_i = \sum\limits_{i \in \sI' \backslash \sI} \vw^*_i \cdot\spclvec{\widetilde{n}} + b^*_i \ge \sum\limits_{i \in \sI'}\vw^*_i \cdot\spclvec{\widetilde{n}} + b^*_i \ge 2
\ee
Therefore, $\hat{\mth}$ satisfies \eqref{eq:minplus_cond_2} for $\spclvec{\widetilde{n}}$. In addition, by Property 2 of Lemma \ref{lem:neuron_form} and \eqref{eq:better_solution_definition}, we know that all the weights of the neurons in $\hat{\mth}$ are nonnegative. Then, $\hat{\mth}$ meets the condition of Lemma \ref{lem:positive_weight_property} and it satisfies the $\minplus$ property.
 
 According to Property 1 of Lemma \ref{lem:neuron_form} and \eqref{eq:better_solution_definition}:
\be
\forall i \in [r] \backslash \sI \ds BT(\hat{\vw}_i) = BT(\vw^*_i) = b^*_i = \hat{b}_i
\ee
In addition, we know that $\forall i \in \sI \backslash \{\hat{i}\} \ds \ds \hat{b}_i = 0 = BT(\hat{\vw}_i)$ and $\hat{b}_{\hat{i}} = BT(\hat{\vw}_{\hat{i}})$. Therefore, according to Lemma 
\ref{lem:bias_th_lemma}, the solution $\hat{\mth}$ is a perfect solution. 

We will prove that $\sum_{i \in [r]} \normtwo{(\vw^*_i, b^*_i)} > \sum_{i \in [r]} \normtwo{(\hat{\vw}_i, \hat{b}_i)}$. This will contradict the optimality of $\mth^*$. Indeed:
 \begin{align*}
      \sum_{i \in [r]} \normtwo{(\vw^*_i, b^*_i)} - \sum_{i \in [r]} \normtwo{(\hat{\vw}_i, \hat{b}_i)} & = \sum_{i \in \sI} \normtwo{(\vw^*_i, b^*_i)} - \normtwo{(\hat{\vw}_{\hat{i}}, \hat{b}_{\hat{i}})} \numberthis \\
     & = \sum_{i \in \sI} \left( \lambda_i^2|\sA_n| + \lambda_i^2(|\sA_n| - 2)^2 \right) - |\sA_n| - (|\sA_n| - 2)^2 \\
    & = \left(\sum_{i \in \sI} \lambda_i^2  - 1\right)\left( |\sA_n| + (|\sA_n| - 2)^2 \right)
 \end{align*}
 Since $\sum_{i \in \sI} \lambda_i> 1$, we have $\sum_{i \in [r]} \normtwo{(\vw^*_i, b^*_i)} - \sum_{i \in [r]} \normtwo{(\hat{\vw}_i, \hat{b}_i)} > 0$, which completes the proof.
 
\end{proof}

\subsection{Finishing the Proof of Theorem \ref{thm:min_norm}}
\label{sec:finish_norm_proof}

\begin{proof}
Given a min-norm solution $\mth^* = (\mW^*, \vb^*)$, by Lemma \ref{lem:neuron_align}, each neuron $i \in [r]$ either aligns with some term $n_i \in [K]$ or is equal to 0. Assume by contradiction that there exists a term $n \in [K]$ that is not aligned, namely $\forall i \in [r] \ds n_i \neq n$. Consider the special positive sample $\spclvec{n} \in \sS_+$. From the definition of $\spclvec{n}$, every $j \in [D] \backslash \sA_n$ satisfies $\spclscal{n}_j = -1$. Then, 
\be
\forall i \in [r] \ds \ds \spclvec{n} \cdot \vw^*_i =  \lambda_i \spclvec{n} \cdot \vt^*_{n_i} = - \lambda_i \normone{\vt^*_{n_i}} \le 0
\ee
By Property 1 of Lemma \ref{lem:neuron_form}, $\forall i \in [r], \ds b^*_i = BT(\vw^*_i) \le 0$. Therefore, $\forall i \in [r]: \ds \spclvec{n} \cdot \vw^*_i + b^*_i \le 0$, in contradiction to the fact that $\mth^*$ is a perfect solution.

In conclusion, every term $n \in [K]$ aligns with a set of neurons $\sI \subseteq [r]$. By Lemma \ref{lem:sum_of_neuron_align}, we know that $\sum\limits_{i \in \sI} \lambda_i = 1$. By Lemma \ref{lem:2_neuron_align_equivallent}, we know that every two neurons $i_1, i_2$ that align with the same term satisfy $\lambda_{i_1} = \lambda_{i_2}$. Therefore, $\mth^*$ is a DNF recovery solution.
\end{proof}
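
As a brief illustration of the resulting structure (hypothetical values, for exposition only), consider a term with $|\sA_n| = 3$ that is aligned by exactly two neurons, each with $\lambda_i = \frac{1}{2}$, and assume the aligned form $\vw^*_i = \lambda_i \vt^*_n$, $b^*_i = \lambda_i(2 - \normone{\vt^*_n})$ from the proof of Lemma \ref{lem:neuron_align}. On the special sample $\spclvec{n}$ each of the two neurons contributes
\[
\vw^*_i \cdot \spclvec{n} + b^*_i = \tfrac{1}{2} \cdot 3 + \tfrac{1}{2}(2 - 3) = 1,
\]
so the two contributions sum to $2$, and the $\minplus$ condition for $\spclvec{n}$ holds with equality.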


\begin{figure}[t!]

\hspace*{\fill}
\begin{subfigure}{.49\textwidth}
  \centering
  % include first image
  \includegraphics[width=\linewidth]{figures/D=9_comparsion.png}
  \caption{}
  \label{fig:D=9_comparsion}
\end{subfigure}
\begin{subfigure}{.49\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=40_comparsion.png}  
  \caption{}
  \label{fig:D=40_comparsion}
\end{subfigure}
\begin{subfigure}{.49\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=50_comparsion.png}  
  \caption{}
  \label{fig:D=50_comparsion}
\end{subfigure}
\begin{subfigure}{.49\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=100_comparsion.png}  
  \caption{}
  \label{fig:D=100_comparsion}
\end{subfigure}
\hspace*{\fill}
\caption{Test accuracy for the convex network with small initialization, the convex network with large initialization, and standard networks. Panel (a) shows the performance when learning $f^*_1$, panel (b) for $f^*_3$, panel (c) for $f^*_4$ and panel (d) for $f^*_5$ (results for $f^*_2$ were presented in the main paper).}
\label{fig:all_compersion}
\end{figure} 

\begin{figure}[t]
\hspace*{\fill}
\begin{subfigure}{.24\textwidth}
  \centering
  % include first image
  \includegraphics[width=\linewidth]{figures/D=9_cluster_100.png}
  \caption{}
  \label{fig:D=9_cluster_100}
\end{subfigure}
\begin{subfigure}{.24\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=9_cluster_250.png}  
  \caption{}
  \label{fig:D=9_cluster_250}
\end{subfigure}
\begin{subfigure}{.24\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=9_cluster_500.png}  
  \caption{}
  \label{fig:D=9_cluster_500}
\end{subfigure}
\begin{subfigure}{.24\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=9_cluster_500_memorization.png}  
  \caption{}
  \label{fig:D=9_cluster_500_memorization}
\end{subfigure}
\hspace*{\fill}
\caption{(a-c) Effect of training size on the learned model (see {\bf Learned models for different training sizes} in the text) for the ground-truth model $f^*_1$. Panels a-c correspond to training sizes 100, 250 and 500. (d) Result for training on 500 samples with large initialization. 
\newline}
\label{fig:D=9_cluster}
\end{figure} 


\begin{figure}[t]
\hspace*{\fill}
\begin{subfigure}{.24\textwidth}
  \centering
  % include first image
  \includegraphics[width=\linewidth]{figures/D=25_cluster_800.png}
  \caption{}
  \label{fig:D=25_cluster_800}
\end{subfigure}
\begin{subfigure}{.24\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=25_cluster_1500.png}  
  \caption{}
  \label{fig:D=25_cluster_1500}
\end{subfigure}
\begin{subfigure}{.24\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=25_cluster_7500.png}  
  \caption{}
  \label{fig:D=25_cluster_7500}
\end{subfigure}
\begin{subfigure}{.24\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=25_cluster_7500_memorization.png}  
  \caption{}
  \label{fig:D=25_cluster_7500_memorization}
\end{subfigure}
\hspace*{\fill}
\caption{Same as \figref{fig:D=9_cluster} but with $f^*_2$ as ground truth.\newline}
\label{fig:D=25_cluster}
\end{figure} 

\begin{figure}[t!]
\hspace*{\fill}
\begin{subfigure}{.24\textwidth}
  \centering
  % include first image
  \includegraphics[width=\linewidth]{figures/D=40_cluster_1000.png}
  \caption{}
  \label{fig:D=40_cluster_1000}
\end{subfigure}
\begin{subfigure}{.24\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=40_cluster_2500.png}  
  \caption{}
  \label{fig:D=40_cluster_2500}
\end{subfigure}
\begin{subfigure}{.24\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=40_cluster_10000.png}  
  \caption{}
  \label{fig:D=40_cluster_10000}
\end{subfigure}
\begin{subfigure}{.24\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=40_cluster_10000_memorization.png}  
  \caption{}
  \label{fig:D=40_cluster_10000_memorization}
\end{subfigure}
\hspace*{\fill}
\caption{Same as \figref{fig:D=9_cluster} but with $f^*_3$ as ground truth.\newline}
\label{fig:D=40_cluster}
\end{figure} 

\begin{figure}[t!]
\hspace*{\fill}
\begin{subfigure}{.24\textwidth}
  \centering
  % include first image
  \includegraphics[width=\linewidth]{figures/D=50_cluster_2500.png}
  \caption{}
  \label{fig:D=50_cluster_2500}
\end{subfigure}
\begin{subfigure}{.24\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=50_cluster_8000.png}  
  \caption{}
  \label{fig:D=50_cluster_8000}
\end{subfigure}
\begin{subfigure}{.24\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=50_cluster_30000.png}  
  \caption{}
  \label{fig:D=50_cluster_30000}
\end{subfigure}
\begin{subfigure}{.24\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=50_cluster_30000_memorization.png}  
  \caption{}
  \label{fig:D=50_cluster_30000_memorization}
\end{subfigure}
\hspace*{\fill}
\caption{Same as \figref{fig:D=9_cluster} but with $f^*_4$ as ground truth.\newline}
\label{fig:D=50_cluster}
\end{figure} 

\section{Experiment Details and Additional Results}
\label{sec:technical_details}

\paragraph{Selected read-once DNFs: }In this section we provide additional experiments  for the following types of read-once DNFs.
\begin{itemize}
    \item \textbf{$f^*_1$ - } 3-term read-once DNF where the length of every term is 3, for $D=9$
    \item \textbf{$f^*_2$ - } 4-term read-once DNF where the term lengths are 4,5,5,6, for $D=25$
    \item \textbf{$f^*_3$ - } 8-term read-once DNF where the term lengths are 3,3,4,4,4,4,5,5, for $D=40$
    \item \textbf{$f^*_4$ - } 10-term read-once DNF where the term lengths are 3,3,4,4,4,4,6,6,6,6, for $D=50$
    \item \textbf{$f^*_5$ - } 15-term read-once DNF where the length of every term is 5, for $D=100$
\end{itemize}

\paragraph{General details: } In all the experiments, ``small initialization'' refers to initializing weights from $\wmat{0} \sim \mathcal{N}(0, 10^{-6})$ and $\bvec{0} = [0]^r$. The learning rate for SGD is $\eta = 10^{-3}$, the number of hidden units is $r=2000$ and the batch size is 32. We create the train set by sampling uniformly from $[\pm1]^D$. All the experiments can run on any single GPU. Training a single network can take up to two hours.

\newpage

\paragraph{Weight Matrix Visualization: } When presenting weight matrices, we first cluster the neurons using a hierarchical clustering algorithm.\footnote{We used \texttt{scipy.cluster.hierarchy.linkage} with \texttt{centroid} as the method.} We then plot the weight values in an image, where neurons clustered together appear in consecutive rows. Note that this does not change the model itself, but makes it easy to see whether there are well-clustered neurons (as in the DNF recovery case).


\paragraph{Sample Complexity Experiments:}  We evaluate test accuracy as a function of the training sample size for different models.  Results are shown in \figref{fig:all_compersion}. Specifically, we compare the convex network with small initialization (see details above),  the convex network with large initialization (we take $\wmat{0} \sim \mathcal{N}(0, 1)$), and a ``standard'' network with one hidden layer, the same width as the convex network and Xavier initialization (we checked different initialization schemes, including small Gaussian initialization, and verified that this does not affect the results). We run every experiment $10$ times and present the mean performance and the standard deviation of this mean (the standard deviation is smaller than the line width).

\paragraph{Implementation of the Statistical Queries (SQ) Methods: } In  \figref{fig:D=9_comparison} of the main paper, we present results of the statistical query algorithm. We implemented the algorithm described in \cite{mansour2001entropy_sup}. We view the $\epsilon$ therein as a hyperparameter. Therefore, we use 10\% of the train set as validation for finding $\epsilon$. We present the performance of the SQ algorithm only for $D=9$, because for larger dimension the algorithm failed in creating a DNF for the range of train set sizes tested.

\paragraph{Learned models for different training sizes:} In the main paper, we show empirically that learning convex networks with small initialization and GD leads to a DNF recovery solution, and we also show formally that in the population risk DNF recovery is norm minimizing. Here we show explicit model weights for different training sizes, demonstrating that approximate DNF recovery solutions are obtained for fairly small sample sizes. Figures \ref{fig:D=9_cluster},\ref{fig:D=25_cluster},\ref{fig:D=40_cluster},\ref{fig:D=50_cluster},\ref{fig:D=100_cluster} (panels a-c) show these results. In panel d of these figures we show the learned model when learning with large Gaussian initialization and the same train set size. It can be seen that large initialization does not result in a DNF recovery solution (note that we also visualize these solutions using clustering as explained above, and there is clearly no cluster structure in the solution).

 \paragraph{DNF reconstruction:} In \figref{fig:D=9_reconstruction} of the main paper we present accuracy results for DNF reconstruction. To obtain these, we take the learned model and apply a simple rounding procedure to check if this model reconstructs the ground-truth DNF. The procedure is outlined in Algorithm \ref{alg:cap}. In the procedure, we create a $\{0, 1\}$ matrix $\mW'$ where the column indices of $1$s in each row correspond to a term of a DNF. Thus, $\mW'$ represents a set of terms. If the set of terms of $\mW'$ is exactly the set of terms of the input DNF, the procedure returns True.  In our experiments we ran the procedure with inputs $A = [0, 0.1, 0.2, \ldots , 0.9]$ and $B = [0, 0.2, 0.4, \ldots , 0.8]$.
 
\begin{algorithm}[t!]
\caption{Reconstruction Procedure}\label{alg:cap}
\hspace*{\algorithmicindent} \textbf{Input:} Network $\mth = (\mW, \vb, c)$, DNF $f^*$, fixed sets  $A, B \subseteq [0,1]^L$.\\
 \hspace*{\algorithmicindent} \textbf{Output:} True if the network with parameter $\mth$ reconstructs DNF $f^*$, False otherwise. 
\begin{algorithmic}
\For{$(a, b) \in A \times B$}
\State $\mW' \gets [0]^{r \times D}$ \Comment{$\mW'$ will be a $\{0, 1\}$ matrix where each row corresponds to a term in a DNF.} 
\For{$1 \le i \le r$}
\If{$\ell_{\infty}(\vw_i) \ge a * \ell_{\infty}(\mW)$} \Comment{Taking into account only meaningful neurons} 
\For{$1 \le j \le D$}
\If{$w_{ij} \ge \ell_{\infty}(\vw_i) * b$} \Comment{Taking into account only meaningful values} 
\State $w'_{ij} \gets 1$
\EndIf
\EndFor
\EndIf
\EndFor 
\If{The set of terms represented by $\mW'$ is exactly the set of terms of the DNF $f^*$}
\State \Return True
\EndIf
\EndFor
\State \Return False
\end{algorithmic}
\end{algorithm}
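
As a small illustration of a single iteration of Algorithm \ref{alg:cap} (hypothetical values, not taken from our experiments), take $r = 2$, $D = 4$, thresholds $(a, b) = (0.1, 0.2)$ and
\[
\mW = \begin{pmatrix} 0.9 & 1.0 & 0.8 & 0.05 \\ 0.02 & 0.01 & 0.03 & 0 \end{pmatrix}, \qquad \ell_{\infty}(\mW) = 1.0 .
\]
For the first neuron, $\ell_{\infty}(\vw_1) = 1.0 \ge a \cdot \ell_{\infty}(\mW) = 0.1$, so the neuron is kept, and the entries satisfying $w_{1j} \ge b \cdot \ell_{\infty}(\vw_1) = 0.2$ are $j \in \{1, 2, 3\}$. For the second neuron, $\ell_{\infty}(\vw_2) = 0.03 < 0.1$, so it is skipped. The result is
\[
\mW' = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix},
\]
which, under the assumption that all-zero rows are ignored, represents the single term over the variables $\{1, 2, 3\}$. The procedure would return True for this $(a, b)$ exactly when the ground-truth DNF consists of that term alone.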

\paragraph{The effect of learning $c$:} \figref{fig:all_fix_c} shows that fixing the learnable parameter $c$ to $-1$ does not affect the structure of the solution. In this experiment, we took two networks with the same width: one with learnable $c$ initialized to $0$, and the second with fixed $c=-1$. We initialize the other weights with the same values, and train both networks on the same train set for the same number of steps. Finally, we plot the solution that each network learns. Both networks recover the underlying DNF.

 \paragraph{Tabular datasets:} We consider three UCI datasets: kr-vs-kp, Splice, and diabetes. For these, we convert the input to binary by changing categorical variables to one-hot encodings. We treat each task as binary classification: in kr-vs-kp the class `won' is considered positive and `nowin' is considered negative, in Splice the classes `EI' and `IE' are considered positive and `N' negative, and diabetes is binary by design.  We train on $90\%$ of the data and test on $10\%$. The reconstruction process is identical to Algorithm \ref{alg:cap}, except that instead of checking whether $\mW'$ is identical to $f^*$, we return the $\mW'$ with the best accuracy on the train set.

The relevant code can be found in our repository: \url{https://github.com/idobronstein/Exploring-the-Inductive-Bias-of-Neural-Networks-for-Learning-Read-once-DNFs.git}.

\begin{figure}[t!]
\hspace*{\fill}
\begin{subfigure}{.24\textwidth}
  \centering
  % include first image
  \includegraphics[width=\linewidth]{figures/D=100_cluster_6000.png}
  \caption{}
  \label{fig:D=100_cluster_6000}
\end{subfigure}
\begin{subfigure}{.24\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=100_cluster_15000.png}  
  \caption{}
  \label{fig:D=100_cluster_15000}
\end{subfigure}
\begin{subfigure}{.24\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=100_cluster_40000.png}  
  \caption{}
  \label{fig:D=100_cluster_40000}
\end{subfigure}
\begin{subfigure}{.24\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=100_cluster_40000_memorization.png}  
  \caption{}
  \label{fig:D=100_cluster_40000_memorization}
\end{subfigure}
\caption{Same as \figref{fig:D=9_cluster} but with $f^*_5$ as ground truth.\newline}
\label{fig:D=100_cluster}
\end{figure} 

\begin{figure}[t!]
\begin{subfigure}{.24\textwidth}
  \centering
  % include first image
  \includegraphics[width=\linewidth]{figures/D=9_normal.png}
  \caption{}
  \label{fig:D=9_normal}
\end{subfigure}
\begin{subfigure}{.24\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=9_fix.png}  
  \caption{}
  \label{fig:D=9_fix}
\end{subfigure}
\begin{subfigure}{.24\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=50_normal.png}  
  \caption{}
  \label{fig:50_normal}
\end{subfigure}
\begin{subfigure}{.24\textwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/D=50_fix.png}  
  \caption{}
  \label{fig:50_fix}
\end{subfigure}
\hspace*{\fill}
\caption{(a) Learning $f^*_1$ using a convex network with learnable $c$ and 250 train samples. (b) Learning $f^*_1$ using a convex network with $c$ fixed to $-1$ and 250 train samples. (c) Learning $f^*_4$ using a convex network with learnable $c$ and 30,000 train samples. (d) Learning $f^*_4$ using a convex network with $c$ fixed to $-1$ and 30,000 train samples.}
\label{fig:all_fix_c}
\end{figure} 