\section{DNF Recovery as Norm Minimization}
\label{sec:theory}

\begin{figure*}[t!]
\hspace*{\fill}
\begin{subfigure}{.32\textwidth}
  \centering
  % include first image
  \includegraphics[width=\linewidth]{figures/D=25_comparsion.png}  
  \caption{}
  \label{fig:comparison}
\end{subfigure}
\hspace*{\fill}
\begin{subfigure}{.32\textwidth}
  \centering
  % include second image
  \includegraphics[width=\linewidth]{figures/D=100_cluster_readonce.png}
  \caption{}
  \label{fig:cluster_readonce}
\end{subfigure}
\hspace*{\fill}
\begin{subfigure}{.32\textwidth}
  \centering
  % include second image
  \includegraphics[width=\linewidth]{figures/D=100_cluster_40000_memorization.png}  
  \caption{}
  \label{fig:cluster_readonce_memorizarion}
\end{subfigure}
\caption{(a) {\bf Accuracy Comparison for Different Models}: Test performance of a convex network with small Gaussian initialization, a convex network with large Gaussian initialization, and a standard two-layer network with Xavier initialization. Here $D=25$ and the target DNF has 4 terms of sizes $4$, $5$, $5$ and $6$. Each dot corresponds to the mean of 10 experiments with different initializations. (b) {\bf SGD Learns a DNF recovery solution for $D=100$:} The last 25 dimensions are noisy inputs whose corresponding literals do not appear in the DNF. \footref{foot:weight_figure_explanation} (c) {\bf SGD with large initialization does not learn a DNF recovery solution:} The target function is the same as in Figure \ref{fig:cluster_readonce}. The only difference is the initialization size. \footref{foot:weight_figure_explanation}} 
\label{fig:exps}
\end{figure*} 

In the previous section we showed that GF does not converge to memorization solutions. This result follows because GF converges to KKT points of the minimum norm problem, and memorization solutions are not KKT points. However, we would also like to understand which KKT points GF does converge to.

To address this question, we focus on characterizing KKT points which are also \textit{global minimizers} of the minimum norm problem. The reason is that there is growing theoretical evidence that GF is biased towards solutions that minimize norms (or analogously, solutions that maximize margins). This has been shown for logistic regression \citep{soudry2018implicit} and linear networks \citep{ji2018gradient}.

In a setting which is closest to ours, \citet{chizat2020implicit}  show that GF trained on two-layer nonlinear networks converges to maximum margin solutions. Their result holds for infinite 2-homogeneous networks with squared ReLU activations. Therefore, it cannot be applied in our setting. Nonetheless, all of the aforementioned results provide a strong motivation to study minimum norm solutions in our setting, to better understand the inductive bias of GF.


Analyzing the global solutions of the  optimization problem in \eqref{eq:maxmargin} is a major challenge since the problem is nonconvex. To make headway, in this section we analyze these solutions under two  technical assumptions (see Assumption \ref{assump:cminus1} and Assumption \ref{assump:population}). We define DNF recovery solutions (see Definition \ref{def:recovery}) as solutions which are aligned with the terms of the DNF (similar to Figure \ref{fig:gd_global_minimum}). We then prove our main result: that a network that globally optimizes \eqref{eq:maxmargin} is a DNF recovery solution. This means that if GF globally optimizes \eqref{eq:maxmargin} then it must converge to a DNF recovery solution. Furthermore, this result is in line with our experiments in Section \ref{sec:empirical} which show that GF converges to DNF recovery solutions.

We next formally define DNF recovery solutions. We first define the alignment of a neuron with a term.

\begin{defn}
A neuron $i \in [r]$ is an aligning neuron with respect to a DNF $f^*$, if there exists $n \in [K]$ and $\lambda_i > 0$ such that $\vw_i = \lambda_i \vt_n^*$, and $b_i = \lambda_i(2-\left\|\vt_n^*\right\|_1)$. We refer  to the neuron $i$ as aligning with the term $n$, and to $\lambda_i$ as the alignment coefficient of $i$.
\end{defn}
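
As a quick sanity check of this definition, the following computation is a sketch under the assumption (not restated in this section) that inputs lie in $\{\pm 1\}^D$ and that $\vt_n^*$ is the $0/1$ indicator vector of the variables in term $n$. Let $m$ denote the number of variables of term $n$ that $\vx$ sets to $-1$ ($m$ is notation introduced only for this illustration). Then $\vt_n^* \cdot \vx = \left\|\vt_n^*\right\|_1 - 2m$, and for a neuron $i$ aligning with term $n$,
\[
\vw_i \cdot \vx + b_i = \lambda_i\left(\left\|\vt_n^*\right\|_1 - 2m\right) + \lambda_i\left(2 - \left\|\vt_n^*\right\|_1\right) = 2\lambda_i(1 - m).
\]
Under these assumptions, the pre-activation equals $2\lambda_i$ when $\vx$ satisfies term $n$ ($m = 0$) and is non-positive otherwise ($m \ge 1$).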

Next we define a DNF recovery solution.
\begin{defn}
\label{def:recovery}
$\mth$ is a DNF-recovery solution if $\forall n \in [K]$ there exists a set of neurons $\sI$ such that every $i \in \sI$ aligns with term $n$, $\sum_{i \in \sI} \lambda_i = 1$, $\forall i_1 , i_2 \in \sI \: \: \lambda_{i_1} = \lambda_{i_2}$ and all other neurons are zero.
\end{defn}

Thus, DNF recovery solutions are networks where all terms in $f^*$ have corresponding neurons aligned with them. Furthermore, all other neurons are zero. In other words, a recovery solution encodes the DNF explicitly in the weights of the network (and thus the DNF can be easily recovered from the weights). The conditions on the $\lambda_i$ are required for the global optimality results to hold. We note that any DNF recovery solution perfectly classifies the data.
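
To illustrate the last claim, consider a small example. It is a sketch under assumptions carried over from the earlier sections rather than restated here: inputs in $\{\pm 1\}^D$, $\vt_n^*$ the $0/1$ indicator vector of term $n$, and a network output of the form $\sum_{i \in [r]} \sigma(\vw_i \cdot \vx + b_i) + c$ with $\sigma$ the ReLU and $c = -1$. Take $D = 4$, $f^*(\vx) = (x_1 \land x_2) \lor (x_3 \land x_4)$, and two nonzero neurons with $\lambda_1 = \lambda_2 = 1$, so that $\vw_1 = (1,1,0,0)$, $\vw_2 = (0,0,1,1)$ and $b_1 = b_2 = 1 \cdot (2 - 2) = 0$. For the positive input $\vx = (1,1,-1,1)$ the pre-activations are $2$ and $0$, and the output is
\[
\sigma(2) + \sigma(0) - 1 = 1 .
\]
For the negative input $\vx = (1,-1,-1,1)$ both pre-activations are $0$, and the output is $\sigma(0) + \sigma(0) - 1 = -1$. In this hypothetical instance the sign of the output agrees with $f^*$ on both inputs.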

Next, we state our technical assumptions:

\begin{ass}
\label{assump:cminus1}
The output layer bias is fixed to $c=-1$, and $(\mW, \vb)$ are learned. % and the learnable parameters are $\mth = (\mW, \vb)$.
\end{ass}
\begin{ass}
\label{assump:population}
$\sS_{\vx} = \gX$, i.e., we are in the population setting.
\end{ass}

Assumption \ref{assump:cminus1} does not limit the expressive power of the network (see Section \ref{sec:expressive}) and in the supplementary we show qualitatively that fixing $c=-1$ does not change the inductive bias of GF.\footnote{Surprisingly, without this assumption the theoretical analysis becomes substantially more difficult. We leave the extension to any $c$ for future work.} 

Many works have studied the population setting (Assumption \ref{assump:population}) as a proxy to understand the performance in the empirical case \citep{daniely2020learning, brutzkus2017globally}. The population setting is a good test-bed to understand the inductive bias of GF since the loss has multiple zero training error solutions in this case (e.g., memorization and DNF recovery solutions).

We now state the main result of this section.

\begin{thm}\label{thm:min_norm}
Consider the minimum norm optimization problem \eqref{eq:maxmargin} when learning a read-once DNF under Assumption \ref{assump:cminus1} and Assumption \ref{assump:population}.
Then, any globally optimal solution $\mth^* = (\mW^*, \vb^*)$ of \eqref{eq:maxmargin} is a DNF-recovery solution.
\end{thm}

\begin{figure*}[t!]
\hspace*{\fill}
\begin{subfigure}{.4\textwidth}
  \centering
  % include second image
  \includegraphics[width=\linewidth]{figures/D=100_overlap.png}  
  \caption{}
  \label{fig:overlap}
\end{subfigure}
\hspace*{\fill}
\begin{subfigure}{.4\textwidth}
  \centering
  % include second image
  \includegraphics[width=\linewidth]{figures/D=20_cluster_overlap.png}  
  \caption{}
  \label{fig:cluster_overlap}
\end{subfigure}
\hspace*{\fill}
\caption{(a) {\bf Evaluating the Effect of Overlap:} Here we experimented with $D=100$ and all DNFs had 15 terms of size 5. We considered non read-once DNFs where the number of terms that share the variable $x_1$ varies. The $x$-axis shows the number of overlapping terms. The training size was $8{,}500$ for all DNFs. Each dot corresponds to the mean of 10 experiments with different initializations.  (b) {\bf SGD Does not Learn a DNF recovery solution:} We trained a convex network to learn the following 4-term DNF: $(x_1 \land x_2 \land x_3 \land x_4 \land x_5) \lor ( x_2 \land x_3 \land x_4 \land x_5 \land x_6) \lor (x_{10} \land x_{11} \land x_{12} \land x_{13} \land x_{14}) \lor ( x_{11} \land x_{12} \land x_{13} \land x_{14} \land x_{15})$ where $D=20$. The training set has 15,000 samples and the test classification accuracy is 100\%. \footref{foot:weight_figure_explanation} }
\label{fig:exps2}
\end{figure*} 

We next provide a high-level sketch of the proof. The full proof is given in the supplementary. 

First we notice two key properties of any perfect solution $\mth = (\mW, \vb)$ (Definition \ref{def:perfect}):\footnote{Note that the set of perfect solutions is exactly the set of feasible solutions of \eqref{eq:maxmargin}.} 

(1) $\forall\vx \in \sS_+$ there exists $\sI \subseteq [r]$ such that $\sum\limits_{i \in \sI}\left(\vw_i \cdot\vx + b_i\right) \ge 2$. (2) $\forall\vx \in \sS_-$, $\forall i \in [r] \: \:\, \vw_i \cdot \vx + b_i \le 0$. 

These properties are together necessary and sufficient for solutions to be perfect. Next, we show the following upper bound on the bias of every neuron in a perfect solution, which  depends on the weights of the neuron: 
\begin{lem}
\label{lem:main_paper_bias_th_lemma}
Under Assumption \ref{assump:cminus1} and Assumption \ref{assump:population}, if $\mth$ is a perfect solution then every $i \in [r]$ satisfies:
\begin{equation}
    b_i \le  - \normone{\vw_i} + 2 \sum\limits_{n \in [K]} \max\left\{\min\limits_{j \in \sA_{n}} \left\{w_{ij}\right\}, 0\right\}
\end{equation}
\end{lem}
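
As an illustration of the bound, the following check is a sketch assuming that $\sA_n$ denotes the index set of the variables in term $n$, that $\vt_n^*$ is its $0/1$ indicator vector (so $\left\|\vt_n^*\right\|_1 = |\sA_n|$), and that the sets $\sA_n$ are disjoint because $f^*$ is read-once. For a neuron $i$ aligning with term $n$ we have $\normone{\vw_i} = \lambda_i |\sA_n|$, $\min_{j \in \sA_n} w_{ij} = \lambda_i$, and $\min_{j \in \sA_{n'}} w_{ij} = 0$ for every $n' \neq n$, so the right-hand side of the bound becomes
\[
- \lambda_i |\sA_n| + 2\lambda_i = \lambda_i\left(2 - \left\|\vt_n^*\right\|_1\right) = b_i .
\]
Under these assumptions, an aligning neuron attains the bound with equality.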

This upper bound turns out to be tight for optimal solutions, as shown next. The following result characterizes the parameters of any globally optimal solution $\mth^*$. 
\begin{lem}
\label{lem:main_paper_neuron_characterize}
Under Assumption \ref{assump:cminus1} and Assumption \ref{assump:population}, if $\mth^*$ is a globally optimal solution then:
 \begin{enumerate} 
 \item For every $i \in [r]$:
 \begin{enumerate}[label*=\arabic*.]
    \item The bias achieves the upper bound in Lemma \ref{lem:main_paper_bias_th_lemma}:
    $$
        b_i^* = - \normone{\vw_i^*} + 2 \sum\limits_{n \in [K]} \max\left\{\min\limits_{j \in \sA_{n}} \left\{w_{ij}^*\right\}, 0\right\}
    $$
    \item  $\ds \vw_i^* \ge 0$
    \item $\exists n \in [K]$  such that neuron $i$ aligns with term $n$ or $\vw^*_i = \mathbf{0}, b^*_i = 0$.
    \end{enumerate}
    \item Neurons that align with the same term have the same alignment coefficient.
    \item If $\sI \subseteq [r]$ is the set of all neurons that align with term $n \in [K]$, then $\sum\limits_{i \in \sI} \lambda_i = 1$.
\end{enumerate}
\end{lem}

We prove the properties one by one, where each proof relies on the properties established before it. All of the proofs share a similar structure: given a globally optimal solution, we assume for contradiction that it does not satisfy a specific property. We then construct a different perfect solution that satisfies this property and has a lower norm than the original globally optimal solution, thus contradicting optimality. The theorem follows directly from the aforementioned properties.





