\section{Introduction}

The training objective of overparameterized neural networks is non-convex and contains multiple global minima with different generalization properties. Therefore, just minimizing the training objective does not guarantee good generalization performance. Nonetheless, neural networks trained in practice with gradient-based methods show good test performance across numerous tasks \citep{krizhevsky2012imagenet, silver2016mastering}, suggesting an \textit{inductive bias} towards desirable solutions. Understanding this inductive bias and how it depends on the algorithm, architecture and data is one of the major open problems in machine learning \citep{ZhangBHRV16,neyshabur2018towards}.

In recent years, there have been major efforts to tackle this challenge. One line of work considers the Neural Tangent Kernel (NTK) approximation of neural networks, under which training reduces to a convex optimization problem \citep{jacot2018neural}. However, it has been shown that the NTK approximation is limited and does not accurately model neural networks as they are used in practice \citep{yehudai2019power, daniely2020learning}.

Other works tackle the non-convexity directly, usually in very simplified settings \citep[e.g., diagonal linear networks,][]{woodworth2019kernel} or in special cases such as regression with two-layer models and Gaussian distributions \citep{li2020learning} or infinitely wide two-layer networks \citep{chizat2020implicit}.

\begin{figure*}[t]
\hspace*{\fill}
\begin{subfigure}{.4\textwidth}
  % include first image
  \includegraphics[width=\linewidth]{figures/D=9_recovery.png}
  \caption{}
  \label{fig:gd_global_minimum}
\end{subfigure}
\hspace*{\fill}
\begin{subfigure}{.4\textwidth}
  % include second image
  \includegraphics[width=\linewidth]{figures/D=9_memorization.png}
  \caption{}
  \label{fig:memorizetion_global_minimum}
\end{subfigure}
\hspace*{\fill}
\caption{The weight vectors for two convex networks that
perfectly fit $250$ training samples labeled with the read-once DNF: $(x_1 \land x_2 \land x_3) \lor (x_4 \land x_5 \land x_6) \lor (x_7 \land x_8 \land x_9)$. (a) A network trained with SGD using small Gaussian initialization. The weights can be seen to be well aligned with the DNF terms, and the test accuracy is 100\%. (b) A network whose weight vectors are equal to the positive samples in the training set. This network ``memorizes'' the training data and fits it perfectly. However, the test error is 81.6\%.}
\label{fig:global_minima}
\end{figure*} 

Here we focus on the important problem of learning Boolean functions with neural networks. While much is known about this problem from a computational and statistical perspective, little is understood about how such functions can be learned with neural networks, and in particular about the inductive bias of gradient descent in this case.
In computational learning theory, the problem of learning disjunctive normal forms (DNFs) has a long history. Learning DNFs is hard \citep{pitt1988computational}, and the best known algorithms for learning DNFs under the uniform distribution run in quasi-polynomial time \citep{verbeurgt1990learning}. On the other hand, for learning \textit{read-once} DNFs under the uniform distribution, efficient learning algorithms exist \citep{mansour2001entropy}.\footnote{In a read-once DNF each literal appears at most once. See Section~\ref{sec:problem_formulation} for a formal definition.} It is therefore natural to ask whether neural networks can learn read-once DNFs under the uniform distribution, and this motivates the study of the inductive bias in this case.

We focus specifically on a simple neural architecture with one hidden layer, ReLU activations, and output weights fixed to one. We refer to it as a convex network, because in this case the network output is a convex function of its inputs. It is easy to see that this architecture is sufficiently expressive for learning DNFs. We show that for learning read-once DNFs there exist solutions that perfectly classify the training set yet have significantly different properties. Specifically, there are solutions that memorize the training points in their neurons, and other solutions whose neurons align exactly with the terms of the DNF, which we call DNF-recovery solutions. Figure~\ref{fig:global_minima}a--b shows an example of these solutions.\footnote{\label{foot:weight_figure_explanation} Position $(j,i)$ in the figure corresponds to entry $j$ of the weight vector $\vw_i$ (see \eqref{eq:network}), and the color represents its value at the end of the learning process.}
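
As a rough sketch of this architecture, assume $r$ hidden neurons with weight vectors $\vw_1,\dots,\vw_r$ and no bias terms (the symbol $r$ and the omission of biases are illustrative assumptions; the exact form is the one given in \eqref{eq:network}). Such a network would compute
\begin{equation*}
  \mathbf{x} \mapsto \sum_{i=1}^{r} \max\{0,\, \vw_i \cdot \mathbf{x}\},
\end{equation*}
which is convex in $\mathbf{x}$ because each summand is the maximum of two linear functions, and a sum of convex functions is convex.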

Our first empirical finding is that SGD with small initialization converges to a DNF-recovery solution. This indicates a strong inductive bias of gradient methods towards simple logical forms in this case. We further observe that this bias allows the convex network to generalize better than algorithms designed specifically for learning read-once DNFs. Together these empirical observations establish neural nets as an attractive approach to learning read-once DNFs.

Given the above, we ask what can explain this inductive bias of gradient methods, and what theoretical guarantees can be obtained for their performance. We turn to a recent line of work \citep{soudry2018implicit,ji2018gradient,lyu2019gradient,ji2020directional,chizat2020implicit} that studies the inductive bias of gradient flow (GF) in several settings. These results suggest that GF is biased towards Karush-Kuhn-Tucker (KKT) points or global solutions of minimum-norm problems (or, analogously, maximum-margin problems). Motivated by these results, we prove that any norm-minimizing solution in our setting is a DNF-recovery solution, which strongly suggests why GF converges to such solutions. We further strengthen this result by proving that memorizing solutions (namely, solutions in which some neurons are activated only by specific inputs, as in Figure~\ref{fig:memorizetion_global_minimum}) are not KKT points of the min-norm problem. Therefore, GF will not converge to these.
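
For concreteness, the minimum-norm problem in this line of work is typically stated in the following generic form. This is only a sketch and not the exact problem analyzed here: $\theta$ denotes all trainable parameters, $f(\theta;\mathbf{x})$ a homogeneous predictor, and $(\mathbf{x}_k, y_k)_{k=1}^{n}$ the training samples with labels $y_k \in \{\pm 1\}$, all of which are notation assumed for illustration.
\begin{equation*}
  \min_{\theta} \; \frac{1}{2}\|\theta\|_2^2
  \quad \text{subject to} \quad
  y_k f(\theta; \mathbf{x}_k) \ge 1 \quad \text{for all } k = 1,\dots,n.
\end{equation*}
Where $f$ is differentiable, a KKT point of this problem is a feasible $\theta$ for which there exist multipliers $\lambda_1,\dots,\lambda_n \ge 0$ such that $\theta = \sum_{k=1}^{n} \lambda_k y_k \nabla_{\theta} f(\theta;\mathbf{x}_k)$ and $\lambda_k = 0$ whenever $y_k f(\theta;\mathbf{x}_k) > 1$; for ReLU networks the gradient is replaced by a suitable subgradient.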

We corroborate our findings with empirical results which show that our conclusions hold more broadly. Specifically, we perform experiments on DNFs of higher dimension and on standard one-hidden-layer neural networks. Taken together, our results demonstrate that gradient methods can recover simple descriptions of Boolean functions from data, which results in good generalization performance and may have important implications for interpretability.