\section{Empirical Observations on the Inductive Bias}
\label{sec:empirical_observations}

Overparameterized networks (i.e., those with large $r$) can fit the training data with many different solutions. Some of these solutions generalize well, while others overfit. When training is performed with gradient methods, only certain solutions are found. In other words, gradient methods have an \textit{inductive bias} towards certain solutions. Understanding this inductive bias is key to understanding the generalization performance of gradient methods in practice.


\figref{fig:gd_global_minimum} shows an example of the weight vectors learned by an overparameterized convex network (\eqref{eq:network}) optimized with SGD. Although the network has many weight vectors, they form tight clusters around the terms of the true underlying DNF. This network attains perfect accuracy on both the training set and the test set.

We further devise a simple procedure to recover the DNF terms from the neurons. The procedure removes low-norm neurons and rounds the weights; the exact details are provided in the supplementary. \figref{fig:D=9_reconstruction} shows that the DNF recovery procedure can accurately find the DNF terms. In the supplementary we provide further details on this experiment and show many more examples of the inductive bias of SGD towards solutions that align with terms.
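
As a purely illustrative sketch of such a recovery step (the exact procedure is the one given in the supplementary), one hypothetical instantiation is the following. The symbols $\mathbf{w}_i$, $\tau$, $S$, and $\hat{\mathbf{w}}_i$, as well as the normalization by the largest entry, are assumptions made for this sketch only. Given learned weight vectors $\mathbf{w}_1,\dots,\mathbf{w}_r \in \mathbb{R}^D$ and an assumed norm threshold $\tau > 0$, keep the neurons indexed by
\[
S = \left\{\, i \in \{1,\dots,r\} \;:\; \|\mathbf{w}_i\|_2 \ge \tau \,\right\},
\]
and round each retained vector coordinate-wise after rescaling by its largest absolute entry:
\[
\hat{\mathbf{w}}_i[j] = \mathrm{round}\!\left( \frac{\mathbf{w}_i[j]}{\max_{k} |\mathbf{w}_i[k]|} \right) \in \{-1, 0, 1\}, \qquad i \in S,\; 1 \le j \le D .
\]
Under these assumptions, the distinct vectors among $\{\hat{\mathbf{w}}_i\}_{i \in S}$ would be read as the candidate terms.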

Another solution that minimizes the training error is shown in \figref{fig:memorizetion_global_minimum}. In this solution, the network memorizes the positive training points in its neurons.\footnote{See a formal definition of memorization in the next section.} However, this network does not generalize well: it attains only 81\% test accuracy.
Thus, we see that SGD converges to the true DNF and not to a ``memorization'' solution, despite the latter also minimizing the training loss. 

These observations raise the following intriguing questions:
\begin{enumerate}
    \item Why do gradient methods have an inductive bias towards solutions that align with the terms of the DNF?
    \item Why do gradient methods not converge to solutions that overfit and memorize training points in their neurons?
\end{enumerate}

In the next sections we provide theoretical results that address these questions.
