\section{Related Work}
\label{sec:related_work}

Recently, several works have studied the inductive bias of neural networks and shown connections between gradient methods and margin maximization \citep{chizat2020implicit, lyu2019gradient, nacson2019lexicographic,ji2020directional}. These works motivate our theoretical analysis of minimum norm solutions. In particular, we apply the results of \citet{lyu2019gradient, ji2020directional}, which show that gradient flow (GF) is biased towards KKT points of min-norm problems.

Other works study fully connected neural networks under certain assumptions on the data such as linearly separable data \citep{brutzkus2018sgd,sarussi2021towards,frei2021provable} or Gaussian data \citep{safran2018spurious,du2018gradient2}. \citet{malach2020implications} show that certain structured Boolean circuits can be learned with a network architecture that is specialized for their data structure. 

Fully connected networks were also analyzed via the NTK approximation \citep{jacot2018neural, du2018gradient2, du2018gradient, arora2019fine, ji2019polylogarithmic, cao2019generalization, fiat2019decoupling, allen2019convergence, li2018learning, daniely2016toward}. However, \citet{yehudai2019power, daniely2020learning} have highlighted limitations of the NTK framework, suggesting that it does not accurately model neural networks as they are used in practice.

Another line of work \citep{saad1996dynamics,goldt2019dynamics,tian2019student} studies neural networks in student-teacher regression settings and shows a ``specialization'' effect, where a subset of student neurons aligns with teacher neurons. The main difference from our setting is that we consider classification on binary data, whereas they consider regression tasks on non-discrete data (e.g., Gaussian). We note that classification settings present unique theoretical challenges for studying the inductive bias of gradient methods \citep{montanari2019generalization}.

In a recent result, \citet{phuonginductive} provide an end-to-end optimization analysis of a two-layer ReLU network on orthogonally separable data (a special case of linearly separable data). They consider the cross-entropy loss, and their analysis implies that neurons specialize to certain directions. We focus on a significantly more challenging setting, where the training data corresponds to a read-once DNF and is generally not linearly separable.

An inductive bias towards specializing solutions has also been observed by \citet{brutzkus2019larger} and proved for a simple setup with nonlinear data and a convolutional neural network. The notion of specialization is also related to the notion of ``collapse'' \citep{papyan2020prevalence}. We note that in our setting we do not observe the collapse phenomenon, since the hidden-layer representations of the positive samples are not all in one small cluster (e.g., see Figure~\ref{fig:gd_global_minimum}).\footnote{For the model in the figure, the representations of the positive points form 8 different clusters, which correspond to the possible combinations of the three terms.}

\citet{rudin2019stop} argues that methods for explaining large neural networks should be avoided because such networks are too complex for humans to understand. However, we show, albeit in a restricted setting, that learned networks can be rather simple and are easily mapped to the underlying DNF.
