\section{Problem Formulation}
\label{sec:problem_formulation}

\paragraph{DNFs and Read-Once DNFs:} In what follows, we use $[n]$ to denote the set $\{1,2,\ldots,n\}$. Let $\gX = \left\{\pm1\right\}^D$, where $D$ is the number of variables, and let $\gY = \left\{\pm1\right\}$. Boolean functions \citep[e.g., see][]{o2014analysis} are usually defined as maps from inputs with entries in $\{0,1\}$ to an output in $\{0,1\}$. In this work, we instead consider DNFs with input entries in $\left\{\pm1\right\}$ and output in $\left\{\pm1\right\}$.

A DNF is a disjunction of conjunctions, each over one or more literals. See the DNF in Figure \ref{fig:global_minima} for an example. For convenience, we will use the following notation for DNFs: A DNF with $K$ terms is defined via $K$ indicator vectors $\vt_1^*,\ldots,\vt_K^* \in \left\{0, 1\right\}^D$. We refer to each $\vt_n^*$ as a \textit{term} and define its set of active indices by $\sA_n= \group{j\in[D]}{t^*_{nj} = 1}$, where $t^*_{nj}$ is the $j$th entry of $\vt^*_n$. The corresponding DNF is given by the function $f^*:\gX \rightarrow \gY$ as follows: $f^*(\vx) = 1$ if $\exists n \in [K]\,\, \text{ s.t. }\ds \vx \cdot \vt^*_n = \left|\sA_n\right|$, and otherwise $f^*(\vx)=-1$. Notice that $f^*$ is monotone. We say that a sample $\vx \in \gX$ satisfies the term $\vt^*_n$ if $\vx \cdot \vt^*_n = \left|\sA_n\right|$. We refer to $\left|\sA_n\right|$ as the size of the term $\vt_n^*$.

To compare our notation with the standard one, for example, the DNF $(x_1 \wedge x_2) \vee (x_3 \wedge x_4)$ with 4 inputs has terms $\vt_1^*=(1,1,0,0)$ and $\vt_2^*=(0,0,1,1)$. We will use the standard notation when convenient (e.g., as in \figref{fig:global_minima}). 
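
As an illustrative check of this notation (the two inputs below are chosen only for illustration), consider the DNF $(x_1 \wedge x_2) \vee (x_3 \wedge x_4)$ and the input $\vx = (1,1,-1,1)$. Then
\[
\vx \cdot \vt_1^* = 1 + 1 = 2 = \left|\sA_1\right|, \qquad \vx \cdot \vt_2^* = -1 + 1 = 0 \neq \left|\sA_2\right|,
\]
so $\vx$ satisfies $\vt_1^*$ and $f^*(\vx) = 1$. In contrast, for $\vx' = (1,-1,-1,1)$ we have $\vx' \cdot \vt_1^* = \vx' \cdot \vt_2^* = 0$, so no term is satisfied and $f^*(\vx') = -1$.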

In this work we focus on {\bf \textit{read-once}} DNFs, where $\sA_i \cap \sA_j = \emptyset$ for all $i \neq j \in [K]$ and every term has size greater than 1.

\paragraph{Learning Setup:} 
Let $\mathcal{D}$ be a  distribution on $\mathcal{X} \times \mathcal{Y}$. We assume that  for $(\vx,y) \sim \mathcal{D}$, $\vx$ is sampled uniformly over the hypercube $\{\pm 1\}^D$ and $y=f^*(\vx)$, where $f^*$ is a monotone read-once DNF.
\footnote{In the case of the uniform distribution and read-once DNFs, we can assume monotone DNFs WLOG. This follows since any negated literal can be replaced with the original literal (without negation) and all our results still hold. Note that for non-read-once DNFs, this will not work because a variable can appear both positively and negatively and flipping its value will not make the DNF monotone.}

We consider learning $f^*$ given a training set $\sS \subseteq \gX \times \gY$, where each $(\vx, y) \in \sS$ is sampled IID from $\gD$, so that $\vx$ is uniform over $\gX$ and $y = f^*(\vx)$. Denote $\sS_x = \left\{\vx \mid (\vx,y) \in \sS \right\}$, the positive samples by $\sS_+ = \left\{\vx \mid (\vx, 1) \in \sS\right\}$, the negative samples by $\sS_- = \left\{\vx \mid (\vx, -1) \in \sS\right\}$ and the number of samples by $m = |\sS|$. In some cases we will consider the population case where $\sS_x = \gX$.
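
As a simple consequence of this setup (a sketch that uses only the uniformity of $\vx$ over $\gX$ and the disjointness of the terms), a fixed term $\vt_n^*$ is satisfied with probability $2^{-\left|\sA_n\right|}$. Since the terms of a read-once DNF depend on disjoint sets of coordinates, these events are independent, and therefore
\[
\Pr_{\vx}\left[f^*(\vx) = -1\right] = \prod_{n \in [K]} \left(1 - 2^{-\left|\sA_n\right|}\right).
\]
For instance, for the DNF $(x_1 \wedge x_2) \vee (x_3 \wedge x_4)$ this gives $\left(\frac{3}{4}\right)^2 = \frac{9}{16}$, so in the population case $9$ of the $16$ inputs are negative and $7$ are positive.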

\paragraph{Neural Architecture:} We consider a \textbf{convex} one-hidden-layer neural network (NN) with $r$ hidden units and parameters $\mth = (\mW, \vb,c) \in \R^{rD} \times \R^r \times \R$, defined by:
\be
\label{eq:network}
N(\vx; \mth) = \sum\limits_{i \in [r]} \sigma(\vw_i \cdot \vx + b_i) + c   
\ee
where $\sigma(x) = \max\{0, x\}$ is the ReLU function, $\vw_i$ is the $i\ts{th}$ row of $\mW$ and $b_i$ is the $i\ts{th}$ entry of $\vb$. We also use a scalar trainable bias $c\in \sR$ in the second layer to allow for negative outputs.

The resulting network is positive homogeneous in its parameters, and thus recent results on such networks can be applied \citep{lyu2019gradient, ji2020directional}. Note that the network is a convex function of its weights because each hidden unit composes the convex ReLU with a function that is affine in $(\vw_i, b_i)$, and a sum of convex functions is convex \citep{amos2017input}.
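
To see the homogeneity explicitly, recall that $\sigma(\alpha z) = \alpha \sigma(z)$ for any $\alpha > 0$. Scaling all parameters by such an $\alpha$ in \eqref{eq:network} therefore gives
\[
N(\vx; \alpha\mth) = \sum\limits_{i \in [r]} \sigma(\alpha \vw_i \cdot \vx + \alpha b_i) + \alpha c = \alpha \left( \sum\limits_{i \in [r]} \sigma(\vw_i \cdot \vx + b_i) + c \right) = \alpha N(\vx; \mth),
\]
that is, the network is positively homogeneous of degree $1$ in $\mth$. Note that this relies on the biases $\vb$ and $c$ being scaled together with $\mW$.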

\paragraph{Loss Minimization:} To learn $f^*$ we consider minimizing the following loss:
\be
\label{eq:loss}
L(\mth) = \frac{1}{m}\sum\limits_{(\vx, y) \in \sS} \ell\left(yN(\vx;\mth)\right)
\ee
where $\ell(z) = \log\left(1+e^{-z}\right)$ is the binary cross entropy loss. We note that $L(\mth)$ is generally non-convex (even though the network $N$ is convex). For our theoretical analyses we consider Gradient Flow (GF). We denote the initialization of GF by $\mtht{0} = \left(\wmat{0},\bvec{0},\cscal{0}\right)$ and the weights at time $t$ by $\mtht{t} = \left(\wmat{t},\bvec{t},\cscal{t}\right)$. If the time index is clear from context we omit it and use $\mth = (\mW, \vb,c)$.
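
Regarding the non-convexity of $L$ noted above, the following sketch indicates where it can arise. For a negative sample, $\ell(-N(\vx;\mth)) = \log\left(1+e^{N(\vx;\mth)}\right)$ is a non-decreasing convex function of the convex function $N(\vx;\cdot)$, and is hence convex in $\mth$. For a positive sample, $\ell(N(\vx;\mth))$ is a decreasing convex function of a convex function, which need not be convex. As a toy illustration (with $r = D = 1$, $x = 1$ and $b = c = 0$, chosen only for simplicity), let $g(w) = \ell(\sigma(w))$. Then
\[
g(-1) = g(0) = \log 2 \approx 0.693, \qquad g(1) = \log\left(1 + e^{-1}\right) \approx 0.313,
\]
so $g(0) > \frac{1}{2}\left(g(-1) + g(1)\right) \approx 0.503$, which violates convexity.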

Recall that gradient flow is the infinitesimal step-size limit of gradient descent, where $\mtht{t}$ changes continuously in time and satisfies the differential inclusion $\frac{d\mtht{t}}{dt} \in -\partial^{\circ}L(\mtht{t})$ for a.e. $t$. Here $\partial^{\circ}L\left(\mtht{t}\right)$ is the Clarke sub-differential, which generalizes the gradient to non-differentiable functions:
\begin{align*}
\label{eq:clarke}
\partial^{\circ} f(\vx) = \text{conv}\Big\{  \lim_{k \rightarrow \infty} \nabla f(\vx_k) \ds |&  \ds  \vx_k \rightarrow \vx \text{ and} \numberthis \\ 
&  \text{$f$ is differentiable at } \vx_k \Big\}    
\end{align*}

The differential inclusion allows any vector in $-\partial^{\circ}L(\mtht{t})$ to be taken at each time $t$ of gradient flow. In our case, the differential inclusion has multiple possible values when the input to a ReLU is 0, since ReLU has multiple sub-gradients at 0. For our theoretical results, we will assume that the subgradient of ReLU at $0$ is fixed in advance to a value $a \in [0,1]$. This value is used for all neurons and at all times. Usually $a$ is set to be either $0$ or $1$. This assumption corresponds to the common way gradient descent is run in practice. We provide a formal definition of this assumption in the supplementary material.
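
For concreteness, the definition of the Clarke sub-differential can be applied to the ReLU itself. Since $\sigma'(x) = 1$ for $x > 0$ and $\sigma'(x) = 0$ for $x < 0$, the only possible limits of $\sigma'(x_k)$ along sequences $x_k \rightarrow 0$ of differentiability points are $0$ and $1$, and hence
\[
\partial^{\circ} \sigma(0) = \text{conv}\left\{0, 1\right\} = [0,1].
\]
Fixing $a \in [0,1]$ as above thus amounts to selecting a single element of $\partial^{\circ} \sigma(0)$ once and for all.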

Next, we define solutions which perfectly classify the training set. We will consider solutions with a margin constraint, since this will be convenient when we discuss minimum norm solutions.

\begin{defn}
\label{def:perfect}
We say that a solution $\mth$ is \perfect\ if for all $(\vx,y) \in \sS$, $y N(\vx; \mth)  \geq 1$.
\end{defn}

\paragraph{Norm Minimization:}
Multiple recent works have highlighted interesting connections between gradient methods and norm minimization or margin maximization \citep{lyu2019gradient, neyshabur2018towards,nacson2019lexicographic, ji2020directional, chizat2020implicit}. The norm minimization problem is to minimize the norm of the model weights subject to correct classification with a margin (for additional background see \citealp{boyd2004convex}). Namely, the problem is:
\begin{equation}
\begin{array}{ll}
  \min  & \sum_{i \in [r]} \normtwo{(\vw_i, b_i)} +c^2  \\
  \mbox{s.t.} &  y N(\vx; \mth)  \geq 1 \ , \ \forall (\vx,y)\in \sS
\end{array}
\label{eq:maxmargin}
\end{equation}
It was shown \citep{lyu2019gradient,ji2020directional} that, under certain conditions, gradient flow converges in direction to KKT points of the optimization problem in \eqref{eq:maxmargin}.

\input{expressive_power}

