\section{Empirical Results}
\label{sec:empirical}

In this section we report experiments that support our analysis and indicate that our conclusions hold in several different settings. Due to space constraints, we show only a sample of the empirical results for each experiment; further details and results are provided in the supplementary material.


{\bf Comparing convex and standard networks:} Our analysis focused on a convex network. Here we compare it to a standard two-layer network with a trainable output layer. Figure \ref{fig:comparison} reports the results, showing that the convex network outperforms the standard one for a DNF with $D=25$. This suggests that the convex network is a good model for studying inductive bias in our setting.

{\bf Comparing large and small initializations:} In Figure \ref{fig:comparison} we compare a convex network with small initialization to a convex network with large initialization, where the latter is analogous to training in the NTK regime \citep{chizat2019lazy}. The network with small initialization performs better. Figure \ref{fig:cluster_readonce_memorizarion} further shows that the small-initialization network converges to a solution that aligns with the terms of the DNF, whereas the large-initialization network does not.

{\bf DNFs with large input dimension:} Here we report an experiment on learning a DNF with 15 terms of size 5 and $D=100$. We trained a convex network using SGD with small Gaussian initialization on 15,000 training samples drawn from the uniform distribution. Figure \ref{fig:cluster_readonce} shows that SGD converges to a solution that aligns with the terms of the DNF and attains 100\% test accuracy.
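
As a rough sanity check on the class balance in this setting, the fraction of positively labeled inputs can be estimated under two assumptions that are not stated explicitly for this experiment: that the 15 terms are over disjoint sets of variables (read-once), and that each literal is satisfied with probability $1/2$ under the uniform distribution. Writing $f$ for the target DNF (notation introduced here only for illustration), each term is then satisfied with probability $2^{-5}$, independently of the others, so
\[
\Pr_{x}\left[f(x)=1\right] \;=\; 1-\left(1-2^{-5}\right)^{15} \;=\; 1-\left(\tfrac{31}{32}\right)^{15} \;\approx\; 0.38 .
\]
Under these assumptions, roughly $38\%$ of uniformly drawn inputs would be labeled positive, so neither label dominates the sample.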

{\bf Non read-once DNFs:} Our theoretical work is restricted to read-once DNFs. To better understand what happens beyond the read-once case, we perform a series of experiments on learning DNFs with overlapping terms. \figref{fig:overlap} shows that the generalization error increases as the number of overlapping terms increases.

\figref{fig:cluster_overlap} shows an example of the learned neurons when learning a DNF with 4 overlapping terms. Here, the neurons do not align with the terms, and therefore the inductive bias differs from that of solutions that recover the DNF.

The above results suggest that when overlap is introduced between the terms of the target DNF, it becomes harder to recover the DNF and to generalize well. This observation is in line with the fact that the known polynomial bound for learning monotone read-$k$ DNFs \citep{mansour2001entropy} increases with $k$. Indeed, $k$ is the maximum number of times a variable can appear in the DNF, and a larger value indicates more overlap between the terms. Furthermore, known hardness results for learning general DNFs \citep{pitt1988computational} are also consistent with this empirical observation.
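
To make the notion of read-$k$ concrete, consider the following two small monotone DNFs. They are illustrative examples only, not the DNFs used in our experiments, and the names $f_1$ and $f_2$ are introduced here solely for this illustration:
\[
f_1(x) = (x_1 \wedge x_2) \vee (x_3 \wedge x_4), \qquad
f_2(x) = (x_1 \wedge x_2) \vee (x_2 \wedge x_3) \vee (x_2 \wedge x_4).
\]
In $f_1$ every variable appears in exactly one term, so $f_1$ is read-once ($k=1$). In $f_2$ the variable $x_2$ appears in all three terms while every other variable appears once, so $f_2$ is read-$3$ and each pair of its terms overlaps.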


{\bf Experiments on Tabular Datasets:} The fact that SGD recovers simple Boolean formulas is very attractive in the context of interpretability. In Section \ref{sec:empirical_observations} we showed that we can reconstruct DNFs under certain idealized assumptions (e.g., uniform distribution, read-once). However, our reconstruction method might produce meaningless reconstructions on datasets that are neither uniformly distributed nor labeled by a read-once DNF. We therefore tested our reconstruction method on three tabular UCI datasets: kr-vs-kp, diabetes, and Splice \citep{Dua:2019}. We note that these datasets do not contain personally identifiable information or offensive content.

Learning with our convex network resulted in test accuracies of 100\%, 98\%, and 97\% on these datasets, respectively. On kr-vs-kp, our reconstruction method obtained a small DNF (6 terms, each of size less than 4) with test accuracy 91\%. For diabetes, the reconstruction method returned a large DNF (more than 10 terms) with test accuracy 81\%. On Splice, we obtained a 2-term DNF with terms of sizes 2 and 3 and 95\% test accuracy. The latter is a very compact DNF with a very small loss in accuracy, illustrating the potential of DNF recovery for interpretability.


