
\section{Gradient Flow does not Memorize}

In this section, we prove that gradient flow (GF) does not converge to solutions that memorize training points. 
First we formally define a memorizing neuron and a memorization solution. 
\begin{defn}
\label{def:memorizing_neuron}
A neuron $i \in [r]$ is a {\em memorizing neuron} if there exists a sample $\hat{\vx} \in \sS_x$ such that
\be
\label{eq:memorization_condition}
\vw_i \cdot \hat{\vx} + b_i > 0 \quad \text{and} \quad \vw_i \cdot \vx + b_i \le 0 \:\: \text{for all } \vx \in \gX \setminus \{ \hat{\vx}\}.
\ee
In this case, we say that neuron $i$ {\em memorizes} $\hat{\vx}$.
\end{defn}

Thus, neuron $i$ memorizes $\hat{\vx}$ if $\hat{\vx}$ is the only point in $\gX$ whose pre-activation $\vw_i \cdot \vx + b_i$ is positive. Since the activation function of the network is ReLU, this means that $\hat{\vx}$ is the only point that activates neuron $i$.
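
As a concrete illustration, suppose that $\gX = \{\pm 1\}^D$ (an assumption made only for this example, since the input domain is not restated in this section) and consider the hypothetical neuron $\vw_i = \hat{\vx}$, $b_i = -(D-2)$. Then
\[
\vw_i \cdot \hat{\vx} + b_i = D - (D-2) = 2 > 0 ,
\]
while every $\vx \in \gX \setminus \{\hat{\vx}\}$ differs from $\hat{\vx}$ in at least one coordinate, so that $\hat{\vx} \cdot \vx \le D-2$ and
\[
\vw_i \cdot \vx + b_i \le (D-2) - (D-2) = 0 .
\]
Under this assumption, the neuron satisfies \eqref{eq:memorization_condition} and therefore memorizes $\hat{\vx}$.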

\begin{defn}
\label{def:memorization}
$\mth$ is a {\em memorization solution} if $\mth$ is perfect (Definition \ref{def:perfect}) and there exist a neuron $i \in [r]$ and a sample $\hat{\vx} \in \sS_x$ such that neuron $i$ memorizes $\hat{\vx}$.
\end{defn}

Note that the solution in \figref{fig:memorizetion_global_minimum} is a memorization solution in which all positive points in $\sS_+$ are memorized. Definition \ref{def:memorization} is broader: it only requires that at least one point is memorized.

We now state the assumptions that are required for our main result. We apply recent results of \citet{lyu2019gradient,ji2020directional}, which assume that GF is in the late phase of training. We therefore need the following assumption.
\begin{ass}
\label{assump:late_phase}
There exists a time $t_0$ such that $L\left(\mtht{t_0}\right) < \frac{\ln 2}{n}$.
\end{ass}
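
To illustrate what Assumption \ref{assump:late_phase} implies, suppose (as an assumption of this remark, since the loss is not restated in this section) that $L$ is the average logistic loss,
\[
L(\mth) = \frac{1}{n} \sum_{k=1}^{n} \ln\left(1 + e^{-q_k(\mth)}\right),
\]
where $q_k(\mth)$ is shorthand, introduced only here, for the margin of the network on the $k$-th training sample (the label times the network output). Every summand is positive, so $L\left(\mtht{t_0}\right) < \frac{\ln 2}{n}$ gives, for each $k$,
\[
\ln\left(1 + e^{-q_k(\mtht{t_0})}\right) \le n \, L\left(\mtht{t_0}\right) < \ln 2 ,
\]
and hence $e^{-q_k(\mtht{t_0})} < 1$, i.e., $q_k(\mtht{t_0}) > 0$. Under this reading, the assumption says that at time $t_0$ all training points are already classified correctly.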

The next theorem shows that GF cannot converge to memorization solutions.

\begin{thm}
\label{thm:non_converge_to_memorization}
Assume that Assumption \ref{assump:late_phase} holds and $D > 2$, $K \ge 2$. Let $\mth$ be a memorization solution. Then GF does not converge to $\mth$.
\end{thm}

The proof idea is as follows. \citet{lyu2019gradient} and \citet{ji2020directional} show that, under Assumption \ref{assump:late_phase}, GF converges to a KKT point of \eqref{eq:maxmargin}. KKT points must satisfy the KKT conditions, in particular stationarity and complementary slackness (see the supplementary material for details). We use this fact, together with the structure of the subgradient updates, to show that memorization solutions cannot satisfy the KKT conditions.

To show this, we first characterize the memorizing neuron using the following lemma:
\begin{lem}
\label{lem:main_paper_memorization_properties}
Assume that $D > 2$. Let $\mth$ be a solution with a neuron $i \in [r]$ that memorizes a sample $\hat{\vx} \in \sS_x$. Then $\mth$ satisfies the following properties:
\begin{enumerate}
    \item $\hat{x}_j = \sign(w_{ij})$ for all $1 \le j \le D$.
    \item For all $\vx \in \gX$, if $\vw_i \cdot \vx + b_i = 0$ then $\vx \cdot \hat{\vx} = D - 2$.
    \item $b_i < 0$.
\end{enumerate}
\end{lem}
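
One way to see why these properties hold is the following sketch, which relies on the additional assumption that $\gX = \{\pm 1\}^D$ (not restated in this section) and is meant as an illustration rather than as the full proof. For $S \subseteq [D]$, let $\vx^{S}$ denote $\hat{\vx}$ with the coordinates in $S$ flipped (notation introduced only for this sketch), so that $\vx^{S} \cdot \hat{\vx} = D - 2|S|$ and
\[
\vw_i \cdot \vx^{S} + b_i = \vw_i \cdot \hat{\vx} + b_i - 2 \sum_{j \in S} \hat{x}_j w_{ij} .
\]
Taking $S = \{j\}$, memorization requires $\vw_i \cdot \vx^{\{j\}} + b_i \le 0 < \vw_i \cdot \hat{\vx} + b_i$, which forces $\hat{x}_j w_{ij} > 0$; this is the first property. Consequently $\hat{x}_j w_{ij} = |w_{ij}|$ and $\vw_i \cdot \hat{\vx} = \|\vw_i\|_1 > 0$. Summing the $D$ constraints $\|\vw_i\|_1 - 2|w_{ij}| + b_i \le 0$ over $j$ gives
\[
(D-2) \|\vw_i\|_1 + D \, b_i \le 0 ,
\]
so $b_i < 0$ whenever $D > 2$, which is the third property. Finally, suppose that $\vw_i \cdot \vx^{S} + b_i = 0$ for some $S$ with $|S| \ge 2$. Removing any index $j$ from $S$ yields a point $\vx^{S \setminus \{j\}} \ne \hat{\vx}$ whose pre-activation equals $2|w_{ij}| > 0$, contradicting \eqref{eq:memorization_condition}. Since $\hat{\vx}$ itself has a positive pre-activation, the only remaining case is $|S| = 1$, i.e., $\vx \cdot \hat{\vx} = D - 2$, which is the second property.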

Then, using complementary slackness and the non-negativity of the slack variables, we show that the stationarity conditions for the weights and biases cannot hold. Thus, memorization solutions are not KKT points, and GF cannot converge to them.