% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent

\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)
\usepackage{hyperref}
\usepackage{smile}


\title{
Exact Count of Boundary Pieces of ReLU Classifiers: \\
Towards the Proper Complexity Measure for Classification}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Pawe\l~Piwek}
\author[2]{Adam~Klukowski}
\author[2]{\href{mailto:<hutianyang.up@outlook.com>?Subject=Your UAI 2023 paper}{Tianyang~Hu}}
% Add affiliations after the authors
\affil[1]{%
    University of Oxford\\ 
    \texttt{pawel.piwek@maths.ox.ac.uk}
}
\affil[2]{%
    Huawei Noah's Ark Lab\\
    \texttt{hutianyang1@huawei.com}
}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THEOREMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \theoremstyle{plain}
% \newtheorem{theorem}{Theorem}[section]
% \newtheorem{proposition}[theorem]{Proposition}
% \newtheorem{lemma}[theorem]{Lemma}
% \newtheorem{corollary}[theorem]{Corollary}
% \theoremstyle{definition}
% \newtheorem{definition}[theorem]{Definition}
% \newtheorem{assumption}[theorem]{Assumption}
% \theoremstyle{remark}
% \newtheorem{remark}[theorem]{Remark}

% Todonotes is useful during development; simply uncomment the next line
%    and comment out the line below the next line to turn off comments
%\usepackage[disable,textsize=tiny]{todonotes}
\usepackage[textsize=tiny]{todonotes}

% Added by adam
\usepackage{svg}
\usepackage[capitalize,noabbrev]{cleveref}

% \newtheorem{proposition}{Proposition}
% \theoremstyle{definition}
% \newtheorem{definition}{Definition}
% \newtheorem{eg}{Example}
% \theoremstyle{remark}
% \newtheorem*{note}{Note}

\renewcommand{\vec}[1]{\mathbf{#1}}
\newcommand{\x}{\vec{x}}
\newcommand{\y}{\vec{y}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\D}{\mathcal{D}}
\newcommand{\U}{\mathcal{U}}
\newcommand{\F}{\mathcal{F}}
\newcommand{\PU}{\mathcal{PU}}
\newcommand{\T}{\mathcal{T}}

\usepackage{xcolor}

\counterwithout{theorem}{section}
\crefname{example}{Example}{Examples}
\crefname{proposition}{Proposition}{Propositions}

\begin{document}

\maketitle




\begin{abstract}
Classic learning theory suggests that proper regularization is the key to good generalization and robustness. In classification, current training schemes only target the complexity of the classifier itself, which can be misleading and ineffective. 
Instead, we advocate directly measuring the complexity of the decision boundary. 
Existing literature is limited in this area with few well-established definitions of boundary complexity. As a proof of concept, we start by analyzing ReLU neural networks, whose boundary complexity can be conveniently characterized by the number of affine pieces. With the help of tropical geometry, we develop a novel method that can explicitly count the exact number of boundary pieces, and as a by-product, the exact number of total affine pieces. Numerical experiments are conducted and distinctive properties of our boundary complexity are uncovered. First, the boundary piece count appears largely independent of other measures, e.g., total piece count, and $l_2$ norm of weights, during the training process. Second, the boundary piece count is negatively correlated with robustness, where popular robust training techniques, e.g., adversarial training or random noise injection, are found to reduce the number of boundary pieces. 

\end{abstract}



\section{Background}

% \paragraph{Adversarial vulnerability} 
Despite deep learning's huge success in image classification, naturally trained deep classifiers are found to be adversarially vulnerable \citep{goodfellow2014explaining, goodfellow2016deep}. 
By adding a small perturbation (adversarial attack) to an image, which is almost imperceptible to humans, the neural network's predicted class can be arbitrarily manipulated. 
The prevalence of adversarial examples for state-of-the-art deep classifiers, even on small datasets such as CIFAR \citep{krizhevsky2009learning}, suggests overfitting, where decision boundaries of trained deep neural networks (DNNs) are \textit{overly complicated} and within a small distance of almost all the training instances. 
% The lack of classification robustness casts doubt on the trustworthiness of machine intelligence. 
% \paragraph{Definition of classification robustness}
Ideally, we want our model to generalize well on unseen data and be robust against small input perturbations, i.e., the prediction should not change much under small random noise. 
For regression, the requirement loosely translates to the smoothness of the predictor function. 
However, it becomes drastically different for classification, due to the discrete nature of class labels. 

The goal of classification is to recover the Bayes optimal decision boundary with the lowest misclassification rate (0-1 loss). 
% Robustness of a classifier can be generally measured by 
% \begin{equation}\label{eqn:dfn}
%     \PP_{\bx, \bdelta}\rbr{f_{\mathrm{pred}}(\bx) = f_{\mathrm{pred}}(x+\bdelta)}.
% \end{equation}
% In order to be robust, the decision boundary has to be distant to all of the data points. 
The decision boundary corresponds to a level set of the classifier, which is more difficult to control than the classifier itself. 
% It has been pointed out in literature that robustness may be at odds with accuracy or generalization error \citep{tsipras2018robustness}, especially if the classes are not well-separated. 
% The trade-off between accuracy and robustness is not generally considered a problem for deep learning and typical tasks in computer vision or natural language processing \cite{goodfellow2016deep}. 
As is often the case, especially in image classification, the classes can be thought of as separable with positive margins, i.e., the class labels have no randomness and images in different classes reside in non-overlapping regions with positive pairwise distances. 
In this case, there are infinitely many possible decision boundaries with zero misclassification error, but only some of them are robust with good generalization properties. 
Current training methods offer little control over the selection process and the resulting decision boundaries often turn out to be unsatisfactory. 
For natural data, it is commonly believed that an ideal decision boundary (e.g., a human's), which offers both good accuracy and robustness, should not be too complicated. 
In practice, how to effectively find such decision boundaries can be a real challenge. 

% When the optimal decision boundary is not uniquely defined, there are mainly two ways to induce uniqueness. One is by maximizing margin, one is by noise injection. 
% Margin maximization has deep connections with adversarial training \citep{ding2018mma} and adding noises is a common practice to improve generalization and robustness.  
% While max-margin decision boundary and noise induced decision boundary can be fundamentally different, they both offer some robustness guarantee and shouldn't be too complicated for natural data.



% \paragraph{Necessity of regularization} 
Let $\cF$ denote some function space.
In learning theory, the model complexity (how large $\cF$ is) is of critical importance, especially for model generalization and robustness \citep{vapnik1999overview, bousquet2002stability, james2013introduction}. 
Certain types of regularization are necessary to prevent over-complication and overfitting of the training data. The same is also true in deep learning, where modern networks are usually overparametrized. 
% However, regularization of deep models can be tricky and often not explicit. 
% Modern networks are usually overparametrized, with much more trainable parameters than the sample size, and empirically seem to work well without explicit regularization terms in the objective function. 
Various regularization techniques have been developed for training DNNs, e.g., weight decay, dropout \citep{srivastava2014dropout}, batch normalization \citep{ioffe2015batch}, early stopping \citep{prechelt1998early}, etc.
Though their regularization effects are largely implicit, a variety of implicit biases have been recently identified \citep{woodworth2019kernel, chizat2020implicit, razin2020implicit, hu2021regularization, ding2023random}. 
Nevertheless, without exception, all aforementioned types of regularization are on the \textit{functional} level, i.e., regularizing $\cF$ with respect to some complexity measurement. 
However, as we will point out in the next section, the complexity of $\cF$ itself is not of the most interest in classification. Instead, what matters the most are the \textit{level sets} of $\cF$.


\section{Proper Regularization for Classification}
For a function $f:\RR^d\to\mathbb{R}$, let
$\|f\|_\infty=\sup_{\bx\in\RR^d}|f(\bx)|$. 
%$L_p$ and $l_p$ are used to distinguish function norms and vector norms. 
Let $\PP$ be a probability measure on $\RR^d$ and denote $d_\triangle(G_1, G_2)= \PP(G_1\triangle G_2)= \PP\left((G_1\backslash G_2)\cup (G_2\backslash G_1)\right)$ as the measure of the symmetric difference of sets in $\RR^d$. 


Consider the binary classification setting where $\bx\in\RR^d$, $y\in\{-1, 1\}$. 
Let the conditional probability $\eta(\bx)=\PP(y=1|\bx)$. 
Given $\eta(\bx)$, the Bayes optimal decision rule is to assign label $1$ if $\eta(\bx)\ge 1/2$ and label $-1$ if $\eta(\bx) < 1/2$. If the two classes are separated (the supports of two class distributions are disjoint), $\eta$ is a piecewise constant function taking values only from $\{0, 1\}$.
The 0-1 loss is not friendly for optimization \citep{bartlett2006convexity}. Thus, various surrogate losses are employed in practice, e.g., cross-entropy, hinge loss, etc.
In statistics literature, there are two types of assumptions for classification \citep{audibert2007fast}, one on the conditional probability and the other on the decision boundary. 
Classifiers that estimate the conditional probability are usually referred to as ``plug-in'' classifiers, and it is worth noting that this approach essentially reduces classification to regression. In comparison, estimating the decision boundary is more fundamental \citep{hastie2009elements}. Hence, characterizing the decision boundary is of critical importance. 




\subsection{From Function Space to Level Set}
The goal of classification is to recover the Bayes optimal decision boundary, which divides the input space into non-overlapping regions with respect to labels.  
Therefore, classification is better thought of as \emph{estimation of sets} in $\RR^d$, rather than estimation of functions on $\RR^d$.
This is because the set difference reflects the 0-1 loss much more directly than functional norms on $\cF$. To be more specific, even if $f\in\cF$ approximates $\eta$ so well that $\|f-\eta\|_\infty \le 2\epsilon$, there is still no guarantee that the sign of $f(\bx)-1/2$ matches that of $\eta(\bx)-1/2$ close to the decision boundary. Consider a noisy scenario, where the label we observe is flipped relative to the true label with probability $\left(\tfrac{1}{2} - \epsilon\right)$. Then the misclassification rate of $f$ could be arbitrarily bad. 
In contrast, if we have a good estimation of the set $G^* = \{\bx\in \RR^d: \eta(\bx)\ge 1/2\}$ such that $d_\triangle(\hat{G}, G^*)\le \epsilon$, the misclassification probability can be directly bounded by $\epsilon$. 
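
As a simple illustration of this gap (a constructed example under the label-flipping scenario above, not a result from our experiments), suppose $\eta(\bx) = \tfrac{1}{2} + \epsilon$ on $G^*$ and $\eta(\bx) = \tfrac{1}{2} - \epsilon$ elsewhere, and take the hypothetical estimate $f = 1 - \eta$. Then
\[
    \|f - \eta\|_\infty = \sup_{\bx\in\RR^d} |1 - 2\eta(\bx)| = 2\epsilon,
    \qquad \text{while} \qquad
    f(\bx) - \tfrac{1}{2} = -\left(\eta(\bx) - \tfrac{1}{2}\right) \ \text{for every } \bx .
\]
Hence the plug-in rule based on $f$ disagrees with the Bayes rule at every point, no matter how small $\epsilon$ is, even though $f$ is uniformly within $2\epsilon$ of $\eta$.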


\begin{figure}[t]
    \centering
    \includegraphics[width=0.48\textwidth
    ]{figures/spiral}
    \caption{Illustration of a difficult classification task in $[-1,1]^2$ using ReLU classifiers. Two classes (blue and red) are separated. Among all the points, only 300 in each class are training samples, marked with \textbf{thickened} outline. The left figure is from regular training, achieving 99.65\% test accuracy; the right figure is from adversarial training, achieving 100\% test accuracy. The decision boundary on the right is more robust and noticeably less complicated. }
    \label{fig:spiral}
    \vspace{-5mm}
\end{figure}


In practice, the deep classifier is parametrized by a neural network $f\in\cF$ and the decision boundary is its \emph{level set}, $G_f:=\{\bx\in\RR^d: f(\bx)=0\}$, which is modeled \emph{implicitly}. 
Let $\cG = \{G_f: f\in\cF\}$. 
Notice that regularizing $f$ may have no effect on $G_f$ since the level set is invariant to scaling of $f$.
To be more specific, $f(\bx)$ and $\lambda \cdot f(\bx)$ have the same level set for any $\lambda > 0$, and as $\lambda\to 0$, most commonly used function norms $\|\lambda f\|$ tend to zero.
Hence, the complexity of $\cF$ and the complexity of $\cG$ may not be closely connected. 
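
For a concrete illustration (a minimal example; the symbols $\mathbf{w}$ and $b$ are introduced here only for this purpose), take the linear classifier $f(\bx) = \mathbf{w}^\intercal \bx + b$ with $\mathbf{w} \ne \mathbf{0}$. For every $\lambda > 0$,
\[
    G_{\lambda f} = \{\bx\in\RR^d : \lambda(\mathbf{w}^\intercal \bx + b) = 0\} = G_f,
    \qquad \text{whereas} \qquad
    \|(\lambda\mathbf{w}, \lambda b)\|_2 = \lambda\, \|(\mathbf{w}, b)\|_2 \to 0 \ \text{as } \lambda \to 0 .
\]
The parameter norm can thus be made arbitrarily small without moving the decision boundary at all.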
% The adversarial vulnerability may stem from the fact that existing training techniques on $\cF$ fail to effectively regularize the complexity of the estimated decision boundary.  

% \paragraph{Implicit regularization on decision boundary}
When explicit regularization is absent in training deep classifiers, one may hope the decision boundary complexity is implicitly regularized, either by the model architecture or by the training techniques. Unfortunately, this is not supported by empirical evidence in robust transfer learning \citep{shafahi2019adversarially}. Given an adversarially robust teacher model, e.g., one obtained from adversarial training, a student model trained only by vanilla knowledge distillation \citep{hinton2015distilling}, i.e., by fitting the teacher's input-output relationship, does not retain robustness, regardless of its size. 
To achieve comparable robustness, data augmentation on the input space such as mixup samples \citep{muhammad2021mixacm}, or matching intermediate features \citep{goldblum2020adversarially} seems indispensable. 
While matching the classifiers cannot transfer robustness, matching the decision boundary from teacher to student obviously can. 
From this perspective, various data augmentations can be viewed as regularization of the input space, on the decision boundary.   

Adversarial training, noise injection, and margin maximization can all be viewed as means of boundary regularization, pushing decision boundaries away from training samples. 
We show empirically that these methods lead to a significant reduction in boundary complexity, even though their design motivation was different.
Adversarial training can also be viewed as a special form of gradient regularization \citep{lyu2015unified}, or data-dependent operator norm regularization \citep{roth2019adversarial}. Among others, \citet{chan2019jacobian} proposed to directly regularize the saliency of the classifier's Jacobian to improve robustness. 
Adversarial robustness is also shown to improve by replacing the ReLU activation with smooth functions \citep{xie2020smooth}, and modifying the loss function \citep{pang2019rethinking, bao2020calibrated, hu2021understanding}. 
Although the classifier gradient is more related to boundary complexity, these types of regularization methods inspired by adversarial training are not directly targeting the decision boundary. 

In this work, we advocate that for classification, the proper complexity to regularize is the boundary complexity of $\cG$, rather than the functional complexity of $\cF$.
A complexity measurement directly targeting the decision boundary will better reflect classification properties and may be largely independent of known metrics on the function space. 

% \subsection{Relations to Existing Methods} 
% % \paragraph{Max margin and Adversarial Training}
% % Adversarial training is by far the most effective method for obtaining robustness for deep classifiers. It can be seen as a special way of noise injection \citep{he2019parametric} with connections to margin maximization \citep{ding2018mma, elsayed2018large}. 
% % These types of training methods all work on the input space and can have direct impact on the decision boundary.  
% % In comparison, regularizing the boundary complexity is more fundamental with deep roots in learning theory, which should serve as the baseline.  
% % As we argued before, current training lacks proper regularization on the boundary complexity and these methods seem to be a post-fix of the unregularized decision boundary. 
% Adversarial training, noise injection and margin maximization can all be viewed as means of boundary regularization, pushing decision boundaries away from training samples. Although not with the aim to reduce complexity, we show in simulations that they can lead to significantly reduced boundary complexity. 
% Adversarial training can be also viewed as a special form of gradient regularization \citep{lyu2015unified}, or data-dependent operator norm regularization \citep{roth2019adversarial}. Among others, \citet{chan2019jacobian} proposed to directly regularize the saliency of the classifier's Jacobian to improve robustness. 
% Adversarial robustness is also shown to improve by replacing ReLU activation with smooth functions \citep{xie2020smooth}, and modifying the loss function \citep{pang2019rethinking, bao2020calibrated, hu2021understanding}. 
% Although the gradient of classifiers is more related to boundary complexity, these types of regularization methods inspired by adversarial training is not directly targeting the decision boundary. 

% The level set of neural networks have been investigated in literature. Specifically, \citet{atzmon2019controlling} proposed to directly control the level set via a sampling-based method that can acquire points from decision boundary. 
% Even though \cite{atzmon2019controlling} also targeted the decision boundary of neural networks, its goal is not controlling its complexity, but to enlarge margins and reconstruct shapes.

% % \paragraph{Certifiable robustness may be too restrictive}
% % Another recent line of research has focused on certifiable robustness of classifiers with provable guarantees \citep{cohen2019certified, croce2019provable}. 
% % Most works try to compute or lower bound the minimum perturbation necessary to change the prediction for each test instance \citep{katz2017reluplex, tjeng2017evaluating, hein2017formal, raghunathan2018certified}. 
% % Theoretical guarantees are valuable, especially for safety-critical tasks. However, it may be too restrictive for general purpose and the certification takes a big toll on model performance and computation efficiency. 

% Regularization on boundary complexity and the aforementioned existing works are different perspectives of classification robustness that do not exclude each other. Proper regularization on the decision boundary may further improve the robustness of the state-of-the-art methods. 


\subsection{Measuring Boundary Complexity}
Now that we have established boundary complexity as the proper, yet missing regularization in classification, the next question is how to measure it.
Compared to functions, boundary complexity measurement is far less explored. 
In statistics literature, classification has been analyzed as a nonparametric estimation of sets problem where the convergence rate critically depends on the complexity of the hypothesis class and the estimator class \citep{mammen1999smooth}.
However, typical complexity measures, e.g., bracketing entropy, covering number, and Rademacher complexity, are defined for a whole class of sets and cannot evaluate a single set (decision boundary). 
For general classifiers, how to properly quantify the boundary complexity remains an open problem.
\cite{chen2019topological} utilized persistent homology to measure the topological complexity of decision boundaries. \cite{lei2022understanding} characterized boundary complexity by their variability with respect to data and algorithm randomness. \cite{yang2020boundary} proposed the concept of boundary thickness and demonstrated its relationship to classification robustness. However, the aforementioned characterizations of boundary complexity are highly abstract and not explicitly calculable.  

To this end, we specifically consider classifiers with rectified linear unit (ReLU) activation, whose decision boundary is piecewise linear, so that the boundary complexity can be conveniently characterized by the number of affine pieces, which is intuitive and visually accessible. 
In Figure \ref{fig:spiral}, the left decision boundary has 491 affine pieces while the right one has only 254.
As can be seen in the figure, the less complicated boundary generalizes better and is more robust. 

\begin{remark}[Boundary pieces]
    The count of boundary pieces of ReLU networks might be an overly simplified measure for classification problems, since it does not take the length of each piece or the overall structure of the boundary into consideration. However, it does offer unique benefits. Besides being intuitive and visually accessible, it also relates to the complexity of the ReLU network itself. It would be interesting to see the relationship between the count of boundary pieces and the total number of linear pieces during training. Other boundary complexities, e.g., boundary thickness, have no counterpart in the function space. 
\end{remark}

% \paragraph{ReLU Networks}
For ReLU neural networks, the structure of the affine pieces and, in particular, the number of distinct pieces have been objects of interest.
Sharp bounds (exponential with depth) on the maximum number of affine regions have been investigated \citep{montufar2014number}, demonstrating the benefit of deeper networks. 
\cite{hanin2019complexity} provided a framework to count the number of linear regions of a piecewise linear network.
A method for upper-bounding the number of affine regions \emph{locally} in a ball around a data point was developed in \cite{zhu2020bounding}. Interestingly, both the experiments of \cite{zhu2020bounding} on the local number of affine regions and ours on the global count of boundary pieces indicate a two-stage behavior during training.

In classification, we are interested in the boundary pieces (level set) more than in affine regions, and existing literature there is scarce. 
For counting, previous works only compute a \emph{superset} of the decision boundary and therefore give only upper bounds on the exact number (see Proposition 6.1 in \cite{zhang2018tropical} and \cite{alfarra2020decision}).
For linking the count to classification, to the best of the authors' knowledge, the only relevant work is \cite{hu2020sharp}, where a teacher-student classification setting is considered and upper bounds on boundary pieces (bracketing entropy) in ReLU classifiers are utilized to bound the generalization error. Interestingly, \cite{hu2020sharp} showed that when the student network is larger than the teacher, if the boundary complexity is not regularized, the 0-1 loss excess risk convergence rate will not be rate-optimal. 

As we illustrated before, a ReLU network and its level set may share little connection. Calculating the number of boundary pieces is a new and technically challenging problem. Although there might be other ways to characterize the boundary complexity, the boundary piece count does provide a valid starting point for this problem. 

\subsection{Contributions}

In this work, we study the boundary complexity of ReLU classifiers and investigate the number of affine pieces in the decision boundary. Our contributions are as follows.
\begin{itemize}
    \item 
    With the help of tropical geometry, we provide a novel explicit algorithm for counting the exact number of boundary pieces and affine regions of ReLU networks. In contrast to \cite{zhang2018tropical} and \cite{alfarra2020decision}, we do not require the weights to be integer-valued. Unlike the algorithm of \cite{zhu2020bounding}, which discards some information at each layer, our approach preserves a complete representation of a neural network's functional form.
    
    \item
     We empirically investigate the proposed boundary complexity during training and uncover two properties. 
     First, the boundary piece count is largely independent of other measures, e.g., total piece count and $l_2$ norm of weights, throughout the training process.
     Second, the boundary piece count is negatively correlated with robustness. Adversarial training and noise injection are found to have significant regularizing effects on boundary complexity. 
\end{itemize}




% In this work, we investigate classification robustness from the perspective of learning theory and regularization. 
% Firstly, we hypothesize that a potential reason for adversarial vulnerability is the lack of proper regularization. For classification robustness, the rightful complexity to regularize is the boundary complexity, which is currently missing in practice as well as literature.
% Secondly, as a starting point, we consider ReLU classifier, whose boundary complexity can be conveniently characterized by its number of linear pieces. With the help of tropical geometry, we develop a novel method that can explicitly calculate the number of boundary pieces. 
% Thirdly, through numerical experiments, we confirm our hypothesis that the boundary complexity has significant negative correlation with classification robustness and behaves quite uniquely during training comparing to various other complexity measurements. 

\section{Boundary Complexity of ReLU Networks}

% ReLU networks define piecewise linear functions. The structure of the linear pieces and, in particular, the number of distinct linear pieces have been objects of interest, providing insight into the \emph{complexity} of the function space that a given architecture parameterizes.


Several works \cite{alfarra2020decision, charisopoulos2018tropical, hertrich2021towards, maragos2021tropical, montufar2021sharp, trimmel2020tropex, zhang2018tropical} on this topic use ideas from \emph{tropical geometry}, an area of algebraic geometry studying surfaces over the max-plus semiring \cite{maclagan2009introduction}. The connection to ReLU networks comes from their being compositions of affine transformations and the rectified linear unit \(\sigma(x) = \max \{0,x\}\). This enables us to write the network as a difference of two convex piecewise-affine functions.
These, in turn, can be interpreted in a useful way in a \emph{dual space}, where affine functions are points and maximum functions correspond to upper convex hulls.
This interpretation allowed \citeauthor{zhang2018tropical} to reprove the best bounds for the largest possible number of affine regions a ReLU network with a given architecture may have.

% \subsection*{Our contribution}

% The contributions of this work are of theoretical nature, giving exact descriptions or frameworks for dealing with ReLU networks.

% Firstly, we give an algorithm for computing the \emph{tropical hypersurface} (region of the derivative discontinuity) of a convex piece-wise linear function. Other works\todo{Make it precise, list the other works.} just specify the ``shape" of the hypersurface and that its components are normal to the appropriate cells of the Newton's polytope subdivision.\todo{Reference Newton's polytope later in the paper where appropriate.}



%%%%%%%%%%%




This section expands on the tropical geometry perspective of ReLU networks. Our main theoretical result is a way to explicitly compute the zero set of a difference of two convex piecewise-affine functions---and therefore compute the exact count of boundary pieces of a ReLU network. 
To improve readability, we include the necessary preliminary results and rephrase them in consistent technical language.
The proofs are mostly omitted and can be found in the appendix.


% \subsection*{ReLU Networks}

We start with a proposition taken from \cite{magnani2009convex}.
\begin{proposition}\label[proposition]{CPLs}
    A function of the form
    \[f(\x) = \max_{i=1,\ldots,n}\{A_i\x+b_i\}\]
    is convex and piecewise-affine. Also, every convex piecewise-affine function with a finite number of linear pieces is of this form.
\end{proposition}

We abbreviate ``convex piecewise-affine'' to CPA and ``difference of convex piecewise-affine'' to DCPA. To be precise, by a ReLU network we mean a neural network where every activation function is the rectified linear unit.

\begin{proposition}\label[proposition]{DCPAs}
    Given any ReLU network, the function defined by it can be written as a DCPA function.
\end{proposition}
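
To illustrate the idea in the simplest case (a sketch for a single hidden layer only, not the general argument; the symbols \(v_j, \vec{w}_j, c_j, b, p, q\) are introduced here for illustration), consider \(f(\x) = b + \sum_{j} v_j\, \sigma(\vec{w}_j^\intercal \x + c_j)\). Grouping the neurons by the sign of \(v_j\) gives \(f = p - q\) with
\[
    p(\x) = b + \sum_{j:\, v_j > 0} v_j \max\{0, \vec{w}_j^\intercal \x + c_j\},
    \qquad
    q(\x) = \sum_{j:\, v_j < 0} |v_j| \max\{0, \vec{w}_j^\intercal \x + c_j\}.
\]
Both \(p\) and \(q\) are CPA, because a non-negative combination of maxima of affine functions is again a maximum of affine functions. For instance, in one dimension \(f(x) = \sigma(x) - \sigma(x-1)\) has \(p(x) = \max\{0, x\}\) and \(q(x) = \max\{0, x-1\}\).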

Conversely, \cite{ovchinnikov2002max} proved that any piecewise-affine function
with a finite number of linear regions
is a min-max polynomial in its component affine functions.
This implies that it can be written as a DCPA function
and hence represented by a ReLU network.

\subsection{Tropical Geometry}

In this section, we introduce the aforementioned
interpretation of CPAs in the \emph{dual space} \(\mathbf{D}\).
It may resemble a projective involution,
which makes it even more surprising that notions such as convex hull turn out useful.
We make no distinction between affine functions \(f : \x \mapsto \vec{a}^\intercal \x + b\)
and their graphs \(\{(\x, y) \in \R^{d+1}\ |\ y = f(\x)\}\).
Thus, we identify affine functions \(\R^d \rightarrow \R\)
with hyperplanes in \(\R^{d+1}\)
containing no vertical lines (\(\{\x_0\} \times \R \subseteq \R^{d+1}\)
for some \(\x_0 \in \R^d\));
this ambient \(\R^{d+1}\) will be called the \emph{real space}
and denoted \(\mathbf{R}\).

We make an effort to distinguish between \(\mathbf{R}\) and \(\mathbf{D}\),
as both are copies of \(\R^{d+1}\), which may cause confusion.

\begin{definition}
    We say that \((\x, y)\) lies above (the graph of) \(f\) when \(y > f(\x)\). We denote it by \((\x, y) \succ f\).
\end{definition}

\begin{definition}
    For an affine function \(f: \R^d \to \R\) given by \(f(\x) = \vec{a}^\intercal \x + b\),
    we define its \emph{dual} \(\cR^{-1}(f)\)
    as the point \((\vec{a},b)\in \R^{d+1} =: \mathbf{D}\).
    Accordingly, this \(\R^{d+1}\) will be called the \emph{dual space} and denoted \(\mathbf{D}\).
    Conversely, for a dual point \(\vec{c} = (\vec{a}, b) \in \mathbf{D}\),
    we define \(\cR(\vec{c})\) to be the affine function \(\x \mapsto \vec{a}^\intercal \x + b\)
    (i.e. a hyperplane in \(\mathbf{R}\)).
\end{definition}

As we will see from~\cref{duality},
\(\cR\) interchanges the relations of collinearity and concurrence,
extends to planes of any dimension,
and preserves orthogonality and sides of hyperplanes.
For consistency, we set:
\begin{definition}
    To a real point \(\vec{z} = (\x, y) \in \mathbf{R}\),
    we associate as its dual the following hyperplane in \(\mathbf{D}\)
    \[\cR^{-1} (\vec{z}) = (\vec{a} \mapsto (-\x)^\intercal \vec{a} + y).\]
    Conversely, to a dual hyperplane
    \(H = (\vec{a} \mapsto \x^\intercal \vec{a} + y) \subset \mathbf{D}\),
    we associate the real point \[\cR(H) = (-\x, y) \in \mathbf{R}.\]
\end{definition}

Note that the correspondence between dual hyperplanes and real points has an extra sign not present in the pairing of dual points with real planes.
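
As a quick sanity check of these conventions (a toy instance chosen here purely for illustration), take \(d = 1\) and the affine function \(f(x) = 2x + 1\), whose dual is the point \(\cR^{-1}(f) = (2, 1) \in \mathbf{D}\).
The real point \(\vec{z} = (3, 7)\) lies on the graph of \(f\), and its dual is the hyperplane
\[\cR^{-1}(\vec{z}) = (a \mapsto -3a + 7) \subset \mathbf{D}.\]
Evaluating at \(a = 2\) gives \(-3 \cdot 2 + 7 = 1\), so the dual point \((2, 1)\) lies on \(\cR^{-1}(\vec{z})\).
Without the extra sign we would obtain \(3 \cdot 2 + 7 = 13 \neq 1\), and incidence would not be preserved.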

\begin{proposition}\label[proposition]{duality}
    The duality \(\cR\) has the following properties:
    \begin{enumerate}
        \item 
        A dual point \(\vec{c} \in \mathbf{D}\)
        lies on a dual hyperplane \({H \subset \mathbf{D}}\)
        if~and~only~if the corresponding real hyperplane \(\cR(\vec{c}) \subset \mathbf{R}\)
        contains the point \(\cR(H) \in \mathbf{R}\).
        I.e.
        \[\vec{c} \in H \Leftrightarrow \cR(\vec{c}) \ni \cR(H).\]
        \item 
        Points of a dual \(k\)-dimensional plane \(F\) are precisely the duals of real hyperplanes containing some \((d-k)\)-dimensional real plane. We denote this common real \((d-k)\)-dimensional plane as \(\cR(F)\).
        \item 
        Duality is containment-reversing, i.e., \[F \subseteq G \Leftrightarrow \cR(F) \supseteq \cR(G)\] for dual planes \(F, G\), and analogously for \(\cR^{-1}\).
        \item 
        For any real hyperplane \(f\), the projection \(p(\cR^{-1}(f))\) of its dual \(\cR^{-1}(f)\) onto the first \(d\) coordinates is normal to its isolines \(\{\x\ |\ f(\x) = \text{const.}\}\).
        \item 
        A dual point \(\vec{c} \in \mathbf{D}\) lies above the graph of \(H \subset \mathbf{D}\)
        if and only if the real point \(\cR (H) \in \mathbf{R}\)
        lies below the graph of \(\cR(\vec{c}) \subset \mathbf{R}\).
        In symbols,
        \[\vec{c} \succ H \Leftrightarrow \cR(\vec{c}) \succ \cR(H).\]
        \item 
        Points \(\vec{c}, \vec{c}'\) that differ only in the \((d+1)\)-th coordinate
        (lie exactly above/below each other)
        correspond precisely to parallel planes (both under \(\cR\) and \(\cR^{-1}\)).
    \end{enumerate}
\end{proposition}

After one more definition, \cref{max-hull} establishes another property of the duality that is crucial to our framework.

\begin{definition}\label[definition]{upper-hull}
    Let \(S \subset \R^{d+1}\) be a finite set of points.
    The convex hull of \(S\) will be denoted \(\mathcal C(S)\).
    Furthermore, we will call
    the set of points
    \[\{(\vec{x}, y) \in \mathcal C(S)\ |\
    (\vec{x}, y + \epsilon) \not\in \mathcal C(S)\ \text{for any}\ \epsilon>0\}\]
    the \emph{upper hull} of \(S\) and denote it 
    \(\U(S)\).
    Finally, the set of vertices of \(\U(S)\)
    will be denoted \(\U^*(S)\).
\end{definition}

\begin{proposition}\label[proposition]{max-hull}
    Let \(S \subset \mathbf{D}\) be a finite set of points.
    Then, for every point \(\x \in \mathbf{D}\) lying below \({\U}(S)\),
    we have (in \(\mathbf{R}\))
    \[\cR(\x) \leq \max\{\cR(\vec{s})\ |\ \vec{s} \in {\U}(S)\},\]
    i.e. the affine function in \(\mathbf{R}\) dual to \(\x\)
    lies fully below the maximum of the affine functions whose duals lie on \({\U}(S)\).
\end{proposition}

\Cref{max-hull} gives us a useful correspondence:
each CPA function can be represented uniquely
as an upper convex hull in the dual space.
This also allows us to simplify the notation implicitly,
as illustrated in \Cref{eg-max-hull}.

\begin{example}\label[example]{eg-max-hull}
    Let us consider the function
    \[f(x) = \max\Big\{-x+3,\: -\tfrac{1}{2}x+2,\: \tfrac{1}{2}x,\: x-2,\: 0\Big\}.\]
    \cref{fig:eg-max-hull} draws it in both the real and dual space.
    \begin{figure}[ht]
        \centering
        \includesvg[width=0.48\textwidth]{graph0.svg}
        \caption{Real and dual diagrams in \cref{eg-max-hull}.}
        \label[figure]{fig:eg-max-hull}
    \end{figure}
    
    We can see that the points \((-\tfrac{1}{2}, 2), (0,0) \in \mathbf{D}\)
    corresponding to the functions \(y=-\tfrac{1}{2}x + 2\) and \(y=0\)
    lie respectively on and under the upper hull of the other points. This means that the functions \(y = -\tfrac{1}{2} x + 2\) and \(y=0\) never exceed the maximum of \(-x+3\), \(\tfrac{1}{2}x\), and \(x-2\), but \(y=-\tfrac{1}{2}x + 2\) matches it at some point (here, at \(x = 2\)).
    
    In particular, we can write the maximum using just three of the functions:
    \begin{align*}
       & \max\Big\{-x+3,\: -\tfrac{1}{2}x+2,\: \tfrac{1}{2}x,\: x-2,\: 0\Big\}\\
       &= \max\Big\{-x+3,\:  \tfrac{1}{2}x,\: x-2\Big\}
    \end{align*}
\end{example}
\medskip


\subsection{ReLU Networks in the Context of Tropical Geometry}

This section shows precisely how to generate the dual diagram of a function defined by a neural network.

Let us denote by \(F_l: \R^{d} \to \R^{w_l}\) the function defined by the network taking the input to the post-activation values on the \(l\)-th layer (here \(w_l\) is the width of the \(l\)-th layer). This means that
\[F_l(\x) = \mathbf{\sigma}(A_l F_{l-1}(\x)).\]

Let us assume that \(F_{l-1} = \cR(P_{l-1}) - \cR(N_{l-1})\) for \(P_{l-1}\) and \(N_{l-1}\) being \emph{vectors} (ordered tuples) of sets of points.
We want to write \(F_l = \cR(P_l) - \cR(N_l)\) for \(P_l\) and \(N_l\) computed in terms of \(P_{l-1}\) and \(N_{l-1}\). For this, we need to introduce some notation.

\begin{definition}
    Given  sets of points \(X, Y \subset \mathbf{D} \cong \R^{d+1}\), we define
    \begin{itemize}
        \item \(X \oplus Y = \{\x + \vec{y}\ |\ \x\in X, \vec{y}\in Y\}\) to be the \emph{Minkowski sum of \(X\) and \(Y\)};
        \item \(X \cup Y\) to be the standard union of \(X\) and \(Y\) as sets.
    \end{itemize}
\end{definition}
We also define these operations on vectors of sets of points to be the coordinate-wise operations. 
These have important interpretations in our correspondence.

In the following, for a finite set \(X \subset \mathbf{D}\)
we identify \(\cR(X)\) with the function \(\max\{\cR(\vec{x})\ |\ \vec{x} \in X\}\)
being a maximum of hyperplanes in \(\mathbf{R}\).

\begin{proposition}\label[proposition]{basic_corr}
    For any sets of points \(X, Y \subset \mathbf{D}\), we have
    \begin{itemize}
        \item \(\cR(X \cup Y) = \max\{\cR(X), \cR(Y)\}\);
        \item \(\cR(X\oplus Y) = \cR(X) + \cR(Y)\).
    \end{itemize}
\end{proposition}

\begin{proof}
    The first one is clear from the definition. For the second one, we have
    \begin{align*}
        &\max\{x_1, \ldots, x_n\} + \max\{y_1, \ldots, y_m\} \\
        &= \max\{x_1+y_1, x_1+y_2, \ldots, x_n+y_m\}.
    \end{align*}
\end{proof}
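
To see the second identity of \cref{basic_corr} at work, consider the following small instance with \(d = 1\) (the sets are chosen here only for illustration).
Let \(X = \{(1,0), (-1,0)\}\) and \(Y = \{(2,1), (0,0)\}\), so that \(\cR(X) = \max\{x, -x\}\) and \(\cR(Y) = \max\{2x+1, 0\}\). Then
\begin{align*}
    X \oplus Y &= \{(3,1),\: (1,0),\: (1,1),\: (-1,0)\},\\
    \cR(X \oplus Y) &= \max\{3x+1,\: x,\: x+1,\: -x\}.
\end{align*}
At \(x = 1\) both \(\cR(X) + \cR(Y)\) and \(\cR(X \oplus Y)\) equal \(4\), and at \(x = -1\) both equal \(1\).
Note also that \((1,0)\) lies directly below \((1,1)\), so the term \(x\) never attains the maximum.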

Now, we need to define matrix multiplication for vectors of sets of points.

\begin{definition}
    Given \(S\subset\mathbf{D}\),
    we define the scalar multiplication \(\lambda\cdot S\) in the usual way.
    For a vector \(X = (X_i)_{1 \le i \le n}\) of sets of points in the dual space
    and for an \(m \times n\) matrix \(A\)
    we define the \emph{Minkowski matrix product of \(X\) by \(A\)} through
    \[(A\otimes X)_i = \bigoplus_{j=1}^{n} A_{ij} \cdot X_j.\]
\end{definition}

Notice that we could run into problems with just using the Minkowski operations,
since as long as \(S\) has at least 2 points,
we will have \(2\cdot S \neq S \oplus S\).
However, if we restrict ourselves to the vertices of upper convex hulls
and to non-negative matrices, the operations are `well-behaved'.
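
For instance (a toy case for illustration), let \(S = \{(0,0), (1,0)\} \subset \mathbf{D}\) with \(d = 1\), so that \(\cR(S) = \max\{0, x\}\). Then
\[2 \cdot S = \{(0,0), (2,0)\}, \qquad S \oplus S = \{(0,0), (1,0), (2,0)\},\]
so the two sets differ. However, \((1,0)\) is the midpoint of the other two points and hence not a vertex of the upper hull, so
\(\U^*(2\cdot S) = \U^*(S \oplus S) = \{(0,0), (2,0)\}\).
Correspondingly, \(\max\{0, 2x\} = \max\{0, x, 2x\}\), because \(x\) always lies between \(0\) and \(2x\).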

\begin{proposition}\label[proposition]{basic-properties}
    For matrices \(A,B\) with non-negative values and vectors of sets of points \(X, Y_1, Y_2\),
    the following hold.
    \begin{itemize}
        \item \(\U^*\big((A+B)\otimes X\big) = \U^*\big((A \otimes X) \oplus (B \otimes X)\big)\);
        \item \(A \otimes (Y_1 \oplus Y_2) = (A\otimes Y_1) \oplus (A\otimes Y_2)\);
        \item \(AB\otimes X = A\otimes(B\otimes X)\);
        \item \(X \oplus (Y_1 \cup Y_2) = (X \oplus Y_1) \cup (X \oplus Y_2)\).
    \end{itemize}
\end{proposition}

This seems useful, but quite restrictive, since we need to operate with non-negative matrices. However, every matrix \(A\) can be written as a difference between its positive part and its negative part \(A = A^+ - A^-\), where both \(A^+\) and \(A^-\) are non-negative.

We also have an interpretation for the matrix multiplication, similar to \cref{basic_corr}.
Here, when passing a vector of sets of points to the operator \(\cR\),
we apply it coordinate-wise, obtaining a vector of maxima of affine functions.

\begin{proposition}\label[proposition]{matrix_otimes_points}
    Given a vector \(X\) of sets of points in \(\mathbf{D}\)
    and a non-negative matrix \(A\), we have
    \[A\: \cR(X) = \cR(A\otimes X).\]
\end{proposition}
\begin{proof}
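    For each row index \(i\), using \(A_{ij} \ge 0\) and \cref{basic_corr}, we have
    \begin{align*}
        [A\:\cR(X)]_i &= \sum_j A_{ij}[\cR(X)]_j = \sum_j \cR(A_{ij} \cdot X_j) \\
        &= \cR\Big(\bigoplus_j A_{ij} \cdot X_j\Big) = \cR([A\otimes X]_i) \\
        &= [\cR(A\otimes X)]_i.
    \end{align*}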
\end{proof}
We can now characterise the function \({F_l = \cR(P_l) - \cR(N_l)}\)
in terms of vectors of points \(P_{l-1}\) and \(N_{l-1}\).
\begin{proposition}\label[proposition]{explicit}
    Assume that
    \(F_l = \sigma(A_l\: F_{l-1})\) and \(F_{l-1} = \cR(P_{l-1})-\cR(N_{l-1})\).
    Then, after writing \({A_l = A_l^+ - A_l^-}\),
    we get \(F_l = \cR(P_l) - \cR(N_l)\) for
    \[N_l = (A_l^-\otimes P_{l-1}) \oplus (A_l^+\otimes N_{l-1})\]
    \[\textrm{and} \quad P_l =\big((A_l^+\otimes P_{l-1})\oplus(A_l^-\otimes N_{l-1})\big) \:\cup\: N_l.\]
\end{proposition}
Proposition~\ref{explicit} is the key to our counting algorithm. Given a neural network, we apply it to all the layers successively and, in the end, obtain a representation of the network as a DCPA function. Given this DCPA form, we can use Proposition~\ref{new-hard} and Corollary~\ref{corr_affine_pc_count} to count the boundary pieces and the affine pieces, respectively.
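
The following short sketch indicates why the formulas of Proposition~\ref{explicit} hold, using only \cref{basic_corr} and \cref{matrix_otimes_points}; it is meant as a reading aid rather than a complete proof, and the auxiliary symbol \(Q_l\) is introduced only for this sketch. Write \(Q_l = (A_l^+\otimes P_{l-1})\oplus(A_l^-\otimes N_{l-1})\). Then
\begin{align*}
    A_l F_{l-1}
    &= (A_l^+ - A_l^-)\big(\cR(P_{l-1}) - \cR(N_{l-1})\big)\\
    &= \big(A_l^+\cR(P_{l-1}) + A_l^-\cR(N_{l-1})\big) \\
    &\quad - \big(A_l^-\cR(P_{l-1}) + A_l^+\cR(N_{l-1})\big)\\
    &= \cR(Q_l) - \cR(N_l),
\end{align*}
and, since \(\sigma(u - v) = \max\{u, v\} - v\) coordinate-wise,
\[F_l = \max\{\cR(Q_l), \cR(N_l)\} - \cR(N_l) = \cR(Q_l \cup N_l) - \cR(N_l).\]
As a toy instance with \(d = 1\), take the one-coordinate vectors \(P_0 = (\{(1,0)\})\) and \(N_0 = (\{(0,0)\})\), so that \(F_0(x) = x\), and the \(1 \times 1\) matrix \(A_1 = (-1)\), so that \(A_1^+ = (0)\) and \(A_1^- = (1)\). The formulas give
\begin{align*}
    N_1 &= 1\cdot\{(1,0)\} \oplus 0\cdot\{(0,0)\} = \{(1,0)\},\\
    P_1 &= \big(0\cdot\{(1,0)\} \oplus 1\cdot\{(0,0)\}\big) \cup N_1 \\
        &= \{(0,0), (1,0)\},
\end{align*}
and indeed \(\cR(P_1) - \cR(N_1) = \max\{0, x\} - x = \max\{-x, 0\} = \sigma(-x)\).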



\subsection{Tropical Hypersurfaces}\label[section]{duals}

In this section, we explore the regions into which a CPA function partitions its domain \(\R^d\); this partition is called the \emph{tessellation} of the CPA. We define it formally below.

\begin{definition}
    Given a CPA
    \[F(\x) = \max\{f_1(\x), \ldots, f_n(\x)\}\]
    where \(f_i\) are affine functions, an \emph{affine region of \(F\)} is
    \[\Big\{\x \in \R^d \: \Big| f_i(\x) = f_{i'}(\x) > f_j(\x) \textrm{ for all } i, i' \in I,  j \in J \Big\},\] 
    where \(I, J\) are disjoint sets whose union is \(\{1, \dots, n\}\).
    Its \emph{dimension} is the smallest dimension
    of an affine subspace of \(\R^{d}\) containing it.
    The set of all regions of dimension \(k\) (\(k\)-cells)
    will be denoted as \(\T_k(F)\), and \(\T(F) = \bigcup_k \T_k(F)\).
\end{definition}
For a set of points \(S\) in the dual space, we denote by \(\T(S)\)
the tessellation of \(\cR(S)\).
For example, \(\T_0(S)\) is the set of all vertices of \(\T(S)\), and
\(\T_1(S)\) is the set of all its lines, rays and segments.

\begin{proposition}\label[proposition]{scary}
    \(k\)-cells of \(\T(S)\) are in one-to-one correspondence with \((d-k)\)-cells of \(\U(S)\).
    Each \(k\)-cell \(\sigma\) of \(\T(S)\) is of the form
    \[p(\cR(\text{dual planes tangent to }\U(S)\text{ containing }\sigma')),\]
    where \(\sigma'\) is a \((d-k)\)-cell of \(\U(S)\),
    and \(p : \R^d \times \R \rightarrow \R^d\) is the projection onto first \(d\) coordinates.
\end{proposition}
By \(H\) being \emph{tangent} we mean that the whole of \(\U(S)\) lies \underline{under or on} \(H\)
and that \(H \cap \U(S) \ne \emptyset\).
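
To illustrate \cref{scary}, we can revisit \cref{eg-max-hull}, where \(d = 1\) (the computation below is only a sanity check).
The upper hull has the vertices \((-1,3)\), \((\tfrac{1}{2},0)\) and \((1,-2)\), which correspond to the three \(1\)-cells \((-\infty,2)\), \((2,4)\) and \((4,\infty)\) on which \(f\) equals \(-x+3\), \(\tfrac{1}{2}x\) and \(x-2\), respectively.
Its two edges lie on the tangent dual lines
\[a \mapsto -2a + 1 \quad\text{and}\quad a \mapsto -4a + 2,\]
whose duals are the real points \((2,1)\) and \((4,2)\).
Projecting onto the first coordinate gives the two \(0\)-cells \(x = 2\) and \(x = 4\), which are exactly the breakpoints of \(f\), with \(f(2) = 1\) and \(f(4) = 2\).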

% \todo{Do an example here.}

%%%%%%%%%%%%

\subsection{Decision boundary}
 Let \(F = \cR(P)\) and \(G = \cR(N)\) be CPA functions \(\R^d \to \R\). We are interested in being able to describe the zero set \(D\) of a DCPA function \(F-G\). 
The proposition below expands on the idea of Proposition 6.1 of \citet{zhang2018tropical}.

\begin{proposition}\label[proposition]{new-easy}
    Assume that no point of \(P\) lies on \(\U(N)\) and vice versa.
    Then the set \(D\) is precisely the union of those \({(d-1)}\)-dimensional cells of \({\T(P\cup N)}\) that correspond to edges of \(\U(P\cup N)\) with one end in \(P\) and the other end in \(N\).
\end{proposition}

This means that to draw the decision boundary, all we have to do is draw the hypersurface \(\T(P\cup N)\) and identify which cells come from the intersection of the graphs of \(\cR(P)\) and \(\cR(N)\).
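
As a toy illustration of \cref{new-easy} (an instance chosen here for concreteness), let \(d = 1\), \(P = \{(1,0), (-1,0)\}\) and \(N = \{(0,1)\}\), so that \(F(x) = \max\{x, -x\} = |x|\) and \(G(x) = 1\).
No point of \(P\) lies on \(\U(N) = \{(0,1)\}\), and \((0,1)\) does not lie on \(\U(P)\), which is the segment joining \((-1,0)\) and \((1,0)\).
The upper hull \(\U(P \cup N)\) consists of two edges, joining \((-1,0)\) to \((0,1)\) and \((0,1)\) to \((1,0)\); each has one end in \(P\) and the other in \(N\).
These edges lie on the dual lines
\[a \mapsto a + 1 \quad\text{and}\quad a \mapsto -a + 1,\]
whose duals are the real points \((-1,1)\) and \((1,1)\).
Projecting onto the first coordinate gives \(D = \{-1, 1\}\), which is indeed the solution set of \(|x| = 1\).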

% \begin{eg}
%     \todo{Put an example here.}
% \end{eg}

\Cref{new-easy} deals with the case most likely to occur in general, but it is possible that some points of \(P\) lie on \(\U(N)\) or vice versa. \Cref{new-hard} also covers this more difficult case. We compute the boundary count of a neural network by applying Proposition~\ref{new-hard} to the DCPA representation of the network (from Proposition~\ref{explicit}).
% \idea{Maybe we don't actually need a separate proposition for that?}

\begin{proposition}\label[proposition]{new-hard}
    Let \(F = \cR(P), G = \cR(N)\) be CPA functions. Then the zero set \(D = \{\x \in \R^d \: |\: F(\x) = G(\x)\}\) consists precisely of those cells of \(\T(P \cup N)\) that correspond to the cells of \(\U(P\cup N)\) containing points from both \(P\) and \(N\).
\end{proposition}

\subsection{Affine pieces}

Our formalism also allows us to count the exact total number of affine pieces. To do this for a neural network, we apply Corollary~\ref{corr_affine_pc_count} to the DCPA form obtained from Proposition~\ref{explicit}.

\begin{corollary}\label{corr_affine_pc_count}
    The number of affine pieces (\(d\)-cells) of a DCPA function \(\cR(P) - \cR(N)\) is equal to the number of vertices of \(\U(P \oplus N)\).
\end{corollary}
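
For example (a toy instance for illustration), with \(d = 1\) let \(P = \{(1,0), (0,0)\}\) and \(N = \{(1,-1), (0,0)\}\), so that
\[\cR(P) - \cR(N) = \max\{x, 0\} - \max\{x-1, 0\},\]
which has three affine pieces: it equals \(0\) on \((-\infty, 0)\), \(x\) on \((0,1)\), and \(1\) on \((1, \infty)\).
On the dual side,
\[P \oplus N = \{(2,-1),\: (1,0),\: (1,-1),\: (0,0)\},\]
and \(\U(P \oplus N)\) has exactly three vertices, \((0,0)\), \((1,0)\) and \((2,-1)\); the remaining point \((1,-1)\) lies directly below \((1,0)\).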

Corollary~\ref{corr_affine_pc_count} is a special case of a more general result stated below.

\begin{proposition}\label{DCPL_affine_pieces}
    Each \(k\)-cell \(\sigma\) of \(\cR(P) - \cR(N)\) is of the form \[\sigma \! = \! p(\cR(\text{hyperplanes tangent to }\U(P \oplus N)\text{ containing }\sigma'))\] where \(\sigma'\) is a \((d-k)\)-cell of \(\U(P \oplus N)\). The correspondence \(\sigma \leftrightarrow \sigma'\) is bijective.
\end{proposition}
To the best of the authors' knowledge, this explicit formula for counting the total number of affine pieces has not been spelled out in existing literature, where the scaling of the count with respect to neural network structures is usually the focus. 
%\todo{Degeneracy ...}

% \begin{remark}[Computation complexity]
%     Roughly the scaling with network size and input dimension.
% \end{remark}

% There are two key factors governing the difficulty of a classification problem: 1) The complexity of the decision boundary, i.e., the set $\cG^*$ where the optimal $G^*$ resides; 2) How separated the data are around the decision boundary. 
% This two rules are well-established in statistics literature with sharp analysis. To illustrate, the boundary complexity is usually characterized by the bracketing entropy of $\cG^*$ with parameter $\rho>0$ (the larger the $\rho$, the larger the hypothesis class) and the separation is usually characterized by Tsybakov's noise condition \citep{mammen1999smooth} with parameter $\kappa$ (the larger the $\kappa$, the more separated). The optimal convergence rate for the 0-1 loss excess risk is shown to be $O(n^{-\frac{\kappa+1}{\kappa+2+\kappa\rho}})$.

\begin{remark}
    In ReLU neural networks it is possible to have a degenerate situation, where on two regions the network computes the same affine function, but these regions differ in activation patterns. Our approach will see such regions as separate. We do not know of any literature where this would be treated differently.
\end{remark}

\section{Numerical Experiments}
In this section, as a proof of concept, we conduct numerical experiments on 2D synthetic data. The aim of this section is two-fold. 
Firstly, we compare the proposed boundary complexity (\#Boundary) to various other complexity measurements, e.g., the total number of affine pieces (\#Total), the sum of weights squared (F-norm), and evaluate their trends during training. 
The results show that our boundary complexity behaves differently from these other measures during training. 
Secondly, we demonstrate a negative correlation between the number of boundary pieces and classification robustness, where popular robust training methods, specifically noise injection and adversarial training, can both diminish the number of boundary pieces.

We choose ReLU neural networks with 2 hidden layers of different widths across all our simulations.
Three training schemes are considered: regular training with cross-entropy (CE), CE with Gaussian noise injection (Noisy), and CE with $\ell_\infty$-adversarial training by fast gradient sign attacks \citep{goodfellow2014explaining} (Adv). 
Two synthetic datasets are constructed in 2-dimensional space: one is a 3-by-3 Gaussian mixture (Figure \ref{fig:gaussian}) and the other is spiral-shaped (Figure \ref{fig:spiral}). The Gaussian case provides a baseline, while the spiral case is much more challenging and may better reflect complicated data structures in practice. 
To measure robustness, we use Gaussian random noise injection with standard deviation $\sigma$. 2000 test points are used to approximate the expectation, and this empirical robustness measure is denoted (in percent) by $R(\sigma)$.
% For the real data, we consider the MNIST dataset, which is 784-dimensional and the exact count is prohibitively computationally intensive. Thus, we investigate random 2-dimensional slices.

\begin{figure}[ht]
    \centering
    \includegraphics[width=0.48\textwidth
    ]{figures/gaussian}
    \caption{Decision boundaries in the $3\times 3$ Gaussian mixture case in $[-2, 2]^2$. From left to right are instances of CE (\#Boundary=46), Noisy (\#Boundary=41), Adv (\#Boundary=40), respectively. }
    % \vspace{-3mm}
    \label{fig:gaussian}
\end{figure}


The quantities at initialization are shown in Table \ref{tab:robust1} and Table \ref{tab:robust2}. 
We can see that the initial \#Boundary is usually much smaller than after training, with larger variations. This is to be expected, as the boundary is only a level set of the initialized classifier, which can be very sensitive to constant shifts.
The initial \#Total is usually larger than after training. This is interesting and indicates that the initial classifier is more random in terms of linear region arrangement.
Like \#Boundary, the F-norm at initialization is much smaller than after training, but with much smaller variations. This is to be expected, as the F-norm is directly linked to the initialized weights.


\subsection{Trends During Training}
Across tasks, the overall trend of \#Boundary is to first increase, then decrease, and finally stabilize. 
Similar behaviors can also be observed for \#Total and F-norm during training, but their movements are not synchronized. 
Across training methods, the overall trends share more similarities than differences; the main exception is whether weight decay is used. 
Typical instances are shown in Figures \ref{fig:trend} and \ref{fig:trend2}. 

\begin{figure}[ht]
    \centering
     \includegraphics[width=0.49\textwidth
    ]{figures/trend0}
    \caption{Training trends of \#Boundary (red), \#Total (green), F-norm (red) vs. iteration in the 2D spiral case. Left: CE with weight decay; Right: CE without weight decay. }
    \label{fig:trend}
\end{figure}

\begin{figure}[ht]
    \centering
     \includegraphics[width=0.49\textwidth
    ]{figures/trend1}
    \caption{Training trends of \#Boundary (red), \#Total (green), F-norm (red) vs. iteration in the 2D spiral case. Left: Noisy with weight decay; Right: Adv with weight decay. }
    \label{fig:trend2}
\end{figure}

% \paragraph{\#Boundary vs others}
\paragraph{\#Boundary vs others.}
% For ReLU networks, \#Total and \#Boundary are a perfect example of function complexity and boundary complexity. The two quantities are related but can behave differently during training. 
The left panel of Figure \ref{fig:trend2} shows the typical trends in the Noisy case with weight decay, where we can clearly see that \#Boundary lags behind the others. 
When the training starts, F-norm and \#Total peak much earlier than \#Boundary.  
In most cases, we observe that F-norm peaks first, then \#Total, and lastly \#Boundary. 
When robust training is applied (Noisy, Adv), the gaps among them widen. 
In the later stage, F-norm stabilizes much faster than the others, while we consistently observe that \#Boundary flattens more slowly than \#Total. Overall, \#Boundary appears to change much more slowly than the others, taking more time to peak and more time to plateau. 


\paragraph{Role of weight decay.}
The right figure in Figure \ref{fig:trend} shows a typical trend in the CE case without weight decay, which demonstrates drastically different behaviors. 
\#Boundary and \#Total plateau much earlier and do not change much once the classifier has overfit the training data. In comparison, the F-norm keeps growing, which is to be expected due to the use of cross-entropy loss. 
Weight decay is thus found to play an important role in the formation of ReLU networks' geometric structures. This is surprising, as naively shrinking a ReLU network does not change its affine piece arrangement.


\subsection{Classification Robustness}
% To evaluate the correlation of different complexity measurement and robustness, we replicate the CE experiment in the 2D spiral task for 20 times. For each run, we evaluate the classifier's robustness to Gaussian noise injection. In view of \ref{eqn:dfn}, $\bdelta\sim N(0, \sigma^2\bI_2)$ and we sample 2000 test samples to approximate the probability.  
% The Pearson's correlation is reported in Table \ref{tab:corr}. (depending on how significant is the result, this part may be removed...) 

% \begin{table}
%     \centering
%     \begin{tabular}{l|cc}
%     \hline
%      & Robustness & \#Boundary  \\
%     \hline
%         \#Boundary &   & 1\\
%         \#Total    & 0.35 (0.31) &\\
%         F-norm    & -0.27 () & \\
%         Accuracy &  () & \\
%       \hline
%     \end{tabular}
%     \caption{Pearson's correlation. The p-value is reported in the parenthesis.}
%     \label{tab:corr}
% \end{table}

In this section, we aim to investigate the relationship between robustness and \#Boundary. However, in the absence of practical algorithms to regularize the boundary complexity, we turn to popular robust training methods and evaluate whether they can significantly reduce \#Boundary. 
Results for the Gaussian mixture and spiral case are reported in Table \ref{tab:robust1} and Table \ref{tab:robust2}, respectively. 
% For measuring robustness, $\sigma$ is chosen to be 2 times the robust train

% \paragraph{Square Loss}
% It has been reported in \citep{hu2021understanding} that using SL instead of CE can improve the classification robustness. 
% Under the same training setting, we observe that the SL tends to produce less boundary pieces, as well as total number of affine pieces. 

% \paragraph{Noise injection}
\begin{table}
    \caption{Comparison of boundary piece counts in the Gaussian mixture case for ReLU network with layer widths 2-10-10-1. The reported number is an average (standard deviation) of 10 repetitions. 
    }
    \centering
    \scriptsize
    % \begin{tabular}{l|ccccc}
    % \hline
    %  & \#Boundary  & \#Total  & F-norm & Acc\% & R$(0.2)$\\
    % \hline
    %     CE & 43 (9) & 190 (32)  & 57 (3) & 100 (0)& 96.4 (0.6) \\
    %     Noisy    & 41 (9) & 216 (56) & 67 (3) & 100 (0)& 96.8 (0.5) \\
    %     Adv & 36 (7) & 172 (44) & 73 (4) & 100 (0) & 97.2 (0.5)\\
    %   \hline
    % \end{tabular}
    % \vspace{4pt}
    \begin{tabular}{l|ccccc}
    \hline
     & \#Boundary  & \#Total  & F-norm & Acc\% & $R(0.2)$\\
    \hline
    Initial & 29 (17) & 290 (29)   & 6.8 (0.61) & 50.1 (1.1) & -\\ 
    \hline
        CE & 43 (5.3) & 190 (24)   & 57 (3.6) & 100 & 96.4  \\
        Noisy    & 41 (3.1)  & 216 (26)  & 67 (2.6)  & 100 & 97.0  \\
        Adv & 36 (4.6)  & 172  & 73 (2.1) & 100  & 97.2 \\
      \hline
    \end{tabular}
    \label{tab:robust1}
\end{table}
\begin{table}
    \caption{Comparison of boundary piece counts in the 2D spiral case for ReLU network with layer widths 2-30-30-1. The reported number is an average (standard deviation) of 10 repetitions.}
    \centering
    \scriptsize
    % \vspace{4pt}
    \begin{tabular}{l|ccccc}
    \hline
     & \#Boundary  & \#Total  & F-norm & Acc\% & $R(0.02)$\\ \hline
      Initial & 90 (61) & 2432 (179)   & 20 (0.71) & 50.2 (1.2) & -\\ \hline
        CE & 377 (31) & 1915 (207)   & 283 (11) & 93.60 (1.8) & 94.3 (2.2) \\
        Noisy    & 272 (33)  & 1493 (114)  & 322 (17) & 99.15 (0.56) & 98.1 (0.51)\\
        Adv & 259 (21)  & 1241 (135)  & 356 (19) & 99.35 (0.38) & 98.9 (0.36) \\
       \hline
    \end{tabular}
    \label{tab:robust2}
\end{table}


In the simpler Gaussian mixture case, the strengths of Noisy and Adv are both set to $0.1$, the same as the variance of each mixing component. 
Figure \ref{fig:gaussian} shows the decision boundaries for CE, Noisy and Adv. Despite the apparent visual difference, the \#Boundary does not differ that much. In Table \ref{tab:robust1}, we can observe \#Boundary to be smaller on average for Noisy and especially Adv. 

The effects of Noisy and Adv become more significant in the more challenging spiral case. CE does not perform as consistently as Noisy or Adv and sometimes misses the spiral shape.  
The strengths of Noisy and Adv are both set to $0.01$, which is roughly the size of the margin. As can be seen from Table \ref{tab:robust2}, both \#Boundary and \#Total drop significantly, while F-norm does not decrease. 

On both datasets, compared with CE, Noisy and Adv have strong effects on reducing the boundary complexity. The same is not true for function complexity such as F-norm. 




\section{Discussion}
We advocate that proper regularization on the decision boundary is of critical importance to classification. As a proof of concept, we choose the number of linear pieces of ReLU networks to measure the boundary complexity, due to its well-definedness. 
The main technical contribution is the explicit formula to count the exact number of boundary pieces as well as total affine pieces. Empirical evaluation and justification are made on synthetic data and interesting properties of the boundary piece count are revealed. 


\paragraph{Limitations and extensions.}
(1) While the main focus of this work is on rectified linear units, our method can easily be extended to the leaky ReLU activation and essentially all other piecewise-linear activation functions. 
(2) In the experiments, we only evaluated binary classification. However, it is also straightforward to count the boundaries between any two given classes in the multi-class classification scenario. 
(3) In its present form, the computation scales impractically with network size for large models, especially as the input dimension and depth grow. The most time-consuming part is the Minkowski sum. However, most of these computations do not directly contribute to the level set. We believe that further optimizations could shed more light on the mechanics of training procedures. Moreover, incorporating differentiability would give a penalty term that regularizes a previously unaddressed aspect of the network. 
(4) Though intuitive, the number of boundary pieces may not be the best choice of complexity measure for classification, since it does not take finer details such as piece arrangement into consideration. How to better quantify boundary complexity remains an open question. 

% Secondly, due to the discrete nature, we do not have differentiability, nor a penalty term that regularizes a previously unaddressed aspect of the network. 
% On the other hand, from Table \ref{tab:robust2}, we can see that the final \#Boundary and \#Total are higher correlated. 
% Though they have different trends during training, the two quantities are deeply connected and training methods considered in this work cannot separate them. % \#Total leads to \#Boundary


\paragraph{Regularizing the boundary complexity.}
Given a measurable boundary complexity, regularizing it during the training process can be challenging.
Adversarial training or noise injection can act as a regularization for boundary complexity, as verified in our experiment.  
Defining suitable boundary complexity measurement and proposing direct and more efficient ways to control it is an open question. 
The aim of this work is to identify such an important problem and convince the readers that boundary complexity is indeed proper to regularize for classification robustness. 
Such regularization is not at odds with other established methods, but a healthy complement to existing literature. 
The level set sampling method proposed in \citet{atzmon2019controlling} may be a good starting point. Uncovering the link between our work and persistent homology \citep{chen2019topological} is also of interest. 
% and one can ask the question: given points from the level set, how to empirically define its complexity? 
We hope that further work will lead to our ultimate goal of designing practical and scalable algorithms for effective regularization, thus improving state-of-the-art performance in classification. 

%%%%%%%%%%%%%%


\bibliography{reference}

%%%%%%%%%%%%%%

\end{document}