%%i%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Non-smooth optimization}
\label{sec:nso}
%%i%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Application}

We optimize the function $\Feta$ of Eq.~(\ref{eq:Feta}) using the methods developed in
\cite{griewank2013stable} and \cite{griewank2016lipschitz}, which
belong to the realm of non-smooth analysis \cite{clarke1997nonsmooth}.


%% As already discussed in the previous sections, we define the
%% selection functions $f_\sigma$ as the function that is `active' in
%% the region with signature $\sigma$ i.e.
%%
Using the signatures $P_i(x)$ of Eq.~(\ref{eq:Pi-sig}),
the objective function $\Feta$ can be rewritten
as the following sum of {\em active} functions:
%%
\begin{equation}
\label{eq:Feta2}
\sigma = (\sigma_1,\sigma_2, \dots, \sigma_n) \in \{0,1\}^n \implies 
f_\sigma = \sum_{\sigma_i = 1} \tilde{f}_{\eta,x_i}.
\end{equation}

Before optimizing Eq. (\ref{eq:Feta2}), we recall the basics of non-smooth
analysis.

\subsection{Background}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

\paragraph{Subgradient and subdifferential.}
For a differentiable convex function $f$, the following inequality holds:
\begin{equation}
    \label{underestimate}
    f(x) \geq f(x^*) + \nabla f(x^*)^T (x-x^*)  \text{ for all } x \text{ in the domain of } f.
\end{equation}
This linearization of $f$ is a lower bound on $f$, and it is unique for a given $x^*$. For a non-differentiable convex function, however, such a linear lower bound need not be unique at every point of the domain of $f$. For this reason, subgradients are defined analogously, as the slopes of linearizations of $f$ that are lower bounds on $f$.

\begin{definition}[Subgradient]
    For a function $f:\Rd \to \Rnt$, the vector $s$ is said to be a subgradient of $f$ at the point $x^*$ if, for all $x$ in the domain of $f$,
    \begin{equation}
        \label{subgradient:defn}
        f(x) \geq f(x^*) + s^T (x-x^*).
    \end{equation}
\end{definition}

\begin{definition}[Subdifferential]
    The subdifferential of $f$ at the point $x^*$ is the set of all subgradients of $f$ at $x^*$, and is represented as $\partial f(x^*)$.
\end{definition}

At a point $x^*$ where $f$ is differentiable, we can see that $\partial f(x^*) = \{\nabla f(x^*)\}.$
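
As a brief illustration (a standard textbook example, added here for concreteness), consider $f(x) = |x|$ on $\mathbb{R}$ and $x^* = 0$. Inequality (\ref{subgradient:defn}) then reads $|x| \geq s\,x$ for all $x$, which holds if and only if $|s| \leq 1$, so that
\begin{equation*}
\partial f(0) = [-1,1],
\qquad
\partial f(x^*) = \{\operatorname{sign}(x^*)\} \ \text{ for } x^* \neq 0 .
\end{equation*}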

\begin{theorem}
Let $f : \Rd \to \Rnt$ be a convex function which is bounded below. Then the set $\partial f(x^*)$ is non-empty for any $x^*$ in the interior of the domain of $f$. Moreover, it is a compact and convex set.
\end{theorem}

\paragraph{Limiting gradient.}
As the name suggests, the limiting gradient is the set of the limits of the gradient of the function, 
in an infinitesimally small neighborhood of the point of interest:
%%
\begin{definition}[Limiting gradient]
\label{def:limitinggradient}
The limiting gradient of a function $f$ at the point $x$ is denoted by $\partial ^ L f(x)$ and is defined as 
\begin{equation}
\partial ^ L f(x) := \{\nabla f_\sigma(x) : f_\sigma(x) = f(x)\}
\end{equation}
\end{definition} 
%%
Note that the previous definition uses all the expressions of the function $f$
in the said neighborhood, complying with the continuity constraint.
Of course, when $F_\eta$ is differentiable at $x$, one has $\partial ^L F_\eta(x) = \{\nabla F_\eta(x)\}$.

The hyperplanes which lie `between' the limiting gradients are captured by the
notion of {\em subdifferential}, as stated in the following theorem:
%%
\begin{theorem}
    The subdifferential of a function $f$ at a point $x$ is the convex hull of its limiting gradient at $x$:
    \begin{equation}
        \label{thm:findsubdifferential}
        \partial f(x) = \text{conv}(\partial ^L f(x))
    \end{equation}
\end{theorem}
\toblack
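
For instance (an illustrative example, with notation chosen here), take $f(x_1,x_2) = \max(x_1,x_2)$, whose two selection functions are $f_1(x) = x_1$ and $f_2(x) = x_2$. At a point $x$ with $x_1 = x_2$, both selection functions coincide with $f$, so that
\begin{equation*}
\partial^L f(x) = \{(1,0)^T, (0,1)^T\},
\qquad
\partial f(x) = \text{conv}(\partial^L f(x)) = \{(\lambda, 1-\lambda)^T : \lambda \in [0,1]\}.
\end{equation*}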

\paragraph{Directionally active gradient and shortest distance.}

For pragmatic purposes which shall become clear later, we use the notion of directionally active gradient from \cite{griewank2016lipschitz}: a limiting gradient which reproduces the directional derivative of the function at a point along a given direction.

\begin{definition}[Directionally active gradient] \label{def:directionallyactivegradient}
    A directionally active gradient of $f$ at $x$ in the direction $d$ is a vector $g(x,d) \in \partial^L f(x)$ such that $f'(x,d) = g(x,d)^Td$, and such that $g(x,d)$ equals the gradient of a selection function which coincides with $f$ on a set whose tangent cone at $x$ contains the direction $d$ and has a non-empty interior.
\end{definition}
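
As a simple illustration (a one-dimensional example added for concreteness), let $f(x) = |x|$ and $x = 0$, where $\partial^L f(0) = \{-1, 1\}$. The directional derivative is $f'(0,d) = |d|$, hence
\begin{equation*}
g(0,d) = 1 \ \text{ for } d > 0,
\qquad
g(0,d) = -1 \ \text{ for } d < 0,
\end{equation*}
and in both cases $g(0,d)\,d = |d| = f'(0,d)$. The corresponding selection functions $x \mapsto x$ and $x \mapsto -x$ coincide with $f$ on $[0,\infty)$ and $(-\infty,0]$ respectively, whose tangent cones at $0$ contain $d$ and have a non-empty interior.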

Another useful operation is the computation of the minimum distance
from a point to a convex set defined as a convex hull.  More
specifically, $\algShortDistToP{h,G}$ represents the vector of minimum
length which connects the point $h$ and a point which is a convex
combination of points in the set $G$ (Fig. \ref{fig:short} and \cite{wolfe1976finding}):
%%
\begin{figure}
    \centering
    \includegraphics[width = 0.5\textwidth]{fig/shortest-distance-to-convex-hull-polytope.pdf}
    \caption{{\bf Shortest distance from point to a convex hull.} 
$d = \algShortDistToP{x, \text{conv}(-g_1, -g_2, -g_3)}$: 
$d$ is the vector of the shortest length pointing from $x$ to the convex hull of $-g_1$, $-g_2$ and $-g_3$.}
    \label{fig:short}
\end{figure}

\begin{equation}
\label{def:short}
\algShortDistToP{h,G} = \operatorname*{argmin} \ \left\{ \vvnorm{d} : d = \sum_{g_j \in G} \lambda_j g_j - h, 
\ \lambda_j \geq 0, \ \sum_j \lambda_j = 1 \right\}.
\end{equation}
%%
In non-smooth analysis, stationary points are characterized by
\begin{equation*}
0\in \partial  f(x) \Leftrightarrow 0 = d(x) =   -\algShortDistToP{0, \partial f(x)}.
\end{equation*}
At non-stationary points, the unique direction of steepest descent is given by $d(x)/\vvnorm{d(x)}$.
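
As a small worked example (added for illustration, with the set $G$ chosen here and $\vvnorm{\cdot}$ taken to be the Euclidean norm), let $f(x_1,x_2) = \max(x_1,x_2)$ and let $x$ satisfy $x_1 = x_2$, so that $\partial f(x) = \text{conv}(G)$ with $G = \{(1,0)^T, (0,1)^T\}$. Minimizing $\vvnorm{\lambda (1,0)^T + (1-\lambda)(0,1)^T}^2 = \lambda^2 + (1-\lambda)^2$ over $\lambda \in [0,1]$ gives $\lambda = 1/2$, whence
\begin{equation*}
\algShortDistToP{0,G} = \tfrac{1}{2}(1,1)^T,
\qquad
d(x) = -\tfrac{1}{2}(1,1)^T \neq 0,
\qquad
\frac{d(x)}{\vvnorm{d(x)}} = -\tfrac{1}{\sqrt{2}}(1,1)^T .
\end{equation*}
Such a point is therefore not stationary. In contrast, for $f(x) = |x|$ one has $0 \in \partial f(0) = [-1,1]$, so that $d(0) = 0$ and $x = 0$ is stationary.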


% \cite{griewank2013stable}
% \cite{griewank2016lipschitz}
% \cite{wolfe1976finding}

\subsection{Overall algorithm: pseudo-code}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

For the sake of completeness, we provide the pseudo-code from
\cite{griewank2013stable} and \cite{griewank2016lipschitz}, used to
minimize $\Feta$. The pseudo-code also uses algorithmic
differentiation -- subroutine \algDer, see
e.g. \cite{schmidt2022tinyad}.


\begin{algorithm}
    \begin{algorithmic}[1]
        \Require{$f$ the convex, continuous, non-differentiable function to be optimized}
        \Require{$x$ an initial guess for the answer}
        \Require{$q$ a quadratic error coefficient, initially chosen arbitrarily}
        \Require{$iter$ the maximum number of iterations}
        \Require{$iter\_inner$ the maximum number of iterations of the inner loop}
        \Statex

        \Procedure{\algOptimize}{$f,x,q,iter, iter\_ inner$}

            \For{$i$ in range $1$ to $iter$}{}
            
            \State{$\Delta x \gets  \algPLMin \text{ } ($\Call{\algDer}{$f,x,s$}$ + \frac{q}{2}\cdot \|s\|^2,x,q,iter\_inner)$} \Comment {Minimising the function in $s$}

            \If{$\| \Delta x \| = 0$} \State{break} \Comment{Minimum found}
            \EndIf
            
            \State{$\tilde{q} \gets \frac{2 \cdot \| f(x+\Delta x) - f(x) -\Call{\algDer}{f,x,\Delta x}\|}{\| \Delta x\|^2}$} \Comment{Evaluated at the current $x$, before the step is taken}

            \If{$f(x + \Delta x) < f(x)$} \State{$x \gets x + \Delta x$} \Comment{Decide whether to take the step}
            \EndIf
            
            \State{$q \gets \max{(q,\tilde{q})}$} \Comment{Update the quadratic error coefficient}
            
            \EndFor
            
            \State \Return{$x$}
        \EndProcedure

    \end{algorithmic}
    \caption{{\bf Minimizing a convex non-smooth function $f$. From \cite[Section 5.1]{griewank2013stable}.}}
%% xfc: removed: An algorithm to minimize our objective function $F_\eta(c)$, taking care of the non differentiabilities.}
    \label{alg:optimize}
\end{algorithm}
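
To illustrate the update of $q$ in Algorithm \ref{alg:optimize} (a sanity check added here, assuming that for a smooth $f$ the subroutine \algDer\ returns the ordinary linearization $\nabla f(x)^T \Delta x$), take $f(x) = x^2$ on $\mathbb{R}$. Then
\begin{equation*}
\tilde{q} = \frac{2\, | (x+\Delta x)^2 - x^2 - 2x\,\Delta x |}{|\Delta x|^2} = \frac{2\,|\Delta x|^2}{|\Delta x|^2} = 2 = f''(x),
\end{equation*}
so that the quadratic term $\frac{q}{2}\|s\|^2$ with $q \geq \tilde{q}$ bounds the linearization error of $f$ along the step $\Delta x$.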



\begin{algorithm}
    \begin{algorithmic}[1]
        \Require{$f$ the function which we are minimising}
        \Require{$x$ the current position}
        \Require{$q$ the quadratic error coefficient}
        \Require{$iter$ the maximum number of iterations}
        \Statex

        \Procedure{\algPLMin}{$f,x,q,iter$}

            \State{$d \gets \text{rand}()$}
            \State{$G \gets \emptyset$}
            \For{$i$ in range $iter$}{}
            \State{$g \gets g(x,d)$} \Comment{Directionally active gradient $g(x,d)$: Def. \ref{def:directionallyactivegradient}}
            \State{$G \gets G \cup \{g\}$}
            \State{$d \gets \algComputeStep{(f,x,q,G)}$} 
            \If{$\|d\| = 0$}
                \State{break}
            \EndIf
            \State{$\tau \gets \algCritMult{(f,x,d)}$}
            \State{$x\gets x + \tau d$}
            \State{Eliminate all $g \in G$ with $\sigma(g) \nsucc \sigma(x)$}
            \EndFor
            \State \Return{$x$}
        \EndProcedure
    \end{algorithmic}
    \caption{{\bf Minimization of a continuous piecewise linear function with a quadratic error term : \cite[Algo. 4]{griewank2016lipschitz}}}
    \label{alg:plmin}
\end{algorithm}



\begin{algorithm}
    \begin{algorithmic}[1]
        \Require{$f$ the function which we are minimising}
        \Require{$x$ the current position}
        \Require{$q$ the quadratic error coefficient}
        \Require{$G$ a subset of the limiting gradient of the function at $x$, where $G$ is non-empty}
        \Statex

        \Procedure{\algComputeStep}{$f,x,q,G$}

            \Repeat
                \State{$d \gets -\Call{\algShortDistToP}{qx\toblack,G}$} \Comment{$\algShortDistToP$:  Def. \ref{def:short}; no $qx$ anymore}
                \State{$g\gets g(x,d)$} \Comment{$g(x,d)$: Def. \ref{def:directionallyactivegradient}}
                \State{$G \gets G \cup \{g\}$}
            \Until{$g^Td \leq-\|d\|^2$}

            \State{eliminate all $\tilde{g} \in G$ with $\tilde{g}^Td \neq g^Td$}
            \State \Return{$d$}
        \EndProcedure

    \end{algorithmic}
    \caption{{\bf Compute Step : \cite[Algo. 2]{griewank2016lipschitz}} 
An algorithm to compute the direction of the next step to optimize our function.}
    \label{alg:computestep}
\end{algorithm}

\begin{algorithm}
    \begin{algorithmic}[1]
        \Require{$f$ the function which we are minimising}
        \Require{$x$ the current position}
        \Require{$d$ the step direction}
%%        \Require{NB: the  partial order used is defined in Eq. \ref{eq:partialorder}}
        \Statex

        \Procedure{\algCritMult}{$f,x,d$}
            \State{$\sigma \gets \sigma(x)$} \Comment{$\sigma (x)$ is the signature vector wrt the piecewise linear function}
            \State{Find the maximal $\tilde{\tau}$ such that $\sigma \preceq \sigma(x+\tau d)$ for all $0<\tau<\tilde{\tau}$ and $\sigma \npreceq \sigma (x+\tilde{\tau} d)$} 
            \State \Return{$\tilde{\tau}$}
        \EndProcedure

    \end{algorithmic}
    \caption{{\bf Computation of critical multiplier: \cite[Algo. 3]{griewank2016lipschitz}} An algorithm to compute the magnitude of the next step, so as to hit the next point of non-differentiability.}
    \label{alg:criticalmultiplier}
\end{algorithm}
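
As an illustration of the critical multiplier (a one-dimensional example added here, assuming as in \cite{griewank2016lipschitz} that the signature of $f(x) = |x|$ is $\sigma(x) = \operatorname{sign}(x) \in \{-1,0,1\}$, and that $\sigma \preceq \sigma'$ means that every nonzero component of $\sigma$ agrees with the corresponding component of $\sigma'$), take $x = -2$ and $d = 1$. Then
\begin{equation*}
\sigma(x + \tau d) = \operatorname{sign}(\tau - 2) = -1 = \sigma(x) \ \text{ for } 0 < \tau < 2,
\qquad
\sigma(x + 2d) = 0,
\end{equation*}
so that $\sigma(x) \npreceq \sigma(x + 2d)$ and the critical multiplier is $\tilde{\tau} = 2$: the point $x + \tilde{\tau} d = 0$ lies exactly on the kink of $f$.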


%%% xfc check later
\begin{comment}
\subsection{Additional comments}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

\subsubsection{$\Delta x$.}
%%iii--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--

Regarding the computation of $\Delta x$, we can either use existing
routines from some library if we are not so concerned with exact
number types, or use the algorithm as given in
\cite{griewank2016lipschitz} for minimising piecewise linear functions
with a quadratic error term.
\toblack

\subsubsection{Optimising piecewise linear functions with a quadratic error term}
%%iii--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--


For the sake of completeness, we outline the algorithm for minimising the algorithmic
derivative with a quadratic error term here, along with the subroutines. In this subsection, all functions in the algorithms will be of the piecewise linear with a quadratic error kind, unless stated otherwise. 



Algorithm \ref{alg:criticalmultiplier} computes the step length needed to reach the next non-differentiability in the direction $d$. Note that the signature vectors $\sigma$ used in this algorithm are not the signature vectors of our objective function $F_\eta (c)$. Rather, they are the signature vectors associated with the regions of the local piecewise linear approximation of $F_\eta(c)$ at some point, ignoring the quadratic error term, which is differentiable everywhere. As already mentioned, this subsection deals with piecewise linear functions with a quadratic error term.

The algorithm is formally stated in terms of a partial order $(\sigma,\preceq)$, where 
\begin{equation}
\label{eq:partialorder}
\sigma \preceq \sigma' \iff \sigma_i^2 \leq \sigma_i \sigma'_i \text{ for all components $i$ of $\sigma$ and $\sigma'$}
\end{equation}

Unpacking this definition, a signature $\sigma$ is `less than or equal to' a signature $\sigma'$ if every component of $\sigma$ that is equal to $1$ is also equal to $1$ in $\sigma'$.

\subsubsection{Numerics}
%%iii--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--


Of course, if we want an exact solution, then we must define custom
data types to fit our needs, since otherwise, the algorithm would
eventually succumb to numerical errors. In the previous section, the
algorithm assumes that exact numeric computations require unit time,
which is not strictly true, since we are in the domain of real
numbers, and not integers. Even if we assume that the centres of our
cells (where the corresponding point contributes a zero cost to our
objective function) have rational coordinates, and the radii of the
cells are rational, we will have to deal with algebraic numbers. These
algebraic numbers arise due to a variety of steps in the
algorithm. This includes the computation of $short$ which is a
distance from a point to a polytope. This may be an algebraic number
even though the point and the vertices of the polytopes have rational
coordinates. So, this means that the direction of descent may be a
vector with components that are algebraic numbers.

This is of course just half of our problem. The degrees of the
algebraic numbers may become arbitrarily large. For example, even if
the current value of $x$ is rational, the value of the shortest
distance from a point to a polytope is in general a degree 2 algebraic
number even if all vertices of the polytope have rational
coordinates. This would result in the critical multiplier itself being
an algebraic number. So, the next value of $x$ would be an algebraic
number. This would cause the value of the shortest distance to a
certain polytope in this step to be an even higher degree algebraic
number. In this way, the degree of the algebraic numbers could keep on
increasing, and thus, we need to have a dynamic data structure to
store each number, if we wish to process all computations with exact
number types.


\subsubsection{Guarantees for convergence and robustness}
%%iii--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--%--

The algorithm described in the previous subsections is more robust
than the BFGS algorithm for minimising the objective function
$F_\eta(c)$. BFGS mostly works fine, but its result might not make
sense, and we have no way of knowing whether a given result is
correct or nonsensical. The current algorithm, on the other hand,
is guaranteed to converge to a global minimum of the function.

In \cite[section 5.1]{griewank2013stable}, Griewank mentions that this
minimization procedure always converges to a global minimum of the
function at hand. However, no note on the minimization of the
underestimate with the quadratic error term is made. This issue is
later addressed in \cite{griewank2016lipschitz} where a method to find
the minimum of such functions is discussed and a proof which
guarantees convergence is included.
\end{comment}
