 % Modeling high dimensional point clouds with the spherical cluster model


\section{Introduction}%%  (1.5 page)}
%%i%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\toblue Our work on the {\em spherical cluster model} lies at the confluence
of three topics: clustering algorithms, parametric cluster models,
and high dimensional data analysis.  \toblack

\ifLONG
\paragraph{Clustering methods.}  Clustering, namely the task
which consists in grouping data items into dissimilar groups of
similar elements, is a fundamental problem in data analysis at
large~\cite{xu2005survey}.  Existing clustering methods
may be ascribed
to four main tiers.
%%
{\em Hierarchical clustering} methods typically build a dendrogram whose
leaves are the individual items, and whose internal nodes aggregate similar
clusters~\cite{rod-peh-pcsa-73}.
%%
In {\em density based clustering} methods, a density estimate is
computed from the data, with clusters associated to the
catchment basins of local maxima~\cite{cheng1995mean}. Topological persistence may be
used to select the significant maxima~\cite{chazal2013persistence}.
%%
In {\em spectral clustering} methods,  clusters are defined from the
top singular vectors of the matrix representing the data (or their
similarity)~\cite{von2007tutorial}.
%%
{\em K-means and variants} aim at grouping the data points into a
predefined number $k$ of clusters so as to minimize the sum of
intracluster variances.  Such methods aim at solving an NP-hard
optimization problem, and the so-called smart-seeding strategy
\kmeanspp provides guarantees (in expectation) on the \kmeans
functional~\cite{arthur2007k}. In practice, this strategy is
superseded by a greedy {\em inertia} based criterion which consists
of picking a seed amidst a set of candidates--see \cite{arthur2007k} and
the \sklearn implementation of \kmeanspp.
%%
These methods are related to the problem of fitting (Gaussian) mixtures using 
Expectation-Maximization \cite{dempster1977maximum,kasarapu2015minimum}.
%%
We note in passing that the variety of clustering methods prompted the
development of methods to estimate the relevant number of
clusters--\eg the {\em elbow} method \cite{ng2012clustering}, as well
as methods to compare two clusterings~\cite{cazals2019comparing}.
\medskip

%% example, the smart seeding strategy of k-means yields an algorithm
%% with an approximation guarantee~\cite{arthur2007k}; yet, k-means++
%% still suffers from instabilities when the number of centers used is
%% larger than the {\em exact} number of clusters, as the clustering
%% obtained depends on the initial distribution of centers within the
%% clusters~\cite{von2007tutorial}.

\toblue
\paragraph{Cluster models.}  A central goal of clustering is to
provide insights into the geometry of the data.
This goal prompted the development of \ksubspace clustering
techniques, which belong to two tiers.
%%
The first one consists of methods in the lineage of (affine) sparse
subspace clustering
(SSC/ASSC)~\cite{elhamifar2013sparse,soltanolkotabi2012geometric,li2018geometric}.
These two step methods write each data point as a sparse linear
combination of other data points, and the coefficients found are used
to obtain the clusters via spectral clustering.  Their correctness
hinges on the ability of spectral clustering to separate the clusters,
which relies on conditions (\bblue{\eg the absence of intersection between
the affine supports of the clusters}) that may not be met in practice.
%%
The second tier involves clustering methods using an explicit, \ie
parametric, cluster model~\cite{parsons2004subspace,wang2009ksubspaces}.  These techniques
face two difficulties.  The first is to avoid overfitting using a
complexity penalty (AIC, BIC, MDL, MML)~\cite{grunwald2007minimum}, as
a richer model always decreases the fitting error--\eg a plane better
fits noisy data distributed along a line than the line itself.
%%
%%
The second is to obtain the cluster mixture representing the data, a
task usually addressed using \toblue
an Expectation-Maximization procedure
\toblack ~\cite{dempster1977maximum,wu1983convergence}.
However, the main difficulty for heterogeneous mixtures (\eg clusters
of varying dimension) is to navigate in the space of models, a
difficult question typically undertaken via (split, merge, delete)
operations on the mixture components \cite{kasarapu2015minimum},
or using a combination of EM and model selection~\cite{figueiredo2002unsupervised}.
%% The second one consists of Expectation-Maximization based strategies~\cite{dempster1977maximum},
%% possibly combined with model control using \eg the minimum message length
%% \cite{figueiredo2002unsupervised}. This is also a demanding strategies, in particular to control
%% the singularization of the components and their number.

\toblack

%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%
%% SHORT
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%
\else
\paragraph{Clustering methods.}  Clustering, namely the task
which consists in grouping data items into dissimilar groups of
similar elements, is a fundamental problem in data analysis at
large~\cite{xu2005survey}.  Existing clustering methods may be
ascribed to four main tiers, including hierarchical
clustering~\cite{rod-peh-pcsa-73}, density based
clustering \cite{cheng1995mean,chazal2013persistence}, spectral
clustering~\cite{von2007tutorial}, as well as \kmeans~\cite{arthur2007k}
and variants related to
Expectation-Maximization \cite{dempster1977maximum,kasarapu2015minimum}.
%%
The goal of clustering being to aggregate points into coherent groups,
equally important are {\em cluster models} aiming at providing
insights into the geometry of the data.  This goal prompted the
development of \ksubspace clustering techniques, which belong to two
tiers.
%%
Sparse subspace clustering
(SSC/ASSC)~\cite{elhamifar2013sparse,soltanolkotabi2012geometric,li2018geometric}
reconstructs points as (sparse) linear combinations of neighbors, while
parametric cluster models use an explicit analytical form for
clusters~\cite{parsons2004subspace,wang2009ksubspaces}.
\fi

%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

%% Zuo projection median  Y. Zuo, Projection-based depth functions and associated medians, Ann. Statist. 31 (5) (2003) 1460–1490.


% D.L. Donoho, Breakdown Properties of Multivariate Location Estimators (Ph.D. thesis), Harvard University, Boston, 1982.

%% [9] D. Donoho, M. Gasko, Breakdown properties of location estimates based on halfspace depth and projected outlyingness, Ann. Statist. 20 (4), (1992) 1803–1827.

%%  W.A. Stahel, Breakdown of Covariance Estimators, Technical Report 31, Fachgruppe für Statistik, Eïdgenoessische Technische Hochschule, Zürich, 1981, pp. 1--16




\toblue
\paragraph{Geometric centerpoints in data analysis.}
Cluster models providing insights on the geometry of a point
set also call for a discussion of high dimensional centerpoints and medians.
%%
The classical center of mass of a point set, which minimizes the sum of squared distances
to data points and is used in \kmeans, admits several important alternatives.
%%
The Fermat-Weber point is the point of $\Rd$ minimizing the sum of
Euclidean distances to all data points.  Unfortunately, this point is
hard to compute and unstable
\cite{kupitz1997geometric,weiszfeld1937point,bajaj1988algebraic,cohen2016geometric}.
%%
Building on Helly's theorem, a median can be defined as any point
whose Tukey depth is at least $n/(d+1)$~\cite{tukey1975mathematics}.
%%
\ifLONG
(The Tukey depth or halfspace depth of a point $x$ is the smallest number of data points
contained in any closed half-space containing $x$~\cite{tukey1975mathematics}.)
It is, however, challenging to compute. The classical randomized
algorithm~\cite{clarkson1993approximating} has been derandomized in
\cite{miller2009approximate}.  The complexity is subexponential in $d$,
but to the best of our knowledge, the algorithm is not practical.
\else
Such a point is, however, hard to compute 
\cite{clarkson1993approximating,miller2009approximate}.
\fi
%%
The projection median is defined by projecting the dataset onto random
lines, computing the univariate median for each projection, and
computing a weighted average of the data points responsible for these
univariate
medians~\cite{durocher2009projection,basu2012projection,durocher2017projection}.
It is an elegant, stable and remarkably effective generalization of
the univariate median.
%For statistical properties of these constructs, the reader is referred to
%~\cite{donoho1983notion,lopuhaa1991breakdown,davies2007breakdown}.
The projection median also underlies the construction of the
Donoho-Stahel estimator for outlier
detection~\cite{stahel1981breakdown,donoho1982thesis,donoho1992breakdown}.

%% Zuo projection median \cite{zuo2000general,zuo2003projection}
%% We note in passing that such centerpoints can be used to identify inliers/outliers
%% of a point set, by thresholding distance quantiles with respect to a centerpoint, and/or
%% via projections~\cite{todo-cite}.

\paragraph{Contributions.}  Two types of parametric cluster models have
been proposed recently~\cite{wang2009ksubspaces}: affine and spherical
clusters (SC). The former accommodates potentially unbounded (large)
clusters of arbitrary dimension.  The latter defines compact
(spherical) clusters based on the power distance of points with
respect to a sphere whose radius is (a fraction of) the variance of
distances to the cluster center--to be determined.
%%
However, the uniqueness of the SC center is not established,
and no algorithm is presented to compute it. (The calculation
presented assumes the center is known, and it solely observes that the
result obtained is consistent with the usual center of mass when the
fraction of the variance tends to zero~\cite{wang2009ksubspaces}.)

\toblue Our work, which focuses on the spherical cluster model, is
rooted in statistical analysis: identifying an object capturing a
global description of the point set. We make three contributions. 

First, we establish that the SC model is well posed, that is, its
solution is unique. Second, we present an exact solver using the
Clarke gradient on a suitable stratified cell complex defined from an
arrangement of hyper-spheres.  \toblue To the best of our knowledge,
our method is the first practical application of the general theory of
\textit{semiflows} of convex maps, and may be of independent interest
beyond the spherical clustering problem. In our setting, it is
numerically tractable and outperforms traditional convex optimization
frameworks, particularly in high dimensions, as demonstrated by our
experiments.  \toblue
%%
Third, we present experiments showing that the center of the SC model
behaves as a parameterized high-dimensional median.

These contributions have two direct practical applications and implications.
%%
First, the center of SC can be computed efficiently, which is of interest
to compute high dimensional centers and/or identify inliers/outliers
in high dimensional data analysis.
%%
Second, our algorithm and its implementation provide the missing
machinery to compute mixtures of spherical clusters in affine
subspaces of positive codimension. 

\toblack

\begin{comment}
Finally, we use SESCs  in a clustering algorithm performing
model selection.  We develop a smart seeding heuristic beyond
the usual smart seeding from \cite{arthur2007k}, in the spirit of
split/merge operations on mixture components
~\cite{kasarapu2015minimum}. Our clustering scheme solves a
limitation of the approach from \cite{wang2009ksubspaces}, where
\quoteen{we choose the model with the smallest dispersion.}  -- a
choice favoring higher intrinsic dimension.
\end{comment}

All proofs and detailed algorithms are provided in the Supporting Information.

\section{\toblue Parametric cluster models: affine and spherical clusters} %%  and  Subspace Embedded Spherical Clusters} 
\label{sec:sesc}
%%i%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Notations and terminology}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

%% Let $D$ be a set of $n$ points in $\Rd$. Let $C_1, \dots, C_k$ be
%% the clusters.  We consider a set of subsets $D_1,\dots,D_k$ forming a
%% partition of $D$, with $D_\ell$ the set of points associated to cluster
%% $C_\ell$.

\paragraph{Geometry.}
Let $D$ be a set of $n$ points in $\Rd$.  We consider a partition of
$D$ into $k$ clusters $C_1, \dots, C_k$, with $D_\ell$ the set of
points associated to cluster $C_\ell$.
%%
%%The unbiased variance estimate for distances within cluster $D_\ell$ of center $c_\ell$ satisfies
\toblue The unbiased sample variance for  cluster $D_\ell$ of center $c_\ell$ satisfies\toblack
\begin{equation}
\label{eq:stdevdist}
\stdevdist* =  \frac{1}{n-1}\sum_{x_i\in D_\ell} \vvnorm{x_i - c_\ell}^2
\end{equation}

Let $A = c + V$ be an affine space, with $c$ a
point in $\Rd$ (think cluster center), and $V$ a vector space.  For any
point $x \in \Rd$, we denote by $\comppara{x-c}$ the orthogonal
projection of the vector $(x-c)$ onto $V$, and by $\compperp{x-c}$ the
orthogonal projection on $V^\perp$.

When fitting a model, the sum of squared distances from samples to the
model is called the {\em residual sum of squares (RSS)}, or dispersion
for short.

\toblue
Finally, a ball and a sphere of center $c$ and radius $R$ are respectively defined by
$\vvnorm{x-c}^2\leq R^2$ and $\vvnorm{x-c}^2 = R^2$.
\toblack

\paragraph{Parametric cluster models.}
%% the subsets of $D$ containing all points associated with the corresponding cluster. A point belongs to one and only one cluster, and one does not want empty clusters, so $D_1,\dots,D_k$ is a partition of $D$.
Take $C_\ell$ for $\ell \in \intrange{1}{k}$ and suppose $D_\ell$ is
known. Cluster $C_\ell$ is described by the parameter set $\theta_\ell =
(\theta_{\ell,1}, \dots, \theta_{\ell,r})$ and a function $d_\ell:
(x,C_\ell(\theta_\ell)) \mapsto d_\ell(x,C_\ell(\theta_\ell))$, that
is some distance from a point to the cluster. We call the description
of $C_\ell$ by the function $d_\ell$ a {\em parametric cluster model}.
%% that depends on $\theta_\ell$.
%%
We decompose the clustering problem into two sub-problems concerned with the minimization of
a  dispersion term based on squared distances:
%%
\begin{problem}[Cluster optimization] 
\label{pb:cluster-optim}
Let  $C_\ell$ be a parametric cluster. {\em Cluster optimization} is the optimization
problem seeking the cluster parameters minimizing the dispersion 
\begin{equation}
\label{eq:generaloptimsinglecluster}
\min_{\theta_\ell} \kmfunc, \text{ with } \kmfunc = \sum_{x \in D_\ell} d_\ell(x,C_\ell(\theta_\ell))^2.
\end{equation}
\end{problem}

%% REMOVED FOR AISTATS 2026
\begin{comment}
Finding the partition $D_1, \dots, D_k$ of $D$ minimizing the total dispersion yields
the global problem:
%%
\begin{problem}[Clustering] 
\label{pb:clustering}
Let a clustering be specified by the  cluster indicator matrix  $H \in \{0,1\}^{n \times k}$, with $x_i \in D_\ell \iff H_{i,\ell} = 1$. 
%%
{\em Clustering} is the optimization problem seeking the best partition and cluster parameters:
\begin{equation}
\label{eq:generaloptimmultipleclusters}
\min_{\theta_1, \dots, \theta_k, H} \Kmfunc, \text{ with } 
\Kmfunc = \sum_{\ell = 1}^k \sum_{x \in D_\ell(H)} d_\ell(x,C_\ell(\theta_\ell))^2,
\end{equation}
\end{problem}
%%
The celebrated \kmeans algorithm naturally follows this model, with function
$d_l$ the squared distance to the cluster center.  Problem
\ref{pb:clustering} however is much more general, making possible to
vary the distance $d_\ell$ on a per cluster basis, adopting suitable
local cluster models.
\end{comment}


\subsection{Affine and spherical clusters}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

\paragraph{Affine clusters.}
As a first generalization of \kmeans, one can consider the distance from a data point
to an affine subspace, yielding \ksubspace clustering~\cite{wang2009ksubspaces}:
%%
\begin{definition}[Subspace cluster] \label{def:subspacecluster}
Let $A = c + V$ be some affine subspace of $\Rd$ where $c \in
\Rd$ is a point and $V$ is an $m$-dimensional linear subspace. The
    {\em subspace cluster} $C_\ell(A)$ is a cluster, where the
    distance from a point $x$ to the cluster is the distance to the
    subspace:
\begin{equation}
\label{eq:subspacedistance}
d(x, C_\ell(A))^2 := d(x,A)^2 = \vvnorm{ \compperp{ x-c}}^2.
\end{equation}
%% with  $\compperp{x-c)}$ is the orthogonal projection of $x-c$ onto $V^\perp$.
\end{definition}

\paragraph{Spherical clusters.}
As noted in the Introduction, affine clusters may be confounded by
noise, and suffer from their non-compact nature.  This latter aspect
can be taken care of using {\em spherical clusters}. To see how,
recall that the power of a point $x$ with respect to a sphere $S(c,r)$
is defined by $\powerps{x}{S} = \vvnorm{x-c}^2 - r^2$.
%%
Following \cite{wang2009ksubspaces}, we define:
%%
\begin{definition}[Spherical cluster] 
\label{def:sphericalcluster}
Let $\eta \in (0,1)$ be a hyperparameter, and let $c_\ell$ be a
    point called the {\em cluster center}.  Given the set $D_\ell$,
    the distance function associated to the {\em spherical cluster}
    $C_\ell(c_\ell)$ reads as
%%
\ifTWOCOLUMNS
\begin{align}
\label{eq:sphericaldistance}
\begin{split}
d(x,C_\ell(c_\ell))^2 
&:=  \max \left(0, \vvnorm{x-c_\ell}^2 - \eta \stdevdist*  \right)  \\
&= \max \left(0, \powerps{x}{S(c_\ell, \sqrt{\eta} \stdevdist)}\right).
\end{split}
\end{align}
\end{definition}
\else
\begin{equation}
\label{eq:sphericaldistance}
d(x,C_\ell(c_\ell))^2 
:=  \max \left(0, \vvnorm{x-c_\ell}^2 - \eta \stdevdist*  \right)  = \max \left(0, \powerps{x}{S(c_\ell, \sqrt{\eta} \stdevdist)}\right).
\end{equation}
\end{definition}
\fi
%%
The rationale of this definition is that one wishes to find the center
minimizing  the cost of outliers--points outside the spherical cluster.
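
As a small illustration of Def. \ref{def:sphericalcluster} (a toy example of
ours, not taken from \cite{wang2009ksubspaces}), take $d=1$, $D_\ell = \{-1, 0,
1, 4\}$, $\eta = 1/2$, and the candidate center $c_\ell = 0$.  Assuming the
normalization of Eq. \ref{eq:stdevdist} with $n = 4$, the squared radius of the
sphere reads
\begin{equation*}
\frac{\eta}{n-1}\sum_{x_j \in D_\ell} \vvnorm{x_j - c_\ell}^2
= \frac{1}{2} \cdot \frac{1 + 0 + 1 + 16}{3} = 3 .
\end{equation*}
The points $-1$, $0$ and $1$ have squared distances $1$, $0$ and $1$ to
$c_\ell$, all below $3$: they are inliers and incur zero cost.  The point $4$
is an outlier with cost $\max(0, 16 - 3) = 13$, so that the dispersion
evaluated at this candidate center equals $13$.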

\toblue
\subsection{Spherical clusters: discussion}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%



Cluster optimization (Pb. \ref{pb:cluster-optim}) for the spherical
cluster model requires optimizing the following functional:
%%
\begin{align}
\label{eq:Feta}
\Feta{c} 
& := \sum_{x_i \in D_\ell} \max \left(0, \| x_i-c \|^2 - \frac{\eta}{n-1}\sum_{x_j\in D_\ell} \|x_j - c\|^2 \right)\\
& = \sum_{x_i \in D_\ell} \max \left(0,   \powerps{x_i}{S_c} \right).
\end{align}
Note that the power distance $\powerps{x_i}{S_c(c, R_c)}$ is taken
with respect to the sphere centered at $c$ and with squared radius
$R_c^2 = \nicefrac{\eta}{n-1}\sum_{x_j\in D_\ell} \vvnorm{x_j-c}^2$.

A central result of our work is to show that the optimal center
$\optexact$ is unique and can be computed efficiently. 

This optimization problem calls for several important comments.

\toblue

\paragraph{Using the (quadratic) power  distance.} Using
the power distance rather than the Euclidean distance serves two purposes.
%  $\vvnorm{x_j-c}^2$ rather than  $\vvnorm{x_j-c}$
%%
First, when $\eta\rightarrow 0$, the optimal center $\optexact$ converges to the usual center of mass
of the point set, namely the point minimizing the sum of squared distances.
%%
Thus, the centerpoint $\optexact$ may be seen as a parameterized center of mass.
%%
Second, using the squared distance simplifies the algebraic
calculations carried out in the next sections, and alleviates
constraints on number types to obtain a robust implementation.
This design choice is actually common.
%%
On the one hand, we have recalled in the Introduction the difficulty of
computing the Fermat-Weber point instead of the center of mass (COM).
%%
On the other hand, one should recall that the most general affine
Voronoi diagrams are power diagrams (replacing the Euclidean distance
by the power distance), while Voronoi diagrams using a multiplicative
version of the Euclidean distance are much more complex to
handle~\cite{boissonnat2006curved}.
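
The first purpose can be made explicit in the limiting case $\eta = 0$ (a
sketch under this simplifying assumption only; it does not by itself prove the
convergence as $\eta \rightarrow 0$).  The functional then reduces to the sum
of squared distances and, writing $\bar{x}$ for the center of mass of $D_\ell$
and $n$ for its number of points,
\begin{equation*}
\sum_{x_i \in D_\ell} \vvnorm{x_i - c}^2
= n \vvnorm{c - \bar{x}}^2 + \sum_{x_i \in D_\ell} \vvnorm{x_i - \bar{x}}^2 ,
\end{equation*}
which is minimized exactly at $c = \bar{x}$.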


\paragraph{Radius $R_c$ and its dependency on the centerpoint.} The radius used to define the sphere is not fixed but
  depends on the location of the centerpoint $c$.  It is this
  interplay which makes the problem difficult. Strictly speaking, $R_c$
  is the {\em parameterized} standard deviation of distances from $c$ to
  the data points.  We will abuse terminology and plainly speak of the
  {\em distance variance/standard deviation} to the center $c$.

The dependency of $R_c$ on $c$ introduces a subtle mix between inliers
and outliers: inliers lie inside $S_c$ and incur zero cost,
whereas outliers lie outside and pay the power distance to $S_c$.  The
optimization problem is therefore ruled by the balance
between these two point sets.

\paragraph{Non-smooth convex optimization problems.}
Convex optimization has been extensively studied over the past
decades. Our problem involves the optimization of a non-smooth convex
function, for which standard methods such as gradient descent may
perform poorly in certain cases. Theoretical guarantees are also
weaker: classical bounds on the number of iterations required to reach
a point within distance $\eps$ of a minimizer grow much faster as
$\eps \to 0$ than in the smooth setting \cite{Bubeck}.
As we shall see, our constructive proof and the associated algorithm
avoid this caveat.
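
For reference, a standard form of the non-smooth rate can be stated as follows
(recalled from the general literature on subgradient methods, see \eg
\cite{Bubeck}; the symbols $g$, $L$, $R$, $\gamma$, $c_t$ and $T$ are notations
introduced for this remark only, and the statement concerns the objective value
on a bounded domain rather than our functional specifically).  Let $g$ be
convex with subgradients of norm at most $L$ on a closed convex domain
contained in a ball of radius $R$ centered at the starting point $c_1$, and let
$c^*$ be a minimizer of $g$ on this domain.  The projected subgradient method
with constant step size $\gamma = R/(L\sqrt{T})$ satisfies
\begin{equation*}
g\left(\frac{1}{T}\sum_{t=1}^{T} c_t\right) - g(c^*) \leq \frac{R L}{\sqrt{T}},
\end{equation*}
so that reaching accuracy $\eps$ on the objective requires of the order of
$R^2 L^2/\eps^2$ iterations.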

\paragraph{Hyper-parameter $\eta$ and balance between inliers vs outliers.} The 
parameter $\eta$ determines the radius $R_c$ and therefore the
functional to be optimized.  Varying $\eta\in (0,1)$ yields a one
parameter family of optimization problems.  Let $\numoutliersGEN{c}$
be the number of outliers with respect to a sphere of radius $R_c$
centered at $c$.
%%
Upon varying $\eta$, two statistics of interest to assess the role of
$\eta$ are (i) the {\em average outlier cost} $\Feta(\optexact)/
\numoutliersSC$, and (ii) the {\em outlier ratio}
$\numoutliersCOM/\numoutliersSC$ which compares the number of outliers
yielded by our model and that associated with the usual COM.  See
Section \ref{sec:contenders-stats} for details.

\paragraph{Spherical clusters and their merits.}
The fundamental motivation underlying the spherical cluster model
is rooted in statistical analysis:
identifying an object capturing a global description of the point set.
%%
In a broad machine learning / data analysis context, this model is
attractive for several reasons.

First, its cluster center defines a high dimensional centerpoint which
can be compared to the usual center of mass and high dimensional
medians.
%%
Second, this cluster model provides a natural way to identify inliers
and outliers, and the {\em scale} at which they appear when varying
$\eta$.
%%
Third, the existence of an efficient (exact) algorithm to compute it
paves the way to mixture design algorithms--to be explored in further work.



%% In the following, we focus on the algorithm to compute the cluster that best fits the data when the points of $D_\ell$ are known, before putting it in the context of a Lloyd-like algorithm to compute the clustering.

%% \subsection{Clustering algorithm}
%% %%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%
%% Otherwise, criteria such as the   Bayesian Information Criterion (BIC),
%% which assess the model based on its fit to the data and its complexity, can be used.
%% Yet another  criterion is the Akaike Information Criterion.
\toblack

\section{Spherical cluster optimization}%% (3 pages?)}
\label{sec:sesc-opt}
%%i%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

We study the problem of Eq. \ref{eq:Feta}, simply denoting $\Feta$ as $F$
since $\eta$ is fixed.

\subsection{Functional decomposition and geometry of the sub-functions}
%%i-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

For a fixed data set $D_\ell$, we aim at minimizing over $\Rd$ the map  $\Feta{c}$ from Eq. \ref{eq:Feta}.


%%
To study the previous function, for each $x_i \in D_\ell$, let
\begin{equation}
%%\ftildeeta{c}  
f_{\eta,x_i}(c) 
:= \| x_i-c \|^2 - \eta \frac{1}{n-1}\sum_{x_j\in D_\ell} \|x_j - c\|^2.
\end{equation}
so that 
\begin{equation}
\label{eq:fetasum}
\Feta{c} = \sum_{x_i \in D_\ell} \max(0, f_{\eta,x_i}(c)).
\end{equation}
We first analyze the sub-functions $f_{\eta,x_i}$ in order to
analyze the main function $\Feta$.  In the sequel, we assume that (i)
the set $D_\ell$ is fixed, (ii) $x_i \in D_\ell$, (iii) $0 < \eta < 1
- 1/n$, and (iv) $\eta$ is fixed, so that we drop $\eta$ from the notations
(\eg writing $F, f_{x_i}$ instead of $F_{\eta}, f_{\eta, x_i}$).

%%
Studying the function $f_{x_i}$ benefits from the geometry of the
following {\em sink region} yielding a null cost: %%  (Fig. \ref{fig:sink-arrangement}):
%%
\begin{definition}
\label{def:sink}
The {\em sink region} $B_{x_i}$ is the set over which $f_{x_i}$ does
not contribute to $F$, that is $B_{x_i} := f_{x_i}^{-1}\left( (- \infty,
0 ] \right)$. We denote by $S_{x_i}$ its topological boundary.
%%
%% \begin{equation}
%% B_{x_i} := f_{\eta,x_i}^{-1}\left(\{0\}\right).   
%% \end{equation}
\end{definition}
%%
Remark that since $ \eta < 1 - \frac{1}{n}$, the intersection $B$ of
all $B_{x_i}$ is necessarily empty, as any $x$ belonging to this set
would have every $\vvnorm{x - x_i}^2$ strictly lower than the average
$\frac{1}{n} \sum_{x_j \in D_{\ell}} \vvnorm{x - x_j}^2$.
%\label{rk:emptyB}
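
The emptiness of $B$ can also be checked directly, assuming that $D_\ell$
contains at least two distinct points.  If some $x$ belonged to every
$B_{x_i}$, summing the $n$ inequalities $f_{x_i}(x) \leq 0$ would give
\begin{equation*}
\sum_{x_i \in D_\ell} \vvnorm{x_i - x}^2
\leq \frac{n \eta}{n-1} \sum_{x_j \in D_\ell} \vvnorm{x_j - x}^2 .
\end{equation*}
Since $\frac{n\eta}{n-1} < 1$, this forces $\sum_{x_j \in D_\ell} \vvnorm{x_j -
x}^2 = 0$, that is, all data points coincide with $x$, a contradiction.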

The following results from an elementary calculation.
%%
\begin{lemma}[Geometry of $B_{x_i}$] 
\label{lem:sink-geom}
Let $\eta' = \frac{n}{n-1} \eta$. Each map $f_{x_i}$ is proportional to a spherical power, and takes the form
\begin{equation}
f_{x_i}(c) = (1 - \eta') \left (\vvnorm{c - c_i}^2 - R_i^2 \right)
\end{equation} 
Putting $\bar{x} := \frac{1}{n} \sum_{x_j \in D_\ell} x_j$, the center $c_{i}$ and the radius $R_{i}$ of said sphere satisfy the following.
\begin{equation}
\label{eq:sink-spec}
\left \{
\begin{aligned}
c_i   &= \frac{x_i - \eta'\bar{x}}{1-\eta'},\\
R_i^2 &= \left\| \frac{x_i - \eta'\bar{x}}{1-\eta'} \right\|^2 
        - \frac{\|x_i\|^2 - \frac{\eta'}{n} \sum_{x_j \in D_\ell} \|x_j\|^2}{1-\eta'}.
\end{aligned}
\right.
\end{equation}

As a consequence the sink region $B_{x_i}$ is a non-empty closed ball of $\Rd$, and $S_{x_i}$ is its associated sphere. 
\end{lemma}
%%
As an immediate corollary, $\max(0, f_{x_i})$ is a convex map. In $\Rd \setminus B_{x_i}$, it is quadratic with gradient $\nabla f_{x_i}(c) = 2(1- \eta')(c - c_i)$, while being identically zero inside $B_{x_i}$.
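
The elementary calculation behind Lemma \ref{lem:sink-geom} can be sketched as
follows (the full proof is deferred to the Supporting Information; $\langle
\cdot, \cdot \rangle$ denotes the Euclidean inner product, a notation
introduced for this remark).  Expanding the squares,
\begin{align*}
\sum_{x_j \in D_\ell} \vvnorm{x_j - c}^2
&= n \vvnorm{c}^2 - 2n \langle c, \bar{x} \rangle + \sum_{x_j \in D_\ell} \vvnorm{x_j}^2, \\
f_{x_i}(c)
&= \left(1 - \tfrac{n\eta}{n-1}\right) \vvnorm{c}^2
 - 2 \left\langle c, x_i - \tfrac{n\eta}{n-1} \bar{x} \right\rangle \\
&\quad + \vvnorm{x_i}^2 - \tfrac{\eta}{n-1} \sum_{x_j \in D_\ell} \vvnorm{x_j}^2 .
\end{align*}
The leading coefficient $1 - \frac{n\eta}{n-1}$ is positive precisely when
$\eta < 1 - \frac{1}{n}$.  Factoring it out (it plays the role of $1 - \eta'$)
and completing the square in $c$ yields the center and squared radius of
Eq. \ref{eq:sink-spec}.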
%%

\subsection{Arrangement of hyper-spheres underlying the objective function}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%


%% Recall the observation of equation \ref{eq:fetasum}:
%% \begin{equation*}
%%     \Feta (c) = \sum_{x_i \in D_\ell} f_{\eta,x_i}(c)
%% \end{equation*}

\toblue From the previous lemma, $F$ is also continuous, convex and
piecewise quadratic.
\toblack
Finding the optimal cluster center requires
understanding the relationships between all sink regions.

\paragraph{Arrangement.}
An {\em arrangement} of hyper-surfaces is a decomposition of 
$\Rd$ into equivalence classes of points using their position
with respect to these hyper-surfaces \cite{halperin2017arrangements}. 
%Arrangements are a type of stratification of whole space $\Rd$.
%%
We apply this concept to the spheres bounding the sink regions (Lemma
\ref{lem:sink-geom}).
%%

%%
For $x \in \Rd$, consider the following
signature which states whether point $x$ lies outside/on/inside each of the
spheres $S_{x_i}$, $i \in \intrange{1}{n}$. It is a length $n$ vector with one entry in $\{-1, 0, 1 \}$ for each sink-defining ball:
%%
\begin{equation}
    \sigma(x) := (\mathrm{sign}(f_{x_1}(x)), \dots, \mathrm{sign}(f_{x_n}(x))).
    \label{eq:signature}
\end{equation}
The signature defines an equivalence relation, where two points are
equivalent if they have the same signature. We call {\em cells} the equivalence classes, and we use the notation $\mathcal{C}$ to denote them. By definition, cells are non-empty
and characterized by the three-set partition $I^+(\calC), I^0(\calC), I^-(\calC)$ of $\intrange{1}{n}$, where the sets are defined respectively as the sets of indices $i$ where $f_{x_i}$ are positive, zero, and negative. 
%%
Note that generically, $\tau+1 \leq d$ spheres in dimension $d$
intersect along a sphere of dimension $l=d-(\tau+1)$; thus we let the {\em
  dimension} of a cell $\calC$ be the number $d - \#I^0(\calC)$. Cells
of dimension $d$ are said to be {\em fully dimensional} and are open
subsets of $\Rd$, while others are said to be of \textit{positive
  codimension}.


\paragraph{Combinatorial decomposition of $F$.}
On a cell $\calC$, $F$ is determined by the values of $f_{x_i}$ where $i$ ranges over $I^+(\calC)$. More precisely, we have
\begin{equation}
F_{|\calC}(c) = \sum_{i \in I^+(\calC)} f_{x_i}(c) = f_{I^+(\calC)}(c),
\end{equation}
where $f_{J} := \sum_{i \in J} f_{x_i}$ for any subset $J$ of $\intrange{1}{n}$. In the same vein, we put $S_J := \bigcap_{i \in J} S_{x_i}$ so that in a generic configuration of spheres any cell with non-empty $I^0$ is a relatively open subset of $S_{I^0}$. We define $c_{J}$ to be the center of mass of all $c_i$ where $i$ ranges over $J$:
%\fc{define $c_J$ here and possibly mention the SI}
\begin{equation}
\label{eq:Cj}
\optcell{J} :=  \frac{1}{\# J} \sum_{i \in J} c_i.
\end{equation}
%%
Putting $R^2_J := \vvnorm{c_J}^2 + (\#J)^{-1} \sum_{i \in J} \left( R^2_i - \vvnorm{c_i}^2 \right)$, straightforward computations yield
\begin{equation}
f_{J}(c) = (1 - \eta')\#J \left [ \vvnorm{c - c_J}^2 - R_J^2 \right].
\end{equation}
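The computation behind this expression is short enough to be sketched here
(with $\langle \cdot, \cdot \rangle$ the Euclidean inner product, a notation
introduced for this remark).  Since $\sum_{i \in J} c_i = \#J \, c_J$,
\begin{align*}
\sum_{i \in J} \vvnorm{c - c_i}^2
&= \#J \vvnorm{c}^2 - 2\, \#J \langle c, c_J \rangle + \sum_{i \in J} \vvnorm{c_i}^2 \\
&= \#J \left( \vvnorm{c - c_J}^2 - \vvnorm{c_J}^2 \right) + \sum_{i \in J} \vvnorm{c_i}^2 .
\end{align*}
Subtracting $\sum_{i \in J} R_i^2$ and multiplying by $(1-\eta')$ shows that
$f_J(c)$ equals $(1-\eta')\#J$ times $\vvnorm{c - c_J}^2 - \vvnorm{c_J}^2 -
(\#J)^{-1} \sum_{i \in J} \left( R_i^2 - \vvnorm{c_i}^2 \right)$, which
identifies the squared radius $R_J^2$.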
\toblack
Since any full-dimensional cell $\calC$ is open, $F$ is twice differentiable in $\calC$ with gradient and Hessian as follows:
\begin{equation}
\label{eq:HessGradFullDim}
\begin{cases}
\nabla F_{|\calC}(c) = 2(1- \eta')\#I^+(\calC) \left( c - \optcell \right ) \\
H F_{|\calC}(c) = 2(1-\eta')\#I^+(\calC) \mathrm{Id}.
\end{cases}
\end{equation}


%% \begin{equation}
%% \supportcell \; : \; \left\lbrace 
%% \begin{aligned}
%%             \calA &\to \mathcal{P}( \intrange{1}{n} ) \\
%%             \calC &\mapsto \left\{ i \in \intrange{1}{n}  : \calC \cap B_{x_i} = \emptyset \right\}
%% \end{aligned} \right.
%% \end{equation}
%%

\subsection{Strict convexity and optimization}
\label{sec:strict-convex-optim}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

\toblue We have seen that $B = \bigcap_{x_i \in D_{\ell}} B_{x_i}$ is
empty, and from previous computations the Hessian of $F$ is
almost everywhere positive definite outside of $B$. This leads to the
strict convexity (and thus, well-posedness) of the problem.  \toblack
%%
\begin{theorem}[Strict convexity of $F$] 
%\label{thm:strcvxf}
Let $D_\ell$ be a set of $n$ points of $\Rd$, with at least two
distinct points, and assume that $0 < \eta < 1 - \frac{1}{n}$. 
%%
Then the associated $F$ map is $2(1 - \eta')$-strongly convex on $\Rd$, and is a fortiori strictly convex.
Its minimization problem admits exactly one solution in $\Rd$.
\end{theorem}
\color{black}

\paragraph{Minimum of $F$ on a full-dimensional cell.} 
From Eq. (\ref{eq:HessGradFullDim}), the minimum of $F$ is attained
on a full-dimensional cell $\calC$ if and only if the gradient of $F$
vanishes in $\calC$, leading to the following characterization.
\begin{equation}
\argmin[\Rd] F \in \calC \iff \optcell \in \calC.
\end{equation}
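
As a sanity check of this characterization, consider the following toy
configuration (our own illustrative choice): $d = 1$, $D_\ell = \{-1, 1\}$ and
$\eta = 1/4$, so that $n = 2$, the center of mass of the data is $0$, and the
leading coefficient of each $f_{x_i}$ is $1 - \frac{n\eta}{n-1} = \frac{1}{2}$.
Expanding and completing the square gives
\begin{equation*}
f_{x_1}(c) = \tfrac{1}{2} \left( (c+2)^2 - 3 \right), \qquad
f_{x_2}(c) = \tfrac{1}{2} \left( (c-2)^2 - 3 \right),
\end{equation*}
so that the sink regions are the disjoint intervals $[-2-\sqrt{3},
-2+\sqrt{3}]$ and $[2-\sqrt{3}, 2+\sqrt{3}]$.  On the full-dimensional cell
$\calC = (-2+\sqrt{3}, 2-\sqrt{3})$ both maps are positive, and the center of
mass of the two sink centers $-2$ and $2$ is $0$, which lies in $\calC$.  The
minimizer is therefore $c = 0$, with $F(0) = \frac{1}{2}(4-3) +
\frac{1}{2}(4-3) = 1$.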

\toblue The minimum of $F$ may be attained on a cell of positive
codimension (see \eg Fig. \ref{fig:opt-codim}, left), so restricting
attention to full-dimensional cells is not sufficient. Among cells of
positive codimension, there is no clear closed-form solution for a
minimum.  \toblack

\begin{figure}[htb]% or !htb or H
\begin{tabular}{cc}
\includegraphics[width=0.48\linewidth]{\wfig/objectivefunction-min-arrangement-3pts.png}&
\includegraphics[width=0.48\linewidth]{\wfig/objectivefunction-min-arrangement-3pts-bis.png}
\end{tabular}
\caption{{\bf Minima of $F$ on cells of various dimensions.}
Data points in orange, minima in red. Selected level sets (dotted lines) are also reported.}
\label{fig:opt-codim}
\end{figure}

\section{Optimization: computing the unique minimizer of $\Feta$}
\label{sec:optim}
%%i%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Having established the strict convexity of $\Feta$, 
we compute its unique minimizer. 
(As in the previous section, we simply denote 
$\Feta$ as $F$.)
%% Moreover, we denote $f_{x_i}$ by $f_i$ to ease notation.
%%
Our algorithm actually minimizes  a function of the form $\sum_{i} \max(0, \vvnorm{x - c_i}^2 - R^2_i)$,
and constructs a finite sequence of points $(x_n)$ by induction. The last point is the optimum of $F$.
%%
It assumes infinite precision -- the so-called \textit{real RAM model} --
and also assumes that all points $c_i$ are in generic position and that, whenever spheres $S_i$
intersect, they do so transversally.
%%
See Sections \ref{sec:numerics} and \ref{sec:genericity} for comments on these assumptions.
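
To fix ideas, the objective $\sum_{i} \max(0, \vvnorm{x - c_i}^2 - R^2_i)$ and the three-set partition used in the sequel can be sketched in a few lines of plain Python. This is a minimal illustration and not the actual implementation: the function names and the tolerance \verb|tol| are hypothetical, and we assume that $I^+$, $I^0$ and $I^-$ collect the indices whose power distance $\vvnorm{x - c_i}^2 - R_i^2$ is respectively positive, zero and negative.

\begin{verbatim}
```python
def power_distance(x, c, R):
    """Power distance ||x - c||^2 - R^2 of x to the sphere (c, R)."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c)) - R * R


def cell_partition(x, centers, radii, tol=1e-12):
    """Split sphere indices by the sign of the power distance at x.

    tol is a hypothetical absolute tolerance standing in for the exact
    sign test of the real RAM model.
    """
    I_plus, I_zero, I_minus = [], [], []
    for i, (c, R) in enumerate(zip(centers, radii)):
        p = power_distance(x, c, R)
        if p > tol:
            I_plus.append(i)
        elif p < -tol:
            I_minus.append(i)
        else:
            I_zero.append(i)
    return I_plus, I_zero, I_minus


def objective(x, centers, radii):
    """Sum over i of max(0, ||x - c_i||^2 - R_i^2)."""
    return sum(max(0.0, power_distance(x, c, R))
               for c, R in zip(centers, radii))
```
\end{verbatim}

For instance, with two unit circles centered at $(0,0)$ and $(4,0)$, the point $(1,0)$ lies on the first circle and outside the second one, so that only the second term contributes to the objective.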

%% \footnote{We do not require $\cap B_i$ to be empty, even though it is the case with the construction of $c_i, R_i$ from $x_i$.}. 

\subsection{Subdifferential and generalized gradient}

We face an unconstrained, non-smooth convex optimization problem.
Under such assumptions, guarantees on the convergence speed of
algorithms, expressed as a number of iterations, are rather weak. While
gradients exist almost everywhere, the classical gradient descent
method may get trapped in bottleneck situations \cite{Bubeck}, leading
to a precision rate of $O \left (\frac{1}{\sqrt{T}} \right)$, where
$T$ is the number of steps. Further non-smooth investigation revolves
around the use of the \textit{subdifferential} or equivalently the
\textit{Clarke gradient} \cite{clarke1997nonsmooth} of the
function. For a convex function $F$, the subdifferential/Clarke gradient
of $F$ at $x$, denoted by $\clarke F(x)$, is a convex set defined
below, while the generalized gradient $\gengrad F(x)$ is its element
of least norm (Fig. \ref{fig:clarke-gradient}):
\begin{equation}
\label{eq:clarke-gengrad}
\begin{cases}
\clarke F(x) & :=  \{ s, F(y) - F(x) \geq s \cdot ( y - x), \forall y\} \\
\gengrad F(x) & :=  \underset{u \in \clarke F(x)} \argmin \vvnorm{u}. 
\end{cases}
\end{equation} 

Gradient sampling methods~\cite{gradient_sampling} avoid the bottleneck
configurations described above by using a good descent direction
obtained by approximating the generalized gradient. Given an
arrangement of the space, the recent stratified gradient
sampling method \cite{stratified_gradient_sampling} uses the
arrangement to efficiently determine a good descent direction. To
tailor an exact algorithm, we use the structure at our
disposal. Indeed, for any fixed $x$ in $\Rd$, let $I^+, I^0$ and
$I^-$ be the three-set partition associated to the cell of $x$.  Then
the subdifferential of $F$ at $x$ can be expressed as follows.
\begin{equation}
\clarke F(x) = \{ \nabla f_{I^+}(x) + \sum_{i \in I^0} \lambda_i \nabla f_i(x) , 0 \leq \lambda_i \leq 1\}.
\end{equation}
The generalized gradient can thus be expressed as a solution to the following quadratic programming (QP) problem, which admits a unique solution in $\lambda$ when all $c_i$ are in generic position -- Sec. \ref{sec:alg-pro}:	
\begin{equation}
\label{eq:clarke-qp}
\gengrad F(x) = \underset{0 \leq \lambda_i \leq 1} \argmin \left \{ \lvert \lvert \nabla f_{I^+}(x) + \sum_{i \in I^0} \lambda_i \nabla f_i(x) \rvert \rvert^2 \right \}.
\end{equation}
Letting $(\alpha_i)_{i \in I^0}$ be the unique solution to this
problem, we let \footnote{Not to be confused with the sets
$I^+(\calC), I^0(\calC), I^-(\calC)$, defined from the sign of the
power distance.}
\begin{equation}
\label{eq:IzIpIm}
\begin{cases}
I^0_*(x) & := \{ i \in  I^0(x), 0 < \alpha_i < 1 \}\\
I^+_*(x) & := I^+(x) \cup \{ i \in I^0(x), \alpha_i = 1 \} \\
I^-_*(x) & := I^-(x) \cup \{ i \in I^0(x), \alpha_i = 0 \}.
\end{cases}
\end{equation}
Moreover, we let $\calC^*(x)$ be the cell with three-set partition
$I^+_*(x), I^0_*(x),I^-_*(x)$. \toblue This cell plays an important
role in our algorithm, as the following paragraph about semiflows will
demonstrate.\toblack
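
As a sanity check of these definitions, consider a hypothetical one-dimensional instance. We assume here that $f_i(x) = \vvnorm{x - c_i}^2 - R_i^2$, that $f_{I^+}$ is the sum of the $f_i$ over $I^+$, and that $I^+$ (resp. $I^0$) collects the indices with positive (resp. zero) power distance. Take $c_1 = 0$, $c_2 = 4$ and $R_1 = R_2 = 1$. At $x = 1$, one has $I^0 = \{1\}$, $I^+ = \{2\}$, $\nabla f_1(1) = 2$ and $\nabla f_2(1) = -6$, so that
\[
\clarke F(1) = \{ -6 + 2 \lambda_1, \ 0 \leq \lambda_1 \leq 1 \} = [-6, -4],
\qquad
\gengrad F(1) = -4,
\qquad
\alpha_1 = 1.
\]
Hence $I^0_*(1) = \emptyset$ and $I^+_*(1) = \{1, 2\}$, and $\calC^*(1)$ is the full dimensional cell $(1,3)$ lying outside both balls. On this cell, $\nabla F(x) = 2x + 2(x - 4)$ vanishes at $x = 2$, which belongs to the cell and is therefore the minimizer of $F$ in this toy example, with $F(2) = 6$.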

\paragraph{\textbf{Describing the semiflow of $F$}.}
Even though $\gengrad F$ might not be continuous, by convexity of $F$
from any starting point $x$ there exists (see for instance
\cite{subgradient_talweg, evolution_problems}) a trajectory $t \mapsto
x(t)$ (with $x(0) = x$) called a \textit{semiflow} verifying (for $t
\in \mathbb{R}^+$):
\begin{equation}
x'(t) = -\gengrad F(x(t)).
\end{equation} 
%Moreover, the nature of the stratification behind $F$ ensures that this trajectory has right derivative $-\gengrad F(x(t))$ for very $t \geq 0$. 
In particular $F(x(t))$ decreases at rate $\vvnorm{\gengrad
  F(x(t))}^2$, and by strong convexity $x(t)$ reaches the argmin of
$F$ over $\Rd$ in finite time, where it is stationary
\cite{evolution_problems, gradient_descent_o_minimal}. Given the
structure of our $F$, there are three possible behaviors for the
semiflow with starting point $x$:
\begin{itemize}
\item If $x$ is in a full dimensional cell, the semiflow starting from
  $x$ begins with a segment heading towards $c_{I^+(x)}$, until it
  reaches either a new cell or $c_{I^+(x)}$, which is then the minimum.
\item If $x$ lies in a cell of positive codimension and $I^0_*(x)$ is
  empty, the semiflow enters the non-empty, full dimensional cell
  $\calC^*(x)$ and follows a straight line in this cell until it meets
  a new cell, as described above.
\item Else, $I^0_*(x)$ is not empty. One can show that if the Clarke
  QP (Eq. \ref{eq:clarke-qp}) lies in what we call a non-degenerate
  position\footnote{We say that the QP problem of minimizing
  $\vvnorm{u + \sum_i \lambda_i v_i}, 0 \leq \lambda_i \leq 1$ lies in
  a non-degenerate position when the argmin $w$ is such that the set
  of $i$ such that $\scal{w}{v_i} = 0$ is exactly the set of $i$ such
  that the coefficient of $v_i$ in the decomposition of $w$ is neither
  0 nor 1. This condition is standard in the sense that for a given
  box, for almost all isometries acting on that box, the image box
  lies in a non-degenerate position.}, for small $t$ the semiflow
  enters the non-empty cell of positive codimension $\calC^*(x)$,
  which is a subset of $S_{I^0_*(x)}$. The set of points $x$ with a
  degenerate QP problem has measure zero. However, the set of points
  $x$ whose trajectory $x(t)$ reaches a degenerate QP position, which
  makes the semiflow intractable, does not.
\end{itemize}

\subsection{Exact algorithm}  
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

To seek the minimum of $F$, we develop algorithm \algoexact, which mimics
the semiflow trajectory except for the third type of trajectory described
above.  See Algo. \ref{alg:exact} for the pseudo-code -- Sec. \ref{sec:alg-pro}.

It can be decomposed into three main \textit{procedures}:
\algtele, \algline and \algsphere. The latter is further
described using two procedures, \algsphereinter and
\algminsphereinter. A sixth procedure, used in \algline and \algsphere,
is \algqp. Except for the latter, which consists in solving a classical
QP problem, their pseudo-code can be found in Sec.
\ref{sec:alg-pro}. Procedures \algline and \algsphere are illustrated in
Fig. \ref{fig:trajectory}.

(i) If $x_n$ lies in a full dimensional cell $\calC$ (usually at the
start of the algorithm), we check whether $\calC$ contains
$\optcell$. If so, we let $x_{n+1}$ be $\optcell$ and we stop the
algorithm. Since this step \toblue does not follow the semiflow
\toblack we call it the \algtele procedure.  If there is no
teleportation, we obtain $x_{n+1}$ from the \algline procedure within
$\calC$, which is described as follows. We seek the first point on the
half-line starting from $x_n$ and heading towards $c_{I^{+}(\calC)}$
that meets another cell, and we let $x_{n+1}$ be this point. This is done
by solving quadratic equations (in $t$) of the form $\vvnorm{x_n +
  tu - c_i}^2 = R_i^2$.
 
(ii) Else $x_n$ lies in a cell of positive codimension. We compute the
generalized gradient of $F$ at $x_n$ as well as the associated
$I^+_*(x_n), I^0_*(x_n), I^-_*(x_n)$ with the \algqp procedure (\ie
solving Eq. \ref{eq:clarke-qp}).
\begin{itemize}
\item If the generalized gradient is zero, the minimum has been
  reached and we can stop the algorithm.

\item If $I^0_*(x_n)$ is empty, follow the \algline procedure
  described earlier within the full dimensional cell
  $\calC^*(x_n)$. Take $x_{n+1}$ to be the point given by this
  procedure.

\item Else, the semiflow starting from $x_n$ stays in $S_{I^0_*(x_n)}$
  and we follow the \algsphere procedure, which consists in the
  following. Compute the point $y$ where $f_{I^+_*(x_n)}$ restricted
  to $S_{I^0_*(x_n)}$ reaches its minimum via a procedure called
  \algminsphereinter, described in more detail in the appendix.  If
  $y$ is in the cell $\calC^*$, let $x_{n+1}$ be $y$. Else, compute
  the center $c_S$ and radius $R_S$ of $S_{I^0_*(x_n)}$ via the
  \algsphereinter procedure. Using the parameterization $\lambda \in [0,1] \mapsto
  c_S + R_S\frac{(1 -\lambda) x_n + \lambda y - c_S}{\vvnorm{(1
      -\lambda) x_n + \lambda y - c_S}}$ of the geodesic on
  $S_{I^0_*(x_n)}$ from $x_n$ to $y$, compute the first point at which
  the geodesic leaves the cell $\calC^*$. Let $x_{n+1}$ be this point.
\end{itemize}
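
The geodesic step of \algsphere can be sketched as follows. Writing $w(\lambda) = (1-\lambda) x_n + \lambda y - c_S$ and $e = c_S - c_i$, the condition that the geodesic point lies on sphere $i$ is equivalent to $\scal{e}{w(\lambda)} = K \vvnorm{w(\lambda)}$ with $K = (R_i^2 - \vvnorm{e}^2 - R_S^2)/(2R_S)$, which becomes a quadratic equation in $\lambda$ after squaring. The code below is a minimal plain Python sketch under these assumptions, not the actual implementation: all function names and tolerances are hypothetical, the points $x_n$ and $y$ are assumed not to be antipodal, and only the spheres that do not contain the whole geodesic should be passed in \verb|centers| and \verb|radii|.

\begin{verbatim}
```python
import math


def _dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def geodesic_point(x, y, c_S, R_S, lam):
    """Point c_S + R_S * w / ||w|| with w = (1 - lam) x + lam y - c_S."""
    w = [(1.0 - lam) * xi + lam * yi - ci for xi, yi, ci in zip(x, y, c_S)]
    norm_w = math.sqrt(_dot(w, w))
    return [ci + R_S * wi / norm_w for ci, wi in zip(c_S, w)]


def _real_roots(A, B, C):
    """Real roots of A t^2 + B t + C = 0, including the linear case."""
    if abs(A) < 1e-14:
        return [] if abs(B) < 1e-14 else [-C / B]
    disc = B * B - 4.0 * A * C
    if disc < 0.0:
        return []
    s = math.sqrt(disc)
    return [(-B - s) / (2.0 * A), (-B + s) / (2.0 * A)]


def first_geodesic_crossing(x, y, c_S, R_S, centers, radii, eps=1e-9):
    """Smallest lam in (eps, 1] at which the arc from x to y meets a sphere.

    Returns (lam, index), or None when no sphere is met before y.
    """
    a = [xi - ci for xi, ci in zip(x, c_S)]  # w(lam) = a + lam * b
    b = [yi - xi for xi, yi in zip(x, y)]
    best = None
    for i, (c, R) in enumerate(zip(centers, radii)):
        e = [si - ci for si, ci in zip(c_S, c)]
        K = (R * R - _dot(e, e) - R_S * R_S) / (2.0 * R_S)
        alpha, beta = _dot(e, a), _dot(e, b)
        # Squared form of <e, w(lam)> = K ||w(lam)||.
        A = beta * beta - K * K * _dot(b, b)
        B = 2.0 * (alpha * beta - K * K * _dot(a, b))
        C = alpha * alpha - K * K * _dot(a, a)
        for lam in _real_roots(A, B, C):
            if not (eps < lam <= 1.0):
                continue
            # Squaring may add spurious roots: check the sphere equation.
            p = geodesic_point(x, y, c_S, R_S, lam)
            d = [pi - ci for pi, ci in zip(p, c)]
            if abs(_dot(d, d) - R * R) > 1e-9 * max(1.0, R * R):
                continue
            if best is None or lam < best[0]:
                best = (lam, i)
    return best
```
\end{verbatim}

The explicit check of the sphere equation is a design choice of this sketch: it discards the roots introduced by squaring without having to track the sign of $K$.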

\ifTWOCOLUMNS

\begin{figure}[htb]% or !htb or H
\centerline{\includegraphics[width=.95\linewidth]{\wfig/fig_exemple_algo_exact.pdf}}
\caption{{\bf \algline (from $x_0, x_1, x_2$) and \algsphere (from $x_3$) steps.} Underlying trajectories are depicted in dark green. Point $y$ is obtained by \algminsphereinter from point $x_3$.}
\label{fig:trajectory} 
\end{figure} 

\else

\begin{figure}[htb]% or !htb or H
\centerline{\includegraphics[width=.6\linewidth]{\wfig/fig_exemple_algo_exact.pdf}}
\caption{{\bf \algline (from $x_0, x_1, x_2$) and \algsphere (from $x_3$) steps.} Underlying trajectories are depicted in dark green. Point $y$ is obtained by \algminsphereinter from point $x_3$.}
\label{fig:trajectory} 
\end{figure} 

\fi

\toblue
Following the semiflow ensures that our algorithm converges in a bounded
number of steps when started in a certain neighborhood of the point $x^*$ where $F$
reaches its minimum. The number of steps is related to the number of
faces of the Clarke gradient $\clarke F(x^*)$, which is $3^c$
where $c$ is the number of spheres on which $x^*$ lies.
We refer the reader to the proof in the appendix
for the quantification of the neighborhood size.
\toblack

\newcommand{\thmStatement}{\toblue 
%% There exists a ball centered at the point $x^*$ where $F$ reaches its
%% minimum, with radius $R > 0$ depending on (i) the maximum radius of
%% the ball only cells touching $x^*$ (ii) the Clarke gradient of
%% $\clarke f(x^*)$, in which the algorithm converges in at most
%% $3^{c}$ steps, where $c$ is the number of spheres on which $x^*$
%% lies.
%%
Denote by $x^*$ the point where $F$ reaches its minimum, and let $c$ be
the number of spheres on which $x^*$ lies.
%%
There exists a ball centered at $x^*$
 with radius $R > 0$, depending on (i) the maximum radius of
a ball meeting only cells touching $x^*$, and (ii) the Clarke gradient
$\clarke F(x^*)$, such that the algorithm started in this ball converges in at most $3^{c}$ steps.
\toblack
}

\begin{theorem}[Algorithm convergence]
\label{thm:convergence}
\thmStatement
\end{theorem}

\subsection{Combinatorial complexity}
\label{sub:complexity}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

The complexity (number of cells of all dimensions) of the arrangement
of $n$ spheres in $\Rd$ is $O(n^d)$~\cite{toth2017handbook} and the
bound is tight in the worst case. Despite this, we may expect a number
of steps polynomial in $n$ if calculations remain {\em local} in the
arrangement, a fact substantiated by our experiments. Note that for
small values of $\eta$, the center of mass provides a warm start to
the algorithm.  Indeed the solution of the minimization problem of
$F_{\eta}$ varies continuously in $\eta$, and the barycenter is
the solution to the problem with $\eta =0$.

Computations at each step are linear in the total number of
points, since \algline (resp. \algsphere) computes the first sphere
crossed by a line (resp. a sphere geodesic).  Other computations
involved at each step are at most cubic in the number of spheres on
which the current point lies, be it solving the \algqp problem,
inverting a matrix in \algsphereinter or computing
an affine projection in  \algminsphereinter.
%%

\toblue While the limiting factor of our approach is the absence of
a constructive bound on the number of steps, we point out that contrary
to classical methods, our theoretical analysis shows that the exact
minimizer is reached in a finite number of steps, and that this number
is bounded when the algorithm starts in a neighborhood of a certain
size (see Theorem \ref{thm:convergence} for the exact statement). In
classical methods such as subgradient descent, the final point is
guaranteed to lie at distance at most $\eps$ from the minimizer after a
number of steps tending to infinity as $\eps$ goes to 0.  To ensure
exact convergence with precise complexity guarantees, one could thus
use a subgradient descent and finish the job with our algorithm. In
practice, we did not have to resort to such ad-hoc methods, as the
algorithm outperforms the classical methods used in non-smooth convex
optimization -- see Sec. \ref{sub:comp_practice}.  \toblack




\subsection{Numerics}
\label{sec:numerics}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

\paragraph{Arithmetics and number types.}
The algorithm from Sec. \ref{sec:optim} is described assuming the real
RAM model computing exactly with real numbers.  On the other hand,
geometric calculations (predicates, constructions) are known to be
plagued with rounding errors~\cite{kettner2008classroom}.
%%
Serious difficulties may be faced for 
cascaded constructions, which iteratively embed new geometric objects
(the points of the pseudo-gradient trajectory in our case).
%%
Advanced number types combining multiprecision and interval
arithmetics can be used to maintain accurate representations in such
cases. See \eg random walks inside polytopes
\cite{chevallier2022improved} or trajectories of the flow complex (the
Morse-Smale diagram of the distance function to a finite point
set)~\cite{cazals2021frechet}.

In the sequel, we review the numerically demanding operations required by our algorithm,
and refer the reader to Sec. \ref{sec:experiments} for experiments with our Python-based implementation.

\paragraph{Exact solver.} The solver uses the following  predicates and constructions:\\
\sbulem{\bf Solving the \algqp procedure.} The library {\em cvxpy} relies on inexact solvers to find the minimum of a QP problem.
While those solvers usually give a result up to machine precision, we found them prone to instability when minimizing functionals of the type $\vvnorm{g + A\lambda}^2$ under the constraints $0  \preceq \lambda  \preceq 1$, when $g$ is a vector whose norm is much larger than both 1 and the norms of the columns of $A$:
in such cases, the solvers would claim the problem to be infeasible.
Solving the equivalent problem of minimizing $\vvnorm{\frac{1}{\vvnorm{g}}\left ( g + A\lambda \right)}^2$ was sufficient to address those issues.
The precise computation of the vector $\lambda$ is not needed, as we only need to determine which indices $i$ satisfy $\lambda_i = 0$, $\lambda_i \in (0,1)$ and $\lambda_i = 1$ respectively. The entries with values $0$ and $1$ are usually reached with precision greater than machine precision. %% , surely for geometry reasons.

\sbulem{\bf Solving the \algline and \algsphere procedures.} Starting from a point $x$, with a prescribed direction $u$, procedure \algline seeks the first $t$ such that $x +tu$ changes cell, that is, the smallest positive $t$ verifying $\vvnorm{x + tu - c_i}^2 = R^2_i$ for some $i$. This value is obtained from the roots of second-degree polynomials. To limit rounding errors, we solve these equations with a renormalized vector $u$ of norm 1.  Similarly, given two points $x, y$ on a sphere of radius $R_S$ and center $c_S$, procedure \algsphere computes the first point on the geodesic between $x$ and $y$ changing cell by solving a quadratic equation.

\sbulem{\bf Solving the \algsphereinter procedure.} The pair center/radius $c_S, R_S$ used above is obtained as the center and radius of an intersection of spheres. As described in the appendix, the center is obtained through the computation of a projection onto a hyperplane defined by linear equations involving the centers of said spheres. Computing $R_S$ is done by solving a quadratic equation.

\sbulem{\bf Solving the \algminsphereinter procedure.} Given an intersection of spheres $S_I$, with $I \subset \{ 1, \dots, n\}, \# I \leq d$, the \algminsphereinter procedure minimizes a function of the form $(f_J)_{|S_I}$ by computing a similar projection on a convex hull.

The genericity assumptions and the robustness of our routines are further discussed
in Sec. \ref{sec:genericity}.
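
The \algline computation described above can be sketched in plain Python as follows. With a unit direction $u$, the equation $\vvnorm{x + tu - c_i}^2 = R_i^2$ reads $t^2 + 2bt + q = 0$, where $b = \scal{u}{x - c_i}$ and $q = \vvnorm{x - c_i}^2 - R_i^2$. This is a minimal sketch and not the actual implementation: the function name, the return convention and the threshold \verb|eps| (used to skip the spheres on which $x$ already lies) are hypothetical, and the target point is assumed to differ from $x$.

\begin{verbatim}
```python
import math


def first_line_crossing(x, target, centers, radii, eps=1e-12):
    """First sphere met on the segment from x to target.

    Returns (t, index), where t is the distance travelled from x, or
    (dist, None) when the target is reached before any sphere.
    """
    d = [ti - xi for xi, ti in zip(x, target)]
    dist = math.sqrt(sum(di * di for di in d))
    u = [di / dist for di in d]  # renormalized direction of norm 1
    best_t, best_i = dist, None
    for i, (c, R) in enumerate(zip(centers, radii)):
        v = [xi - ci for xi, ci in zip(x, c)]
        b = sum(ui * vi for ui, vi in zip(u, v))  # <u, x - c_i>
        q = sum(vi * vi for vi in v) - R * R      # power distance of x
        disc = b * b - q
        if disc < 0.0:
            continue  # the line misses sphere i
        s = math.sqrt(disc)
        for t in (-b - s, -b + s):  # roots of t^2 + 2 b t + q = 0
            if eps < t < best_t:
                best_t, best_i = t, i
    return best_t, best_i
```
\end{verbatim}

Each call is linear in the number of spheres, in line with the per-step cost discussed in Sec. \ref{sub:complexity}.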


\paragraph{The BFGS solver.}
The Broyden–Fletcher–Goldfarb–Shanno quasi-Newton method (BFGS) is
designed to be very efficient on twice differentiable functions by
approximating the Hessian matrix without any matrix inversion (as
opposed to Newton's method), using the gradient. When the gradient
is not given, it is estimated using finite differences.
%%
While the  objective function $\Feta$ is not differentiable,
it is known that BFGS  works well in practice for
non-differentiable functions \cite{lewis2012nonsmooth}.
%%
The next section challenges this observation for $\Feta$.

\toblue
The BFGS solver is systematically launched from a warm start
at the center of mass of the point cloud processed -- as for the exact solver.
\toblack


\subsection{Genericity assumptions}
\label{sec:genericity}

The genericity condition for our algorithm to work at a point $x$ is
that the intersection of spheres containing $x$ is transverse. For $k$
spheres, the existence of a non transverse intersection is checked as
follows.  Non transversality at a point $x$ reads as $\sum_i
\lambda_i (x-c_i)=0$ for some $\lambda \neq 0$, or equivalently $(\sum_i \lambda_i)x = \sum_i
\lambda_i c_i$, which requires discussing two cases: (Case 1) $\sum_i
\lambda_i \neq 0$, and (Case 2) $\sum_i \lambda_i = 0$. In each case,
we need to check whether the points $x$ satisfying these conditions also satisfy
the sphere equations.

To do so, Case 1 requires solving a QP problem. Case 2 requires
computing the null space of the matrix 
$A=( \{[c_i\ 1]^T\}_{i=1,\dots,k} )$, and if $\ker(A) \neq \{0\}$, one needs
to further check the sphere feasibility conditions, which requires
solving another linear system.

Also note that at a given point $x$, the test simply boils down to
checking that the vectors $x-c_i$ are linearly independent.
\medskip
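
The pointwise test mentioned above, namely the linear independence of the vectors $x - c_i$, can be sketched in plain Python with a Gaussian elimination. This is a minimal illustration under stated assumptions, not part of our implementation (which does not perform this check): the function names are hypothetical, and the absolute tolerance \verb|tol| implicitly assumes vectors of moderate norm.

\begin{verbatim}
```python
def rank(vectors, tol=1e-10):
    """Rank of a list of vectors, by Gaussian elimination with pivoting."""
    rows = [list(map(float, v)) for v in vectors]
    ncols = len(rows[0]) if rows else 0
    r = 0  # number of pivots found so far
    for col in range(ncols):
        if r == len(rows):
            break
        pivot = max(range(r, len(rows)), key=lambda k: abs(rows[k][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for k in range(r + 1, len(rows)):
            f = rows[k][col] / rows[r][col]
            rows[k] = [a - f * b for a, b in zip(rows[k], rows[r])]
        r += 1
    return r


def is_transverse_at(x, centers, tol=1e-10):
    """True when the vectors x - c_i are linearly independent."""
    normals = [[xi - ci for xi, ci in zip(x, c)] for c in centers]
    return rank(normals, tol) == len(normals)
```
\end{verbatim}

For instance, two unit circles centered at $(0,0)$ and $(2,0)$ are tangent at $x = (1,0)$: the vectors $x - c_i$ are collinear and the test fails, as expected.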


The exact computation of trajectories by our algorithm is more
involved. It requires cascaded degree two and degree four algebraic
numbers. (NB: degree two when intersecting a segment with a sphere; degree
four when intersecting a geodesic along a sphere with another sphere.)
%%
A robust numerical solution could be obtained using say an interval
number type with bounds of arbitrary precision, e.g. the iRRAM
library~\cite{muller2001irram}.

In practice though:
\begin{itemize}
\item We do not check the transversality condition, as even in medium
  dimensional spaces, the points where the intersections are not
  transverse are scarce, and our trajectories do not cross them.
\item We do not use elaborate number types, since the observed
  robustness of our floating point implementation did not require
  using them.
\end{itemize}



%% aistats Submission
%% \input{clustermodel-exp}

%% Although the BFGS optimizer works well in practice, computing the
%% exact center remains of interest.  Because the center may be located
%% on a cell of the arrangement of any dimension, an exact number type
%% (an algebraic number of degree 2) is needed to handle so-called exact
%% predicates \cite{cclt-dcska-09}. Indeed, an exact algorithm using
%% floating point numbers will inevitably be plagued by rounding
%% errors~\cite{kettner2008classroom}.  The reader is referred to SI
%% Section \ref{sec:exact-center} for the sketch of such an algorithm.

\begin{figure*}[htbp]
\centering
\begin{tabular}{cc}
\includegraphics[width=0.35\textwidth]{\wfig/yeast_std-SC-values-R2-function-of-eta.png} & \includegraphics[width=0.45\textwidth]{\wfig/yeast_std-SC-stacked-barplot.png}\\
{\scriptsize {\bf (A)} $\Feta$ and $R^2$ as functions of $\eta$} & {\scriptsize {\bf (B)} Step types in trajectory of \algoexact}\\
%%
\includegraphics[width=0.35\textwidth]{\wfig/yeast_std-SC-time-ratios.png} & \includegraphics[width=0.35\textwidth]{\wfig/yeast_std--sescmu1dot00-sesceta0dot90--randinit0-projection-plot.png}\\
{\scriptsize {\bf (C)} Ratio $\timeexact / \timebfgs$} & {\scriptsize {\bf (D)} Projection plot, $\eta=0.9$}
\end{tabular}
\caption{{\bf Yeast.} This dataset features  $1484$ points in dimension $d=9$.
%%
{\bf (A)} Dual plot.
{\bf (B)} Step types as a function of $\eta$.
{\bf (C)} Running times $\timeexact$ vs $\timelbfgs$.
{\bf (D)} Projection plot with inliers and outliers.
}
\label{fig:landsat}
\end{figure*}

\section{Spherical clusters: experiments}
\label{sec:experiments}
%%i%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Implementation}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

Our implementation of the algorithm from Sec. \ref{sec:optim}, written in
Python with NumPy, is denoted \algoexact and is termed the {\em exact
  method}. 
%(We slightly abuse terminology since our implementation
%does not comply with the real RAM model.)
It is available from the Core tier of the 
\sblwebhref,  in the
\href{https://sbl.inria.fr/doc/Cluster_spherical-user-manual.html}{\toblue Cluster spherical} package.
%%
We compare the solution
yielded by \algoexact against that yielded by \algobfgs -- the
optimization being done with BFGS.  The cluster centers are denoted
$\optexact$ and $\optbfgs$ respectively.

Calculations were run on a Dell Precision 5480 equipped with 20 CPUs
of type Intel(R) Core(TM) i9-13900H and 32~GB of RAM, running FedoraCore~42.
%%
%% All calculations took less than a handful of seconds, so that running times
%% are not further documented.
\ifTWOCOLUMNS
\begin{figure}[H]% or !htb or H
\centerline{\includegraphics[width=1.06\linewidth]{\wfig/figure_multitraj.png}}
\caption{
{\bf Spherical cluster: illustrations on a toy 2D dataset.}
{\bf (Left)} Trajectories from five different starting points, with $\eta = 0.5$ (Line/Sphere descents in blue/orange). 
%%
{\bf (Right)} Evolution of the cluster center for $\eta$ in $[0.1, 0.9]$ by steps of 0.1.} 
\label{fig:2D-illustrations} 
\end{figure} 
\else
\begin{figure}[H]% or !htb or H
\centerline{\includegraphics[width=.75\linewidth]{\wfig/figure_multitraj.png}}
\caption{
{\bf Spherical cluster: illustrations on a toy 2D dataset.}
{\bf (Left)} Trajectories from five different starting points, with $\eta = 0.5$ (Line/Sphere descents in blue/orange). 
%%
{\bf (Right)} Evolution of the cluster center for $\eta$ in $[0.1, 0.9]$ by steps of 0.1.} 
\label{fig:2D-illustrations} 
\end{figure} 
\fi


\subsection{Contenders, datasets and statistics}
\label{sec:contenders-stats}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

\paragraph{Contenders.}
We challenge \algoexact with two contenders denoted \algobfgs and
\algolbfgs respectively, using the BFGS and L-BFGS-B solvers provided
by {\tt scipy.optimize}.  Note that the latter uses a limited-memory
approximation of the Hessian, as opposed to a dense $O(d^2)$ sized matrix. 

\paragraph{Medium dimensional (MD) datasets. } 
We ran experiments on ten standard datasets used in clustering
experiments \cite{celebi2013comparative,carriere2025improved}, with
size $n\in [1484,200000]$ and dimension $d\in [9,77]$ -- Table
\ref{tab:datasetCLU}. Following common practice, on a per dataset
basis, we perform a min-max normalization on the coordinates to avoid
overly large ranges.

\paragraph{High dimensional (HD) datasets. } 
We use two datasets to explore the effect of high dimensionality.
%%
The  \datasetHMM dataset consists of  $N=1443$ protein sequences whose
biological function is unknown~\cite{vicedomini2022profileview}.  To
identify putative functions, each sequence is scored by $d=400$ Hidden
Markov Models (HMM) corresponding to major known protein functions,
yielding a $d$-dimensional point. Carbone et al. perform hierarchical
clustering on these points (Ward's method), yielding 16 clusters
(sizes in 11..176) of sequences expected to have identical functions.
%We further the geometric analysis of these clusters using our SESC
%model, computing a dimension, a center, and a radius.
%%
The 
\href{https://archive.ics.uci.edu/dataset/167/arcene}{Arcene} dataset
contains mass-spectrometric data meant to distinguish 
cancer versus normal patients, and has shape  $(n,d)=(900, 10000)$.
The $d=10000$ features correspond to protein abundances in human sera,
to which distractor features with no predictive power have been added.

\paragraph{Parameters.}
For each dataset, we explore values of  $\eta$ in $[0.1,0.9]$ by steps of 0.1 -- nine values in total.

\paragraph{Statistics and plots.} We define (Fig. \ref{fig:landsat} and SI):\\
%%
\sbulem{Spherical cluster square radius $R^2$.}
The square radius with respect to which the power distance is computed, that
is $\eta \hat{\sigma}^2$ -- see Eq. \ref{eq:stdevdist}.

\sbulem{Projection plot.} The plot of all points (data points, center
of mass, SC centers) onto the first two principal directions. Inliers
(resp. outliers) are displayed in blue (resp. orange).
%%
The number of outliers identified by our cluster model
is denoted $\numoutliersSC$. 
Similarly,  $\numoutliersCOM$ stands for the number of outliers defined with respect to a sphere
of the same radius centered at the center of mass.

\sbulem{\bf Dual plot}. Reports $\Feta$ and $R^2$ as a function
of $\eta$.  (NB: $R^2$ values are represented negated on this plot.)

\sbulem{\bf Stacked barplot.} The plot function of $\eta$ counting the
number of steps of each type (line descent, sphere descent,
teleportation) in \algoexact.

\sbulem{\bf Time ratio plots.} The plots for $\timeexact / \timebfgs$ and
$\timeexact / \timelbfgs$, comparing the running times of 
\algoexact against those of \algobfgs and \algolbfgs
respectively.

\sbulem  {\bf Average outlier cost plots.}
The plots  $\Feta(\optexact)/ \numoutliersSC$ and $\Feta(\optexact)/ \numoutliersCOM$.

\sbulem{\bf Outlier ratio plot.} The plot $\numoutliersCOM/\numoutliersSC$.

\sbulem{\bf Distance between points plot.} The plot 
comparing the distances between three special points:
$\optexact$, $\optbfgs$, and the projection median 
from~\cite{durocher2017projection}.
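
The two outlier counts and their ratio, as defined above, can be sketched in plain Python. This is a minimal illustration with hypothetical function names, assuming that a point is an outlier when its squared distance to the center strictly exceeds $R^2$. The argument \verb|sc_center| stands for a cluster center computed beforehand, \eg $\optexact$.

\begin{verbatim}
```python
def count_outliers(points, center, R2):
    """Number of points whose squared distance to center exceeds R2."""
    return sum(
        1 for p in points
        if sum((pi - ci) ** 2 for pi, ci in zip(p, center)) > R2
    )


def center_of_mass(points):
    """Coordinate-wise mean of the points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def outlier_ratio(points, sc_center, R2):
    """Outlier counts w.r.t. the center of mass and the SC center.

    Returns (n_com, n_sc, n_com / n_sc), the ratio being infinite
    when the SC center yields no outlier.
    """
    n_sc = count_outliers(points, sc_center, R2)
    n_com = count_outliers(points, center_of_mass(points), R2)
    ratio = n_com / n_sc if n_sc else float("inf")
    return n_com, n_sc, ratio
```
\end{verbatim}

On a toy example with one far away point, the center of mass is dragged towards that point and typically yields more outliers than a center close to the bulk of the data, which gives a ratio larger than one.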



%% \input{clustermodel-experiments.tex}

\subsection{Spherical cluster model}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

\begin{comment}
\ref{fig:DatasetCLU-times-values-exact-BFGS}
\ref{tab:DatasetCLU-times-exact-BFGS}
\ref{tab:DatasetCLU-values-exact-BFGS}

\ref{fig:DatasetCLU-times-values-exact-LBFGS}
\ref{tab:DatasetCLU-times-exact-LBFGS}
\ref{tab:DatasetCLU-values-exact-LBFGS}
\end{comment}

%% nb, ref is {fig:DatasetCLU-LBFGS}

\paragraph{Trajectories and centers.}
We build up an intuition by observing the trajectories followed by our
exact solver when varying the starting point, on a simple toy 2D
example (Fig. \ref{fig:2D-illustrations}, left).  We also note that even
for such simple cases, the center moves in a complex way when varying
$\eta$ (Fig. \ref{fig:2D-illustrations}, right). 
 
\paragraph{Running times and the burden of dimensionality.}
For the dataset \datasetCLU, we first inspect running times using the
ratios $\timeexact / \timebfgs$ (Fig. \ref{fig:DatasetCLU-times-values-exact-BFGS}, Tab. \ref{tab:DatasetCLU-times-exact-BFGS}),
%%
and $\timeexact / \timelbfgs$  (Fig. \ref{fig:DatasetCLU-times-values-exact-LBFGS}, Tab. \ref{tab:DatasetCLU-times-exact-LBFGS}).
%%
Using median values, the comparison shows that \algoexact is faster than \algobfgs up
 to $\eta=0.7$ included, while \algoexact is faster than \algolbfgs up
 to $\eta=0.3$ included. Increasing the value of $\eta$ results in
 larger spheres and more complex arrangements, whence the burden
 observed.

We perform the same analysis for the dataset \datasetHMM,
for the ratios $\timeexact / \timebfgs$ (Fig. \ref{fig:ProtHMM-times-values-exact-BFGS}, Tab. \ref{tab:ProtHMM-times-exact-BFGS}),
%%
and $\timeexact / \timelbfgs$  (Fig. \ref{fig:ProtHMM-times-values-exact-LBFGS}, Tab. \ref{tab:ProtHMM-times-exact-LBFGS}).
%%
Using median values again, \algoexact is two to four orders of
magnitude faster than \algobfgs and \algolbfgs.

For the Arcene dataset, BFGS turned out to be impractical.  We observe
that our exact algorithm is between two and five orders of magnitude
faster than \algolbfgs
(Fig. \ref{fig:DatasetHD-times-values-exact-LBFGS}).

Summarizing, \algoexact is faster
than \algobfgs and \algolbfgs for datasets of small/intermediate
dimension and small values of $\eta$, and orders of magnitude faster
than these two methods for high dimensional datasets.

\paragraph{Function values.}
We now compare the values yielded by the three contenders:
\algoexact vs \algobfgs: Fig. \ref{fig:ProtHMM-times-values-exact-BFGS} and Table \ref{tab:ProtHMM-values-exact-BFGS};
\algoexact vs \algolbfgs: Fig. \ref{fig:ProtHMM-times-values-exact-LBFGS} and Table \ref{tab:ProtHMM-values-exact-LBFGS}.
%%
While these values are on par for all values of $\eta$, 
we note that the approximate solvers 
are more prone to numerical instabilities,
in particular for \datasetCLU and for \datasetHD.

\paragraph{Outliers and the selection of $\eta$.}
As noticed earlier, the SC center depends both on inliers and outliers.
On all datasets processed, the outlier ratio
$\numoutliersCOM/\numoutliersSC$ lies in the interval $\sim [1, 3]$, which illustrates
the stringency of our  criterion to identify such points.

The outlier cost plot $\Feta(\optexact)/ \numoutliersSC$ is of particular interest
to capture the scale/cost of outliers.
The general behavior of this plot is a monotonic decrease
(\eg Fig. \ref{fig:qq-cluster_10}, Fig. \ref{fig:qq-cluster_11}), indicating
that {\em capturing} outliers is getting easier when increasing $\eta$.
However several datasets exhibit a non monotonic behavior
(Fig. \ref{fig:qq-yeast_std}, Fig. \ref{fig:qq-spam_std}), showing that {\em gaps} must
be crossed to capture certain outliers.
\toblack

\begin{comment}
* mean sesc cost/outlier, interesting cases:
yeast_std (1484,9)   : bumps
mfeat_std (2000, 77) : monotonic decrease
spam_std (4601, 58) : max then decrease
optdigits_std (5620, 65)
letter_std (20000,17)
\end{comment}

\subsection{Projection median}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

We also compare $\optexact$ against the projection median computed as
a weighted average \cite{durocher2017projection}.  As expected, their
distance increases as a function of $\eta$, showing that the cluster
center behaves as a parameterized point set center. See the
{\em Distance between points} plots in the Supporting Information.

%%
%%Similar observations hold for the other datasets processed--see SI.

\subsection{Discussion: complexity in practice}
\label{sub:comp_practice}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

\paragraph{Number of steps and multiplicity of cells.} For all
datasets and all values of $\eta$, we checked that the number of
cells traversed is negligible with respect to the worst case
complexity of the arrangement. We also checked
that in practice each cell is visited only once, contrary to what
can happen in pathological sphere configurations with bad starting
points. We draw a comparison with the celebrated simplex algorithm,
which has exponential complexity in the worst case, and yet remains in
use after almost 80 years. Our framework is similar in spirit, as the
arrangement is exponential in the dimension while the algorithm is
effective in practice. Note that we were unable to build an example
where the number of steps is anywhere near the total number of cells. The fact
that our trajectory benefits from a {\em warm start} at the center of
mass (warmer as $\eta$ decreases) certainly helps in reducing the
number of cells to be crossed before reaching the minimum. This explains
the efficiency of our algorithm against classical methods (whose
underlying trajectory is not stopped when crossing a cell) when $\eta$ is
not close to $1-1/n$.

\paragraph{Behavior in high dimensions.}  In higher dimensions, our
algorithm outperforms BFGS and L-BFGS. The main factor behind this
behavior is the computation time of the steps, which is cubic in
the number of spheres containing the current point of the
trajectory (see Section \ref{sub:complexity}). In practice, this
number is small, on average between 2 and 3 as shown by our
experiments: a vast majority of steps are either \algline procedures
or \algsphere on a small number of spheres, rarely more than five even
in high dimensions.
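
As a rough order-of-magnitude illustration, write $m$ for the number of
spheres containing the current point (a notation introduced here for
convenience), and assume that the cubic term dominates the cost of a
step, up to an unspecified constant $c$:
\[
T_{\mathrm{step}}(m) \approx c\, m^3, \qquad
T_{\mathrm{step}}(2) \approx 8c, \quad
T_{\mathrm{step}}(3) \approx 27c, \quad
T_{\mathrm{step}}(5) \approx 125c .
\]
Under this simplifying assumption, even the rare steps involving five
spheres cost only about $16$ times more than a step involving two spheres.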


\section{Outlook}
%%i%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Spherical clusters embedded in affine spaces of fixed dimension provide
useful insights into the geometry of high dimensional point clouds.

Our work shows that spherical clusters are well defined by a non smooth
strictly convex problem.  We also show that this optimization problem
is well poised and can be solved by an exact iterative procedure
following a semiflow on a stratified complex defined by an arrangement
of spheres. Quite remarkably, BFGS also solves all the instances we
processed to satisfaction.
%%
Yet, the exact solver is orders of magnitude faster than BFGS
based heuristics for high dimensional datasets (say $d>100$), and for
datasets of medium dimensionality and small values of $\eta$.
%%
Our experiments also show that the centers of spherical clusters behave
as a high dimensional median parameterized by the fraction $\eta$ of
the variance of distances between the cluster center and all points.

Our work calls for future developments in theory and in practice.

From a theoretical standpoint, understanding the complexity of our
exact method as a function of $\eta$ appears as a challenging problem.

\toblue From a practical standpoint, spherical and affine clusters
have been proposed as mixture components.
%%
However, fitting such mixtures is a challenging non convex problem
which requires monitoring the quality of the fit and the model complexity.
%%
To the best of our knowledge, two main strategies  have been explored
for this task. 
%%
The first one is based on split/merge/delete operations on components of the mixture,
a very demanding task~\cite{kasarapu2015minimum}.
%% Minimum message length estimation of mixtures of multivariate Gaussian and von Mises-Fisher distributions
%%
The second one consists of Expectation-Maximization based strategies~\cite{dempster1977maximum},
possibly combined with model control using \eg the minimum message length
\cite{figueiredo2002unsupervised}. This is also a demanding strategy, in particular to control
the singularization of the components and their number.
%%  title={Unsupervised learning of finite mixture models},

Defining spherical clusters embedded into affine spaces of varying
dimensionality appears as a very appealing yet non trivial
task.  If successful, we anticipate that such models will prove
extremely useful in data analysis at large, providing compact clusters
capturing the intrinsic dimension of the data, that could also be used
to define stratified complexes. 

\toblack
 




