


\begin{abstract}
Subspace clustering aims at selecting a small number of original
coordinates (features) so that clusters are clearly identified in
the corresponding subspaces. Subspace techniques rely on parametric cluster models,
including affine, spherical, and Gaussian cluster models, to name a few.
To go beyond fully dimensional spherical cluster models and affine
clusters of arbitrary dimension, we introduce {\em Subspace-embedded
  spherical clusters} (SESC), a novel cluster model for compact clusters of
arbitrary intrinsic dimension.  The well-posed nature of such
clusters is established via the study of an optimization problem
relying on an arrangement of hyper-spheres. This arrangement is used
to exhibit a piecewise smooth strictly convex function, amenable to
non smooth optimization.
We illustrate the merits of the SESC model via comparisons against
projection medians and the distance to the measure, and for clustering.
\end{abstract}

%% We use our novel cluster model in a \kmeans like clustering algorithm
%% involving an improved seeding procedure and the Bayesian Information
%% Criterion (BIC) for model selection.
%% %%
%% We illustrate our algorithm on synthetic and real
%% datasets, obtaining mixtures with components of
%% heterogeneous dimension, an appealing starting point to further mine
%% the data.


%% present experiments on a variety of datasets, both synthetic and real
%% world.  These experiments illustrate a clear reduction of the
%% dispersion criterion.  The identification of mixtures involving
%% clusters of various dimensions illustrates the potential of such
%% unsupervised techniques to study complex phenomena.




\begin{comment}%% TL/DR
This work introduces Subspace-embedded spherical clusters, a novel
cluster model for compact clusters of arbitrary intrinsic dimension,
providing geometric insights into the sample points grouped together.
\end{comment}

%% \noindent{\bf Keywords:} subspace clustering, subspace embedded
%% spherical clusters, arrangements of hyper-spheres, non smooth
%% optimization, kmeans-++, smart seeding, mixture models.

\noindent{\bf Keywords:}
subspace clustering, spherical clusters, centerpoints, medians, 
non smooth optimization

\section{Introduction}%%  (1.5 page)}
%%i%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


%% \begin{itemize}
%% \item clustering and hardness
%% \item kmeans pp and approx guarantees
%% \item num of clusters and comp clusterings
%% \item dim reduc and clustering
%% \item more general cluster models
%% \item model selection AIC, BIC, MDL
%% \item ICA
%% \end{itemize}


\paramini{Clustering methods.}  Clustering, namely the task
which consists in grouping data items into dissimilar groups of
similar elements, is a fundamental problem in data analysis at
large~\cite{xu2005survey}.  Existing clustering methods
may be ascribed
to four main tiers.
%%
{\em Hierarchical clustering} methods typically build a dendrogram whose
leaves are the individual items, and whose internal nodes are obtained
by iteratively merging similar clusters~\cite{rod-peh-pcsa-73}.
%%
In {\em density based clustering} methods, a density estimate is
computed from the data, with clusters associated to the
catchment basins of local maxima~\cite{cheng1995mean}. Topological persistence may be
used to select the significant maxima~\cite{chazal2013persistence}.
%%
In {\em spectral clustering} methods,  clusters are defined from the
top singular vectors of the matrix representing the data (or their
similarity)~\cite{von2007tutorial}.
%%
{\em k-means and variants} aim at grouping the data points into a
predefined number $k$ of clusters so as to minimize the sum of
intra-cluster variances.  Such methods aim at solving an NP-hard
optimization problem, and the so-called smart seeding strategy
\kmeanspp provides guarantees (in expectation) on the \kmeans
functional~\cite{arthur2007k}. In practice, this strategy is
superseded by a greedy {\em inertia} based criterion which consists
of picking a seed amidst a set of candidates (see \cite{arthur2007k} and
the \sklearn implementation of \kmeanspp).
%%
These methods are related to the problem of fitting (Gaussian) mixtures using 
Expectation-Maximization \cite{dempster1977maximum,kasarapu2015minimum}.
%%
We note in passing that the variety of clustering methods prompted the
development of methods to estimate the relevant number of
clusters--\eg the {\em elbow} method \cite{ng2012clustering}, as well
as methods to compare two clusterings~\cite{cazals2019comparing}.
\medskip

%% example, the smart seeding strategy of k-means yields an algorithm
%% with an approximation guarantee~\cite{arthur2007k}; yet, k-means++
%% still suffers from instabilities when the number of centers used is
%% larger than the {\em exact} number of clusters, as the clustering
%% obtained depends on the initial distribution of centers within the
%% clusters~\cite{von2007tutorial}.


\paramini{Cluster models.}  Setting aside complexity issues and the
choice of the number of clusters, the previous methods do not provide
any insight on the geometry of clusters, \eg their intrinsic
dimension.
%%
This limitation prompted the development of \ksubspace clustering
techniques, which belong to two tiers.
%%
The first one consists of methods in the lineage of (affine) sparse
subspace clustering
(SSC/ASSC)~\cite{elhamifar2013sparse,soltanolkotabi2012geometric,li2018geometric}.
These two-step methods write each data point as a sparse linear
combination of other data points, and the coefficients found are used
to obtain the clusters via spectral clustering.  Their correctness
relies on the ability of spectral clustering to separate the clusters,
which relies on conditions that may not be met in practice.
%%
The second tier involves clustering methods using an explicit, \ie
analytical, cluster
model~\cite{parsons2004subspace,wang2009ksubspaces}.  These techniques
face two difficulties.  The first is to avoid overfitting using a
complexity penalty (AIC, BIC, MDL, MML)~\cite{grunwald2007minimum}, as
a richer model always decreases the fitting error--\eg a plane better
fits noisy data distributed along a line than the line itself.
%%
%% Such regularized clustering methods are somewhat similar the removal
%% of noisy features using dimensionality reduction to ease the
%% clustering task~\cite{niu2011dimensionality}.
%%
The second is to obtain the cluster mixture representing the data, a
task usually addressed using
Expectation-Maximization~\cite{dempster1977maximum,wu1983convergence}.
However, the main difficulty for heterogeneous mixtures (\eg clusters
of varying dimension) is to navigate in the space of models, a
difficult question typically undertaken via (split, merge, delete)
operations on the mixture components \cite{kasarapu2015minimum}.

\begin{comment}
In using parametric cluster model, an important question is the method
used to fit the cluster parameters. In \kmeans like algorithms, two
steps (assignment of samples to clusters, update of cluster
parameters) are mingled.  This strategy is similar to that used in the
clustering version of EM, which estimates both the mixture parameters
and the labels via completed likelihood maximization
\cite{celeux1992classification}. In this latter case though, a soft
assignment--as apposed to hard assignment in \kmeans--is
used. Interestingly, the relationship between both strategies has been
elucidated in information theoretical
terms~\cite{kearns1998information}.

Finally, we note in passing that such methods bear similarity with but
are quite different from co-clustering methods, which jointly cluster
data and features, finding blocks in a rectangle data matrix
~\cite{del2015non}. Indeed, subspace methods aim at characterizing
couplings between features and the intrinsic dimension of clusters.
%%
Inferring the geometry of clusters can also be done by mixing
Unsupervised and supervised techniques.
%%
For example, \kmeans can be used to define class labels, and linear
discriminant analyses used to perform subspace
selection~\cite{ding2007adaptive}.  
\end{comment}


\paramini{Geometric medians.}
Cluster models providing insights on the geometry of a point
set also call for a discussion of high dimensional medians.
%%
The Fermat-Weber point is the point from $\Rd$ minimizing the sum of
Euclidean distances to all data points.  Unfortunately, this point is
hard to compute and unstable
\cite{kupitz1997geometric,weiszfeld1937point,bajaj1988algebraic,cohen2016geometric}.
%%
Building on Helly's theorem, a median can be defined as any point
whose Tukey depth is at least $n/(d+1)$~\cite{tukey1975mathematics}.
%%
\ifLONG
(The Tukey depth or halfspace depth of a point $x$ is the smallest number of data points
contained in a closed half-space containing $x$~\cite{tukey1975mathematics}.)
It is, however, challenging to compute. The classical randomized
algorithm~\cite{clarkson1993approximating} has been derandomized in
\cite{miller2009approximate}.  The complexity is subexponential in $d$,
but to the best of our knowledge, the algorithm is not practical.
\else
Such a point is, however, hard to compute 
\cite{clarkson1993approximating,miller2009approximate}.
\fi
%%
The projection median is defined by projecting the dataset onto random
lines, computing the univariate median for each projection, and
computing a weighted average of the data points responsible for these
univariate
medians~\cite{durocher2009projection,basu2012projection,durocher2017projection}.
It is an elegant, stable and remarkably effective generalization of
the univariate median.
%For statistical properties of these constructs, the reader is referred to
%~\cite{donoho1983notion,lopuhaa1991breakdown,davies2007breakdown}.

\paramini{Contributions.}  Our work focuses on subspace clustering
using analytical cluster models.  Two types of such clusters have been
proposed recently~\cite{wang2009ksubspaces}: affine and spherical
clusters. The former accommodate potentially unbounded (large)
clusters of arbitrary dimension.  The latter are fully dimensional
clusters which essentially use as distance function the power distance
with respect to a sphere whose squared radius is a fraction of the variance
of the distances from the cluster points to the cluster center.

We make four contributions going beyond these works.
%
First, we study the mathematical structure of the center point
optimization for spherical clusters. We establish a result of strict
convexity, proving the uniqueness of the center point, in a way amenable to 
non smooth optimization.
%%
Second, we introduce subspace-embedded spherical clusters (SESC),
\ie spherical clusters embedded in affine spaces of
arbitrary dimension.
%%
Third, we present the geometric insights
yielded by SESCs of full dimension, 
via a comparison against the so-called Distance to the Measure,
and to the projection median.
%%
Fourth, we combine our SESC model and EM to identify complex mixtures
of clusters of varying dimension.

\begin{comment}
Finally, we use SESCs  in a clustering algorithm performing
model selection.  We develop a smart seeding heuristic beyond
the usual smart seeding from \cite{arthur2007k}, in the spirit of
split/merge operations on mixture components
~\cite{kasarapu2015minimum}. Our clustering scheme solves a
limitation of the approach from \cite{wang2009ksubspaces}, where
\quoteen{we choose the model with the smallest dispersion.}  -- a
choice favoring higher intrinsic dimension.
\end{comment}

All proofs are provided in the Supporting Information.

\section{Parametric cluster models and  Subspace Embedded Spherical Clusters} 
\label{sec:sesc}
%%i%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Notations}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

Let $D$ be a set of $n$ points in $\Rd$. Let $C_1, \dots, C_k$ be
the clusters.  We consider a set of subsets $D_1,\dots,D_k$ forming a
partition of $D$, with $D_\ell$ the set of points associated to cluster
$C_\ell$.
%%
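The unbiased variance estimate for distances within cluster $D_\ell$ of center $c_\ell$ is
\begin{equation}
\stdevdist* =  \frac{1}{n-1}\sum_{x_i\in D_\ell} \vvnorm{x_i - c_\ell}^2 ,
\end{equation}
where, with a slight abuse of notation, $n$ denotes the number of points of $D_\ell$ whenever a single cluster is considered.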

Let $A = c + V$ be an affine space, with $c$ a
point in $\Rd$ (think cluster center), and $V$ a vector space.  For any
point $x \in \Rd$, we denote by $\comppara{x-c}$ the orthogonal
projection of the vector $(x-c)$ onto $V$, and by $\compperp{x-c}$ the
orthogonal projection on $V^\perp$.

When fitting a model, the sum of squared distances from samples to the
model is called the {\em residual sum of squares (RSS)}, or dispersion
for short.

\subsection{Parametric cluster models}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%



%% the subsets of $D$ containing all points associated with the corresponding cluster. A point belongs to one and only one cluster, and one does not want empty clusters, so $D_1,\dots,D_k$ is a partition of $D$.

Take $C_\ell$ for $\ell \in \intrange{1}{k}$ and suppose $D_\ell$ is
known. Cluster $C_\ell$ is described by the parameter set $\theta_\ell =
(\theta_{\ell,1}, \dots, \theta_{\ell,r})$ and a function $d_\ell:
(x,C_\ell(\theta_\ell)) \mapsto d_\ell(x,C_\ell(\theta_\ell))$, that
is some distance from a point to the cluster. We call the description
of $C_\ell$ by the function $d_\ell$ a {\em parametric cluster model}.
%% that depends on $\theta_\ell$.
%%
We decompose the clustering problem into two sub-problems concerned with the minimization of
a  dispersion term based on squared distances:
%%
\begin{problem}[Cluster optimization] 
\label{pb:cluster-optim}
Let  $C_\ell$ be a parametric cluster. {\em Cluster optimization} is the optimization
problem seeking the cluster parameters minimizing the dispersion 
\begin{equation}
\label{eqn:generaloptimsinglecluster}
\min_{\theta_\ell} \kmfunc, \text{ with } \kmfunc = \sum_{x \in D_\ell} d_\ell(x,C_\ell(\theta_\ell))^2.
\end{equation}
\end{problem}
Finding the partition $D_1, \dots, D_k$ of $D$ minimizing the total dispersion yields
the global problem:
%%
\begin{problem}[Clustering] 
\label{pb:clustering}
Let a clustering be specified by the  cluster indicator matrix  $H \in \{0,1\}^{n \times k}$, with $x_i \in D_\ell \iff H_{i,\ell} = 1$. 
%%
{\em Clustering} is the optimization problem seeking the best partition and cluster parameters:
\begin{equation}
\label{eqn:generaloptimmultipleclusters}
\min_{\theta_1, \dots, \theta_k, H} \Kmfunc, \text{ with } 
\Kmfunc = \sum_{\ell = 1}^k \sum_{x \in D_\ell(H)} d_\ell(x,C_\ell(\theta_\ell))^2.
\end{equation}
\end{problem}
%%
The celebrated \kmeans algorithm naturally follows this model, with
$d_\ell$ the Euclidean distance to the cluster center.  Problem
\ref{pb:clustering} however is much more general, making it possible to
vary the distance $d_\ell$ on a per cluster basis, adopting suitable
local cluster models.
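%%

As a worked instance of Problem~\ref{pb:cluster-optim} (a standard
computation, recalled here as a sketch; the symbol $n_\ell := |D_\ell|$
is our notation), take for $d_\ell$ the Euclidean distance to a center
$c_\ell$. The dispersion is then differentiable, and
\begin{equation*}
\nabla_{c_\ell} \sum_{x \in D_\ell} \| x - c_\ell \|^2
= 2 \Bigl( n_\ell\, c_\ell - \sum_{x \in D_\ell} x \Bigr) = 0
\iff c_\ell = \frac{1}{n_\ell} \sum_{x \in D_\ell} x,
\end{equation*}
so that the optimal parameter is the center of mass of $D_\ell$, which
is the update step of \kmeans.

%%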

\paramini{Affine clusters.}
As a first generalization of \kmeans, one can consider the distance from a data point
to an affine subspace, yielding \ksubspace clustering~\cite{wang2009ksubspaces}:
%%
\begin{definition}[Subspace cluster] \label{def:subspacecluster}
Let $A = c + V$ be some affine subspace of $\Rd$ where $c \in
\Rd$ is a point and $V$ is an $m$-dimensional linear subspace. The
    {\em subspace cluster} $C_\ell(A)$ is a cluster, where the
    distance from a point $x$ to the cluster is the distance to the
    subspace:
\begin{equation}
\label{eqn:subspacedistance}
d(x, C_\ell(A))^2 := d(x,A)^2 = \vvnorm{ \compperp{ x-c}}^2.
\end{equation}
%% with  $\compperp{x-c)}$ is the orthogonal projection of $x-c$ onto $V^\perp$.
\end{definition}

\paramini{Spherical clusters.}
As noted in the Introduction, affine clusters may be confounded by
noise, and suffer from their non compact nature.  This latter aspect
can be taken care of using {\em spherical clusters}. To see how,
recall that the power of a point $x$ with respect to a sphere $S(c,r)$
is defined by $\powerps{x}{S} = \vvnorm{x-c}^2 - r^2$.
%%
Following \cite{wang2009ksubspaces}, we define:
%%
\begin{definition}[Spherical cluster] 
\label{def:sphericalcluster}
Let $\eta \in ]0,1[$ be a hyperparameter, and let $c_\ell$ be a
    point called the {\em cluster center}.  Given the set $D_\ell$,
    the distance function associated to the {\em spherical cluster}
    $C_\ell(c_\ell)$ reads as
%%
\ifTWOCOLUMNS
\begin{align}
\label{eqn:sphericaldistance}
\begin{split}
d(x,C_\ell(c_\ell))^2 
&:=  \max \left(0, \vvnorm{x-c_\ell}^2 - \eta \stdevdist*  \right)  \\
&= \max \left(0, \powerps{x}{S(c_\ell, \sqrt{\eta} \stdevdist)}\right).
\end{split}
\end{align}
\end{definition}
\else
\begin{equation}
\label{eqn:sphericaldistance}
d(x,C_\ell(c_\ell))^2 
:=  \max \left(0, \vvnorm{x-c_\ell}^2 - \eta \stdevdist*  \right)  = \max \left(0, \powerps{x}{S(c_\ell, \sqrt{\eta} \stdevdist)}\right).
\end{equation}
\end{definition}
\fi

The rationale of this definition is twofold: first, the radius of the
ball takes into account the dispersion of distances to the center;
second, a point within that ball has a null cost.  As opposed to
\kmeans, this distance takes into account the cluster geometry, not
just the center.
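%%

As a small numerical illustration (our own toy example, not taken from
the experiments), consider in dimension one the three points $D_\ell =
\{-1, 0, 1\}$ with center $c_\ell = 0$ and $\eta = 1/2$. The variance
estimate is $(1 + 0 + 1)/2 = 1$, so that the squared radius of the
sphere is $\eta \cdot 1 = 1/2$. Def.~\ref{def:sphericalcluster} then
gives
\begin{equation*}
d(0, C_\ell)^2 = \max(0,\, 0 - 1/2) = 0,
\qquad
d(\pm 1, C_\ell)^2 = \max(0,\, 1 - 1/2) = 1/2,
\end{equation*}
for a dispersion of $0 + 1/2 + 1/2 = 1$, to be compared with the value
$2$ obtained with the plain squared distance to the center.

%%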


%% (parameter that is tunable by the user but not optimized in the problem \ref{pb:cluster-opt}). Denote $\hat{\sigma}$ the unbiased estimator of the variance of the dataset. A point inside the ball of center $c_\ell$ and radius $\eta \hat{\sigma}$ is added in the cluster for free, while a point outside the ball costs its power with respect to the ball.

\subsection{Subspace-embedded spherical cluster}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

To obtain {\em tight} clusters dimension-wise (see
Sec.~\ref{sec:clustering}), we define spherical clusters within affine
spaces of prescribed dimension:
%%
\begin{definition}[Subspace-embedded spherical cluster] 
\label{def:cluster}
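Given a set of points $D_\ell$, we define a {\em subspace-embedded
  spherical cluster} as a spherical cluster of center $c$ embedded into 
an affine subspace $A = c + V$ of $\Rd$, with $V$ a linear subspace.
Let $\mu \in ]0,1]$ be a hyperparameter balancing the two terms of the distance.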
%%    
%% The cluster $C_\ell$ is described by a point $c$ and a linear subspace
%% $V$, where $c$ is the center of the ball of radius
%% $\sqrt{\eta}\hat{\sigma}$ as in definition \ref{def:sphericalcluster},
%%
%% Finally, let $\mu \in ]0,1]$ be a hyperparameter that controls the
%%     balance between the two parts of the distance. The distance to the
%%     cluster is defined as :
%% 
The corresponding distance  reads as
\ifTWOCOLUMNS
\begin{align}
\label{eqn:distance-sesc}
\begin{split}
& \dSESC^2 = \vvnorm{\compperp{x-c}}^2 +\\
& \mu \max \left(0, \vvnorm{\comppara{x-c}}^2 - \eta \frac{1}{n-1}\sum_{x_i\in D_\ell} \vvnorm{ \comppara{x_i-c}}^2 \right)
\end{split}
\end{align}
\else
\begin{equation}
\label{eqn:distance-sesc}
\dSESC^2 = 
\vvnorm{\compperp{x-c}}^2 + \mu \max \left(0, \vvnorm{\comppara{x-c}}^2 - \eta \frac{1}{n-1}\sum_{x_i\in D_\ell} \vvnorm{ \comppara{x_i-c}}^2 \right)
\end{equation}
\fi
\end{definition}
%%
For a fixed data set $D_\ell$, the sum of the previous quantities
yields the dispersion to be minimized.
%%
\begin{comment}
For a fixed data set $D_\ell$, obtaining the best cluster
$C_\ell(c,V)$ yields the following optimization problem:
%% as:
\begin{equation}
\label{eqn:optimizationproblem}
%%\min_{c,V} \sum_{x_i \in D_\ell} \| (x-c)_\perp \|^2 + \mu \max \left(0, \| (x-c)_\parallel \|^2 - \eta \frac{1}{n-1}\sum_{x_i\in D_\ell} \|(x_i-c)_\parallel\|^2  \right)
\min_{c,V} \sum_{x\in D_\ell} \dSESC^2.
\end{equation}
\end{comment}

\begin{remark}
\label{rmk:eta-mu}
The distance of Eq. (\ref{eqn:distance-sesc}) takes into account the
distance to the subspace (affine cluster, definition
\ref{def:subspacecluster}) as well as the distance to the sphere
(spherical cluster, definition \ref{def:sphericalcluster}). It is a
more general cluster model, which avoids the problems discussed
earlier. 
%%
The value of $\eta$ is related to the noise level of the data,
which can be estimated using \eg distances to k-nearest neighbors
\cite{biau2011weighted}.
%%
Tuning the hyperparameter $\mu$ makes it possible to
control the balance between the orthogonal and within-subspace
distances.
\end{remark}
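%%

To make the role of the parameters concrete, the following limiting
cases can be read off Eq.~(\ref{eqn:distance-sesc}). This is a sketch
derived from the formula only, using the orthogonal decomposition
$x - c = \compperp{x-c} + \comppara{x-c}$.
\begin{itemize}
\item If $V = \Rd$, then $\compperp{x-c} = 0$ and $\comppara{x-c} = x-c$, so that
$\dSESC^2 = \mu \max \left(0, \|x-c\|^2 - \eta \frac{1}{n-1}\sum_{x_i\in D_\ell} \|x_i-c\|^2 \right)$,
that is $\mu$ times the spherical distance of Def.~\ref{def:sphericalcluster}.
\item If $V = \{0\}$, then $\comppara{x_i-c} = 0$ for every point, the $\max$ term
vanishes, and $\dSESC^2 = \|x-c\|^2$, the squared distance used by \kmeans.
\item As $\mu$ tends to $0$, $\dSESC^2$ tends to $\|\compperp{x-c}\|^2$, the
affine distance of Def.~\ref{def:subspacecluster}.
\end{itemize}

%%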

%% In the following, we focus on the algorithm to compute the cluster that best fits the data when the points of $D_\ell$ are known, before putting it in the context of a Lloyd-like algorithm to compute the clustering.

%% \subsection{Clustering algorithm}
%% %%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%
%% Otherwise, criteria such as the   Bayesian Information Criterion (BIC),
%% which assess the model based on its fit to the data and its complexity, can be used.
%% Yet another  criterion is the Akaike Information Criterion.

\section{Spherical Cluster Optimization}%% (3 pages?)}
\label{sec:sesc-opt}
%%i%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Spherical clusters: functional decomposition}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

In this section, we study problem \ref{pb:cluster-optim} for the spherical
cluster model (Def.  \ref{def:sphericalcluster}).  That is, for a
fixed data set $D_\ell$, we aim at minimizing
\begin{equation}
\label{eqn:cluster-optim-sph}
\min_{c \in \Rd} F_\eta(c),
\end{equation}
%%
with
%%
\begin{equation}
\label{eq:Feta}
\ifTWOCOLUMNS
\small
\else
\fi
\Feta{c} := \sum_{x_i \in D_\ell} \max \left(0, \| x_i-c \|^2 - \eta \frac{1}{n-1}\sum_{x_j\in D_\ell} \|x_j - c\|^2  \right).
\end{equation}
%%
To study the previous function, for each $x_i \in D_\ell$, let
\begin{equation}
\begin{cases}
f_{\eta,x_i}(c) := \max \left(0, \ftildeeta{c} \right),\\
\text{ with } \ftildeeta{c}  := \| x_i-c \|^2 - \eta \frac{1}{n-1}\sum_{x_j\in D_\ell} \|x_j - c\|^2.
\end{cases}
\end{equation}
Observe that
\begin{equation}
\label{eqn:fetasum}
\Feta{c} = \sum_{x_i \in D_\ell} f_{\eta,x_i}(c)
\end{equation}
We first analyze the sub-functions $\ftildeeta$ and $f_{\eta,x_i}$ in order to analyze the main function $\Feta$.




\subsection{Geometry of the sub-functions}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

The function $\Feta$ naturally decomposes into $f_{\eta,x_i}$s, which we study first.
In the sequel, we assume that (i) the set $D_\ell$ is fixed,  (ii) $x_i \in D_\ell$, and 
(iii) $0 < \eta < 1 - \frac{1}{n}$.

\begin{lemma}[Convexity of $\ftildeeta$] 
\label{lemma:cvxtildef}
    Given $\eta$ and $x_i$ defined as above, we have
    \begin{equation}
        \ftildeeta  \text{ is convex } \iff \eta \le 1 - \frac{1}{n}
    \end{equation}
    Moreover, 
    \begin{equation}
        \ftildeeta \text{ is strictly convex } \iff \eta < 1 - \frac{1}{n}
    \end{equation}
\end{lemma}
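%%

The computation behind Lemma~\ref{lemma:cvxtildef} can be sketched as
follows (the formal proof is in the SI; the symbol $I_d$ for the
identity matrix of $\Rd$ is our notation, and $n$ is taken to be the
number of points of $D_\ell$). Each squared norm $\|x - c\|^2$ has
Hessian $2 I_d$ with respect to $c$, so that
\begin{equation*}
\nabla^2 \ftildeeta{c}
= 2 I_d - \frac{\eta}{n-1} \cdot n \cdot 2 I_d
= 2 \left( 1 - \frac{\eta\, n}{n-1} \right) I_d .
\end{equation*}
This constant matrix is positive semi-definite iff $\eta \le 1 -
\frac{1}{n}$, and positive definite iff $\eta < 1 - \frac{1}{n}$.
Since the function is quadratic in $c$, these two conditions are
equivalent to convexity and strict convexity respectively.

%%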
%%
Studying the function $f_{\eta,x_i}$ benefits from the geometry of the
following {\em sink region} yielding a null cost (Fig. \ref{fig:sink-arrangement}):
%%
\begin{definition}
\label{def:sink}
The {\em sink region}  $B_{x_i}$ of   function $f_{\eta,x_i}$ 
is the locus of centers for which $x_i$ has a null
cost for this cluster, that is $B_{x_i} := f_{\eta,x_i}^{-1}\left(\{0\}\right)$.
%%
%% \begin{equation}
%% B_{x_i} := f_{\eta,x_i}^{-1}\left(\{0\}\right).   
%% \end{equation}
\end{definition}
%%
The following results from an elementary calculation:
%%
\begin{lemma}[Geometry of $B_{x_i}$] 
\label{lem:sink-geom}
The sink region is a non-empty closed ball of $\Rd$. 
Let $\eta' := \eta \frac{n}{n-1}$,
and   $\bar{x} = \frac{1}{n} \sum_{x_j \in D_\ell} x_j$.
%%
The center $c_{x_i}$ and radius $R_{x_i}$ of $B_{x_i}$ satisfy
\begin{equation}
\label{eq:sink-spec}
\begin{cases}
c_{x_i} &= \frac{x_i - \eta'\bar{x}}{1-\eta'},\\
R_{x_i} &= \bigl( \left\| \frac{x_i - \eta'\bar{x}}{1-\eta'} \right\|^2 - \frac{\|x_i\|^2 - \frac{\eta'}{n} \sum_{x_j \in D_\ell} \|x_j\|^2}{1-\eta'} \bigr)^{1/2}.  
\end{cases}
\end{equation}
\end{lemma}
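%%

The elementary calculation can be sketched as follows, with
$\eta' = \eta \frac{n}{n-1}$ and $\bar{x}$ the center of mass of
$D_\ell$ (our own write-up, which need not follow the SI line by line).
Expanding the squared norms and completing the square in $c$ gives
\begin{align*}
\ftildeeta{c}
&= (1-\eta') \|c\|^2 - 2 \langle c,\, x_i - \eta' \bar{x} \rangle
 + \|x_i\|^2 - \frac{\eta'}{n} \sum_{x_j \in D_\ell} \|x_j\|^2 \\
&= (1-\eta') \left( \| c - c_{x_i} \|^2 - R_{x_i}^2 \right).
\end{align*}
Since $1 - \eta' > 0$ under assumption (iii), one has $\ftildeeta{c} \le 0$
iff $\| c - c_{x_i} \| \le R_{x_i}$, which is the claimed ball. Expanding
$R_{x_i}^2$ further yields the equivalent expression
\begin{equation*}
R_{x_i}^2 = \frac{\eta'}{(1-\eta')^2}
\left( \| x_i - \bar{x} \|^2 + (1-\eta')\, \frac{1}{n} \sum_{x_j \in D_\ell} \| x_j - \bar{x} \|^2 \right),
\end{equation*}
which makes it apparent that $R_{x_i}^2 \ge 0$, with strict inequality
as soon as $D_\ell$ contains two distinct points.

%%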
%%
Sink regions make it possible to qualify the individual functions $f_{\eta,x_i}$ as follows:
%%
\begin{lemma}[Analysis of $f_{\eta,x_i}$]
\label{lemma:analysissubfunctions}
Given $x_i$ and $\eta$ defined as above, the function $f_{\eta,x_i}$
is continuous and convex. It vanishes on $B_{x_i}$, and its
restriction to $\overline{\Rd \setminus B_{x_i}}$ coincides with a non-zero
quadratic function of $\Rd$, and is strictly convex on every convex
subset of $\overline{\Rd \setminus B_{x_i}}$.
\end{lemma}





\subsection{Arrangement of hyper-spheres underlying the objective function}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%


%% Recall the observation of equation \ref{eqn:fetasum}:
%% \begin{equation*}
%%     \Feta (c) = \sum_{x_i \in D_\ell} f_{\eta,x_i}(c)
%% \end{equation*}

Our analysis of the sub-functions $f_{\eta,x_i}$ shows that they 
are continuous and {\em piecewise quadratic}: each is zero on $B_{x_i}$,
and coincides with the restriction of a strictly convex quadratic function on $\Rd
\setminus B_{x_i}$. Thus, $\Feta$ is also continuous, convex and
piecewise quadratic as a finite sum of such functions.
%%
Finding the optimal cluster center therefore requires
understanding the relationship between all sink regions.

\paramini{Arrangement and convexity.}
An {\em arrangement} of hyper-surfaces is a decomposition of 
$\Rd$ into equivalence classes of points, using their position
with respect to these hyper-surfaces \cite{halperin2017arrangements}.
%%
We apply this concept to the spheres bounding the sink regions (Lemma
\ref{lem:sink-geom}).
%%
Let $\partial B_{x_i} := \overline{B_{x_i}} \setminus
\overset{\circ}{B_{x_i}}$ be the sphere bounding the sink ball
$B_{x_i}$.
%%
For $x \in \Rd$ and $i \in \intrange{1}{n}$, consider the following
signature which states whether point $x$ lies outside/on/inside the
sphere $\partial B_{x_i}$:
\begin{equation}
\label{eq:Pi-sig}
    P_i (x) := \left\lbrace \begin{aligned}
            -&1 \text{ if } x \in \overset{\circ}{B_{x_i}} \\ 
            &0 \text{ if } x \in \partial B_{x_i} \\ 
            &1 \text{ if } x \notin B_{x_i}
    \end{aligned} \right.
\end{equation}
%%
The signature of a point $x$ is the length $n$ vector (one entry w.r.t. each sink defining ball):
%%
\begin{equation}
    \sigma(x) := (P_1(x), P_2(x), \dots, P_n(x))
    \label{eqn:signature}
\end{equation}
The signature defines an equivalence relation, where two points are
equivalent if they have the same signature. Denote $\mathcal{C}_1,
\dots, \mathcal{C}_p$ the equivalence classes, and for each
$\mathcal{C}_k$, denote $\mathcal{C}_{k,\ell}$ the different connected
components--if any.  The $\mathcal{C}_{k,\ell}$ are the so-called {\em
  cells} of the arrangement.  For the sake of simplicity, the
arrangement and its cells are simply denoted
%%
\begin{equation} 
\label{eqn:arrangement}
\calA = \{ \mathcal{C}_1, \dots, \mathcal{C}_p \}.
\end{equation}
%%
See Fig.~\ref{fig:sink-arrangement} for an illustration.  Note that
generically, $\tau+1 \leq d$ spheres in dimension $d$ intersect along
a sphere of dimension $l=d-(\tau+1)$.


\paramini{Combinatorial decomposition of $\Feta{c}$.}
We characterize a cell $\calC$ of the arrangement with the indices of
the non-zero sub-functions $f_{\eta,x_i}$ on that cell, that is
%% \begin{equation}
%% \supportcell \; : \; \left\lbrace 
%% \begin{aligned}
%%             \calA &\to \mathcal{P}( \intrange{1}{n} ) \\
%%             \calC &\mapsto \left\{ i \in \intrange{1}{n}  : \calC \cap B_{x_i} = \emptyset \right\}
%% \end{aligned} \right.
%% \end{equation}
\begin{equation}
\supportcell \; : 
\calC \mapsto \left\{ i \in \intrange{1}{n}  \text{ such that } \calC \cap B_{x_i} = \emptyset \right\}.
\end{equation}
%%
In the sequel, the center of mass of the points $x_i$ with $i\in
\supportcell{\calC}$ is denoted $\bar{x}_{\supportcell{\calC}}$.

We also obtain the following expression for the function $\Feta$ restricted to cell $\calC$:
\begin{lemma} 
\label{lemma:arrangementforpiecewisequadratic}
Function $\Feta[\calC]$ restricted to a cell $\calC$ of the arrangement reduces
to the quadratic form
\begin{equation}
\label{eqn:fetarestricted}
%%\forall \calC \in \calA, 
\Feta[\calC]{c} = \sum_{i \in \supportcell{\calC}} \ftildeeta{c}.
\end{equation}
\end{lemma}

\subsection{Strict convexity and optimization}
\label{sec:strict-convex-optim}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

Our proof that the spherical cluster optimization is well posed
establishes that function $\Feta$ is strictly convex on $\Rd$. The
proof (see SI) studies the variation of $\Feta$ along a line segment
$S=[x,y]\subset \Rd$. More specifically, the proof studies the
variation of $\Feta$ on the intersections between segment $S$ and the
cells of the aforementioned arrangement (Lemma \ref{lemma:arrangementforpiecewisequadratic},
SI lemma \ref{lemma:intersect}, Fig. \ref{fig:segment-in-arrangement}).
The complete analysis, exploiting the
combinatorial structure underlying $\Feta$, yields the strict
convexity of $F_{\eta}$ on $\Rd$:
%%
\begin{theorem}[Strict convexity of $\Feta$] \label{thm:strcvxf}
Let $D_\ell$ be a set of $n$ points of $\Rd$, with at least two
distinct points, $\eta > 0$ and $c$ the center of the spherical
cluster sought. The function
\ifTWOCOLUMNS
\begin{align}
\begin{split}
\Feta{c} :=\\ \sum_{x_i \in D_\ell} \max \left(0, \| x_i-c \|^2 - \eta \frac{1}{n-1}\sum_{x_j\in D_\ell} \|x_j - c\|^2  \right)
\end{split}
\end{align}
\else
\begin{equation*}
\Feta{c} := \sum_{x_i \in D_\ell} \max \left(0, \| x_i-c \|^2 - \eta \frac{1}{n-1}\sum_{x_j\in D_\ell} \|x_j - c\|^2  \right)
\end{equation*}
\fi
satisfies the following property:
\begin{equation}
\eta < 1 - \frac{1}{n} \implies \Feta \text{ is strictly convex on } \Rd
\end{equation}
Therefore, the optimization problem of
Eq. (\ref{eqn:cluster-optim-sph}) admits exactly one solution in
$\Rd$.
%% \ref{eqn:optimizationproblem-sphericalcluster} 
\end{theorem}
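%%

Strict convexity alone guarantees at most one minimizer. Existence
can be checked by a short coercivity argument, sketched here under the
assumption $\eta < 1 - \frac{1}{n}$ and with $\eta' = \eta
\frac{n}{n-1}$ as in Lemma~\ref{lem:sink-geom} (our own remark; see the SI for the
formal treatment). Since $\max(0,t) \ge t$ for any real $t$,
\begin{equation*}
\Feta{c} \;\ge\; \sum_{x_i \in D_\ell} \ftildeeta{c}
= \sum_{x_i \in D_\ell} \| x_i - c \|^2 - \frac{\eta\, n}{n-1} \sum_{x_j \in D_\ell} \| x_j - c \|^2
= (1 - \eta') \sum_{x_j \in D_\ell} \| x_j - c \|^2 .
\end{equation*}
The right-hand side tends to $+\infty$ as $\|c\| \to \infty$ because
$1 - \eta' > 0$. The function $\Feta$ is thus continuous and coercive,
hence attains its minimum, and strict convexity makes the minimizer
unique.

%%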


\paramini{Optimization.} Having established the strict convexity of $\Feta$, we address the
computation of its unique minimizer.

\paramini{Minimum of $\Feta[\calC]$ on a cell.} From Eq. (\ref{eqn:fetarestricted}), one obtains
the exact expression of the minimizer of the quadratic function $\Feta[\calC]$:
\begin{equation}
\nabla \Feta[\calC]{c} = 0 \iff 
c = \cstar{\calC} := \frac{\bar{x}_{\supportcell{\calC}} - \eta' \bar{x}  }{1 - \eta'}.
\end{equation}
%%
Unfortunately,  $\cstar{\calC}$  may not belong to $\calC$, the definition
domain of $\Feta[\calC]$. We present two
methods to solve our optimization problem.
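%%

The closed form for $\cstar{\calC}$ follows from a short gradient
computation, sketched here with $\eta' = \eta \frac{n}{n-1}$ and with
$m \ge 1$ the number of indices in $\supportcell{\calC}$ (the symbol $m$
is our notation, and the support is assumed non-empty). For a single
sub-function,
\begin{equation*}
\nabla \ftildeeta{c} = 2 (c - x_i) - 2 \eta' (c - \bar{x})
= 2 \bigl( (1-\eta')\, c - x_i + \eta' \bar{x} \bigr).
\end{equation*}
Summing over the $m$ indices of $\supportcell{\calC}$ gives
\begin{equation*}
\nabla \Feta[\calC]{c} = 2 m \bigl( (1-\eta')\, c - \bar{x}_{\supportcell{\calC}} + \eta' \bar{x} \bigr),
\end{equation*}
which vanishes iff $c = \cstar{\calC}$. When $\cstar{\calC}$ happens to
lie in a full-dimensional (open) cell $\calC$, the function $\Feta$
coincides with $\Feta[\calC]$ in a neighborhood of that point, so that
$\cstar{\calC}$ is a critical point of the convex function $\Feta$ and
therefore its global minimizer.

%%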

\paramini{Heuristic based on BFGS.}
The Broyden–Fletcher–Goldfarb–Shanno quasi-Newton method (BFGS) is
designed to be very efficient on twice differentiable functions: it
approximates the Hessian matrix from successive gradients, without any
matrix inversion (in contrast to Newton's method). When the gradient
is not given, it is estimated using finite differences.
%%
For a spherical cluster, the objective function $\Feta$ is not
differentiable (Fig. \ref{fig:Feta-mins}).
%%
However, it is known that BFGS  works well in practice for
non-differentiable functions \cite{lewis2012nonsmooth}--a fact
confirmed by our experiments.

\paramini{Exact algorithm.}  In non smooth optimization, a
point is critical iff the null vector belongs to its
subdifferential~\cite{clarke1997nonsmooth}.  The combinatorial
characterization of the objective function $\Feta$ given in Lemma
\ref{lemma:arrangementforpiecewisequadratic} makes it possible to compute its so-called
limiting subgradients and therefore its
subdifferential~\cite{clarke1997nonsmooth}.  It can therefore be
optimized using tools recently developed in the realm of non smooth
optimization \cite{griewank2013stable,griewank2016lipschitz} (Section
\ref{sec:nso}).

%% Although the BFGS optimizer works well in practice, computing the
%% exact center remains of interest.  Because the center may be located
%% on a cell of the arrangement of any dimension, an exact number type
%% (an algebraic number of degree 2) is needed to handle so-called exact
%% predicates \cite{cclt-dcska-09}. Indeed, an exact algorithm using
%% floating point numbers will inevitably be plagued by rounding
%% errors~\cite{kettner2008classroom}.  The reader is referred to SI
%% Section \ref{sec:exact-center} for the sketch of such an algorithm.

\section{SESC model: applications}
%%i%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Implementation}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

All algorithms are implemented in Python with NumPy. The core Python
classes are provided in the supporting information. The complete SESC
package (Python code, documentation, test suite) is integrated into a
full-fledged library containing more than 100 packages, and will be
released with the paper--to preserve anonymity.

Calculations were run on a Dell Precision 5480 equipped with 20 CPUs
of type Intel(R) Core(TM) i9-13900H and 32~GB of RAM, running Fedora Core 39.
%%
All calculations took at most a few seconds, so that running times
are not further documented.

\subsection{Geometric  analysis of point sets}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

\paramini{Rationale.}  We illustrate the ability of our SESC model to
provide insights on the geometry of a point set per se. In that case,
the SESC reduces to a spherical cluster, and the only relevant
parameter is $\eta$.  We also provide comparisons against the Distance
to the Measure statistic~\cite{biau2011weighted}, and the projection
median~\cite{durocher2017projection}.

\paramini{The \protHMM dataset.}  As an illustration, we use a
nontrivial dataset consisting of $N=1443$ protein sequences whose
biological function is unknown~\cite{vicedomini2022profileview}.  To
identify putative functions, each sequence is scored by $d=400$ Hidden
Markov Models (HMM) corresponding to major known protein functions,
yielding a $d$-dimensional point. Carbone et al. perform hierarchical
clustering on these points (Ward's method), yielding 16 clusters
(sizes ranging from 11 to 176) of sequences expected to have identical
functions.  We extend the geometric analysis of these clusters using our
SESC model, computing a dimension, a center, and a radius.

\paramini{SESC versus Distance to the Measure.}
The center/radius of the SESC is used
to identify cluster inliers and outliers.
%%
To assess the relevance of the radius, we perform a comparison with
the Distance To Measure statistic (DTM) of a point cloud, defined as
the average distance (we use the L2 norm) between a point and its $k =
\log n$ nearest neighbors \cite{biau2011weighted}.  To back up our
definition of outlier, we study the correlation between our SESC radii
and the median DTM, taken over all points in the cluster.  Varying the
hyper-parameter $\eta \in \{0.3, 0.5, 0.7, 0.9\}$, two observations stand out
(Fig. \ref{fig:cmp-DTM}): (i) the SESC radius increases with $\eta$, as
expected since outliers become inliers, and (ii) an
excellent correlation with the DTM statistic is retained. Correlations
slightly degrade for $\eta = 0.9$.
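
The DTM statistic as defined above can be sketched in a few lines of NumPy.
This is a minimal illustration, not our production code: the function name
\texttt{dtm} and the toy point cloud are hypothetical, and rounding $\log n$
to the nearest integer (with a minimum of one neighbor) is an assumption, since
the rounding rule is not specified above.

```python
import numpy as np


def dtm(points, k=None):
    """Average Euclidean distance from each point to its k nearest neighbors."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    if k is None:
        # Assumption: round log(n) to the nearest integer, at least 1.
        k = max(1, int(round(np.log(n))))
    # Pairwise distance matrix (n x n), computed by broadcasting.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    # Sort each row; column 0 holds the zero distance of a point to itself.
    dist.sort(axis=1)
    return dist[:, 1:k + 1].mean(axis=1)


# Toy 1D cloud (hypothetical data), with k fixed to 2 for readability.
cloud = np.array([[0.0], [1.0], [3.0]])
values = dtm(cloud, k=2)
median_dtm = np.median(values)
print(values, median_dtm)
```

The dense distance matrix costs $O(n^2 d)$ memory and time, which is
acceptable for clusters of a few hundred points; larger clouds would call for
a spatial index.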

\paramini{SESC versus high-dimensional medians.}
As a second analysis, we compare the SESC center against the
projection median, computed with
the second algorithm from \cite{durocher2017projection}. (NB: we use
$\eps=1/100$, resulting in using $\sim 3\times 10^4$ random lines in
dimension $d=400$.)
%%
The systematic inspection of our 16 clusters for various values of
$\eta$ calls for two observations illustrated on two specific clusters
of \protHMM (Fig.~\ref{fig:cmp-com-pm-center}).  For compact clusters
with few outliers, the Euclidean center of mass, the projection
median, and the SESC center are almost interchangeable.
%%
However, for less compact clusters, increasing the value of $\eta$ in
the SESC functional triggers a shift of the center to include more
points.
%%
For a quantitative analysis, for the $i$-th cluster, let $c_i$, $m_i$,
and $s_i$ be the center of mass, projection median, and SESC center,
respectively. Consider the ratios $r_i(SESC, COM) =
\vvnorm{c_i-s_i}/\text{median DTM}_i$ and $r_i(SESC, PM) =
\vvnorm{m_i-s_i}/\text{median DTM}_i$.  We compute the median values
of these ratios over the 16 clusters (Table \ref{tab:ratios}).
%%
It appears that the SESC center behaves as a parameterized
{\em center} of the point cloud, the parameter $\eta$ acting as a
trade-off between the compactness of the cluster and the number of
inliers.

\begin{table}[!ht] 
\begin{center}
\begin{tabular}{|l | c c |}
\hline
$\eta$ & $r_i(SESC, COM)$ & $r_i(SESC, PM)$\\
\hline
$0.3$ & $10^{-3}$ & $0.05$\\
$0.5$ & $0.06$ & $0.10$\\
$0.99$  & $1.62$ & $1.64$\\
\hline
\end{tabular}
\end{center}
\caption{{\bf \protHMM dataset: relative positions of the center of mass, projection median, and
SESC center, as a function of $\eta$.} Median values, over the 16 clusters, of the ratios
$r_i(SESC, COM)$ and $r_i(SESC, PM)$.}
\label{tab:ratios}
\end{table} 

\newcommand{\figwradius}{0.5}
\ifLONG
\begin{figure*}[htbp]
\begin{center}
\begin{tabular}{cccc}
\rotatebox{90}{$\eta=0.3$}& \includegraphics[width=\figwradius\linewidth]{fig-cmp-UoS/dtm-vs-sesc-radius-var0dot9-mu0dot3-eta0dot3.png} & \rotatebox{90}{$\eta=0.5$} & \includegraphics[width=\figwradius\linewidth]{fig-cmp-UoS/dtm-vs-sesc-radius-var0dot9-mu0dot3-eta0dot5.png}\\
%$\eta=0.3$ & $\eta=0.5$\\
\rotatebox{90}{$\eta=0.7$}& \includegraphics[width=\figwradius\linewidth]{fig-cmp-UoS/dtm-vs-sesc-radius-var0dot9-mu0dot3-eta0dot7.png} &\rotatebox{90}{$\eta=0.9$}&  \includegraphics[width=\figwradius\linewidth]{fig-cmp-UoS/dtm-vs-sesc-radius-var0dot9-mu0dot3-eta0dot9.png}\\
%$\eta=0.7$ & $\eta=0.9$
\end{tabular}
\end{center}
\caption{\small {\bf SESC radius versus median Distance To Measure: incidence of $\eta$.}
The cluster center is the solution of the optimization problem of Eq. (\ref{eqn:cluster-optim-sph}).
Labels correspond to the 16 cluster ids.}
\label{fig:cmp-DTM} 
\end{figure*} 
\fi


\begin{figure*}[htbp]% or !htb or H
\begin{center}
\begin{tabular}{ccc}
\includegraphics[width=0.3\linewidth]{fig-median/cluster_14-sescmu0dot30-sesceta0dot30-projection-plot.png}&
\includegraphics[width=0.3\linewidth]{fig-median/cluster_14-sescmu0dot30-sesceta0dot50-projection-plot.png}&
\includegraphics[width=0.3\linewidth]{fig-median/cluster_14-sescmu0dot30-sesceta0dot99-projection-plot.png}\\
%%
\includegraphics[width=0.3\linewidth]{fig-median/cluster_15-sescmu0dot30-sesceta0dot30-projection-plot.png}&
\includegraphics[width=0.3\linewidth]{fig-median/cluster_15-sescmu0dot30-sesceta0dot50-projection-plot.png}&
\includegraphics[width=0.3\linewidth]{fig-median/cluster_15-sescmu0dot30-sesceta0dot99-projection-plot.png}\\
$\eta=0.3$ & $\eta=0.5$ & $\eta=0.99$
\end{tabular}
\end{center}
\caption{{\bf Dataset \protHMM, 
comparison of the center of mass (COM), the projection median (PM), and the SESC center.}
Comparison on two clusters in dimension $d=400$: compact cluster (cluster 14, top row), 
less compact cluster (cluster 15, bottom row).
All data points and the three points of interest are projected onto the first two principal components
yielded by PCA in dimension $d=400$.
For the non compact cluster, increasing $\eta$ triggers the addition of new inliers, whence the shift
of the SESC center. See also Table \ref{tab:ratios}.}
\label{fig:cmp-com-pm-center} 
\end{figure*} 


\subsection{Clustering with Expectation-Maximization}
%%ii-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%-%

\paramini{Rationale.}  To fit a mixture of SESCs, each defined
in a space with its own intrinsic dimension, we use the SESC model
inside Expectation-Maximization.

\paramini{SESC optimization.}  Assume that the dimension $p$
of the target affine space for cluster $C_\ell$ is known.
%%
For the corresponding point set $D_\ell$, we need to minimize the sum
of distances given by Eq. (\ref{eqn:distance-sesc}). To do so, we use
an alternating strategy which optimizes in turn the cluster center
(initialized at the center of mass of $D_\ell$) and the affine
subspace.

To find the affine subspace at fixed center $c$, we aim at solving
\begin{equation}
\min_V \sum_{x \in D_\ell} \| (x-c)_{\perp} \|^2,
\end{equation}
%%
where $V$ ranges over the $p$-dimensional linear subspaces of $\Rd$ and
$(x-c)_{\perp}$ denotes the component of $x-c$ orthogonal to $V$.
We proceed as classically done in principal component analysis (PCA)
using an SVD, except that we work with points centered with respect to
$c$.  The vectors associated with the top $p$ singular values are then
returned.  To update the cluster center given the affine space, we use
the algorithm of Section \ref{sec:sesc-opt}.
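
The subspace update at fixed center can be sketched as follows. This is a
minimal NumPy illustration, assuming that data points are stored as rows; the
function names and the toy data are hypothetical and are not taken from our
package.

```python
import numpy as np


def best_subspace_through(points, center, p):
    """Orthonormal basis (p x d) of the p-dimensional subspace V minimizing
    the sum of squared orthogonal residuals of points - center."""
    shifted = np.asarray(points, dtype=float) - center
    # Rows of vt are right singular vectors, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(shifted, full_matrices=False)
    return vt[:p]


def squared_residuals(points, center, basis):
    """Sum of squared norms of the components of points - center orthogonal to V."""
    shifted = np.asarray(points, dtype=float) - center
    in_subspace = shifted @ basis.T @ basis
    return float(((shifted - in_subspace) ** 2).sum())


# Toy data (hypothetical): four points on a line through the chosen center.
center = np.array([1.0, 2.0, 3.0])
direction = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
points = center + np.outer([-2.0, -1.0, 1.0, 2.0], direction)

basis = best_subspace_through(points, center, p=1)
objective = squared_residuals(points, center, basis)
print(basis, objective)
```

Note that the points are shifted by the current center $c$ and not by their
center of mass, which is the only difference with standard PCA.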

We also need a stopping criterion controlling the evolution of the center
and of the affine space. For the former, we simply use the Euclidean
distance between centers at two successive iterations.  For the
latter, we use a Riemannian metric on the Grassmannian manifold
$Gr(p,d)$, \ie the set of $p$-dimensional subspaces of $\Rd$.  The
calculation requires an SVD and yields a sum of squared {\em principal
  angles} between the two subspaces~\cite{bonnabel2009riemannian}.
Using this criterion, we declare convergence of the cluster when both
the successive centers and the successive linear subspaces are at distance
less than a numerical tolerance $\varepsilon$ ($=10^{-4}$) from each other.
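
The subspace part of this criterion can be sketched with NumPy as follows.
This is an illustration under assumptions: both subspaces are given by
matrices with orthonormal columns, and the function name is hypothetical. The
cosines of the principal angles are the singular values of the product of the
two bases; the square root of the returned sum is the usual geodesic distance
on the Grassmannian.

```python
import numpy as np


def sum_squared_principal_angles(basis_a, basis_b):
    """Sum of squared principal angles between two subspaces.

    basis_a, basis_b: (d x k) arrays with orthonormal columns.
    """
    # Singular values of A^T B are the cosines of the principal angles.
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    # Clip to guard against round-off pushing cosines slightly above 1.
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    return float((angles ** 2).sum())


# Two orthogonal lines in the plane: a single principal angle of pi/2.
e1 = np.array([[1.0], [0.0]])
e2 = np.array([[0.0], [1.0]])
orthogonal_lines = sum_squared_principal_angles(e1, e2)

# The same plane of R^3 described by two different orthonormal bases.
plane = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
t = 0.3
rotation = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
same_plane = sum_squared_principal_angles(plane, plane @ rotation)
print(orthogonal_lines, same_plane)
```

Since the arccosine is ill-conditioned near 1, tiny angles are only resolved
up to about $10^{-8}$, which is well below the tolerance
$\varepsilon = 10^{-4}$ used above.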

\paramini{Model selection.}
We handle overfitting using the Bayesian Information Criterion (BIC)
to account for model complexity.
%%
Consider a SESC with $n_\ell$ points, which has $P = d \times
(\dim V + 1)$ parameters in total.  For this cluster, the BIC has
the simple expression given in Eq. (\ref{eq:bic-final}).  For a
mixture, we sum the BICs of the individual clusters.

\paramini{Overall algorithm.} Our final clustering algorithm is a
direct application of EM, using Lloyd iterations to iteratively (i)
fit the cluster model to the data, and (ii) assign each point to the
nearest cluster. 
%%
Given a predefined number of clusters, the initial model is obtained
in two steps. First, we apply \kmeanspp to partition the data set.
(NB: we run \kmeanspp with a novel seeding procedure explained in
Sec. \ref{sec:clustering}.) Second, we perform PCA on each cluster,
retaining as initial affine space the one spanned by the principal
directions accounting for a pre-defined fraction (0.9 in our experiments) of
the variance.
%%
We also  integrate inside the iterations a model selection scheme to
confirm or update the dimension of each cluster. 
We summarize the algorithm as follows:
\begin{itemizep}
\item Solve the clustering problem with fixed dimensions for the clusters
\begin{itemizep}
\item For each cluster, fit the cluster to the associated data
\item Assign each point to the nearest cluster, by computing the distances of 
Eq. (\ref{eqn:distance-sesc})
\item Repeat until one reaches a pre-defined number of steps or until convergence
\end{itemizep}
\item Select the best model for each cluster using the algorithm of SI Sect. \ref{sec:clustering}
\item If model selection changed some cluster, repeat; otherwise, terminate.
\end{itemizep}
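
The inner loop of the steps above (fit, then reassign, with fixed dimensions)
can be sketched as follows. This is a deliberately stripped-down stand-in,
essentially the classical K-subspaces iteration, and not our algorithm: the
distance of Eq. (\ref{eqn:distance-sesc}) is replaced by the plain orthogonal
distance to the affine subspace, each center is the center of mass rather than
the SESC center, and model selection is omitted. All names and the toy data
are hypothetical.

```python
import numpy as np


def fit_affine(points, p):
    """Center of mass and top-p PCA directions (p x d) of a point set."""
    center = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - center, full_matrices=False)
    return center, vt[:p]


def residual_norms(points, center, basis):
    """Distances from the points to the affine subspace center + span(basis)."""
    shifted = points - center
    return np.linalg.norm(shifted - shifted @ basis.T @ basis, axis=1)


def lloyd_affine(points, labels, dims, max_steps=50):
    """Alternate fitting and reassignment until the labels stop changing.

    Assumption: no cluster becomes empty during the iterations.
    """
    labels = np.asarray(labels).copy()
    for _ in range(max_steps):
        models = [fit_affine(points[labels == j], p) for j, p in enumerate(dims)]
        dist = np.column_stack([residual_norms(points, c, b) for c, b in models])
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels


# Toy data (hypothetical): two parallel segments, with one mislabeled point.
xs = np.arange(10.0)
points = np.vstack([np.column_stack([xs, np.zeros(10)]),
                    np.column_stack([xs, np.full(10, 5.0)])])
truth = np.repeat([0, 1], 10)
initial = truth.copy()
initial[10] = 0  # the point (0, 5) starts in the wrong cluster
labels = lloyd_affine(points, initial, dims=[1, 1])
print(labels)
```

In the full algorithm, the outer loop would then revisit the dimension of each
cluster by model selection and rerun this inner loop whenever a dimension
changes.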

\paramini{Simple mixtures.}  We
illustrate our algorithm on simple synthetic mixtures, comparing the
results against those yielded by (affine) sparse subspace clustering
\cite{elhamifar2013sparse,li2018geometric}.  For (A)SSC, we use the
code from
\url{https://github.com/abhinav4192/sparse-subspace-clustering-python},
running the function \codecx{SparseCoefRecovery} with $cst=1$, meaning
that the affine version is used. 

The first example involves three pairs of collinear
segments (Fig.~\ref{fig:cmp-UoS}(M1)).  Our method recovers the
clusters perfectly, while ASSC is confused by the collinearity.
%%
The second example features a segment piercing a disk (Fig.~\ref{fig:cmp-UoS}(M2)).
Our method recovers the two clusters perfectly, whereas ASSC has difficulties
due to the intersection.
%%
The third example consists of two tangent rectangles (Fig.~\ref{fig:cmp-UoS}(M3)).
Our algorithm provides different outputs, depending on the importance
of the distance to the affine space, determined by the value of $\mu$.
On this example, ASSC faces a leakage issue near the intersection.
%%
Our fourth example involves two intersecting rectangles (Fig.~\ref{fig:cmp-UoS}(M4)).
Here too, the value of $\mu$ determines whether the two 2D clusters are recovered.
ASSC, on the other hand, faces difficulties due to the intersection of the clusters.
%%
We note in passing that the correctness / {\em subspace
  detection} property of ASSC only holds under certain sufficient
conditions~\cite{soltanolkotabi2012geometric,li2018geometric}.  On the
above examples, the affine spaces of the clusters are not affinely
independent.  Moreover, for intersecting clusters in general and
extreme points (cluster convex hull vertices) in particular, if the sparse
optimization yields a nonnegative reconstruction of a point of a
cluster using points from other clusters, then, the subspace preserving
property does not hold~\cite[Thm IV.4]{li2018geometric}. 

%(That is the dimension of the affine hull of all points is not equal
%to the sum of dimensions minus the number of clusters minus one.)

\newcommand{\figwclu}{0.4}
\begin{figure*}[htbp]
\begin{center}
\begin{tabular}{c|c}
\includegraphics[width=\figwclu\textwidth]{fig-cmp-UoS/one-montage.pdf} & \includegraphics[width=\figwclu\textwidth]{fig-cmp-UoS/two-montage.pdf}\\
\hline\\
\includegraphics[width=\figwclu\textwidth]{fig-cmp-UoS/nine-montage.pdf} & \includegraphics[width=\figwclu\textwidth]{fig-cmp-UoS/ten-montage.pdf}
\end{tabular}
\end{center}
\caption{\small {\bf Clustering using SESCs vs Affine Sparse Subspace Clustering: comparison on simple mixtures.}
Model M1: six coplanar segments; M2: segment piercing a disk;
M3: square resting on a rectangle; M4: intersecting squares.}
\label{fig:cmp-UoS} 
\end{figure*}

\section{Outlook}
%%i%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\paramini{Contributions.}
Subspace clustering algorithms are powerful unsupervised techniques
providing information on the geometry of clusters, in addition to
identifying them. Affine and spherical clusters, based on two of the
simplest geometric models, have established themselves as models of choice.
However, such cluster models were hampered by two limitations: for
spherical clusters, the question of the uniqueness of the cluster
center; for affine clusters, the dimensionality selection to avoid
overfitting.  
%%
Our work settles both questions, and combines the
solutions to introduce spherical clusters embedded within affine
spaces of the appropriate dimension (SESC).
%%
In addition, our experiments show that the center of a fully
dimensional SESC provides a simple and useful notion of
parameterized high dimensional median.

\paramini{Further work.}
Despite its merits, our work leaves three important open questions.

First, since  our SESC model uses isotropic clusters embedded in affine
spaces,  extending the model to non-isotropic clusters and possibly
curved spaces would be worthwhile.  

Second, a careful complexity analysis of the SESC center calculation
(both exact and approximate with BFGS) would be extremely interesting,
to assess its merits with respect to the projection
median in particular, for which efficient randomized algorithms exist.

Third, encouraging results have been obtained when designing mixtures
of SESC instances, on simple 3D heterogeneous mixtures.  In the spirit
of mixture fitting using split/merge/delete operations on components,
a real challenge is to understand how to combine instances of SESC, so
as to obtain compact and informative mixtures. This question 
goes beyond the SESC model though, as it targets the general question
of complex mixture design.

\begin{comment}
From a practical standpoint, as shown by our preliminary experiments,
varying the number of clusters typically yields a small number of
models to be further analyzed to get insights into the data.  We
anticipate that these models will be of special interest for datasets
with complex mixtures in moderate dimension, and further work will
also consist of comparing the models obtained against Gaussian mixture
models yielded by Expectation-Maximization.

The ability to detect mixtures whose components are of different
dimensionality holds great promises, since generative models
exploiting the correlations associated with the affine spaces
identified may be designed.
\end{comment}








