%\documentclass{uai2022} % for initial submission
 \documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
%\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
%\usepackage{booktabs} % commands to create good-looking tables
%\usepackage{tikz} % nice language for creating drawings and diagrams
\input{preamble}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
%\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Clustering a Union of Linear Subspaces via Matrix Factorization and Innovation Search}





\author{Mostafa Rahmani \\
Amazon Prime Video, Seattle WA\\
mostrahm@amazon.com}
% Add affiliations after the authors











  
  \begin{document}
\maketitle

\begin{abstract}
  This paper focuses on the Matrix Factorization based Clustering (MFC) method, which is one of the few closed-form algorithms for the subspace clustering problem. Despite being simple, closed-form, and computationally efficient, MFC can outperform more sophisticated subspace clustering methods in many challenging scenarios. We reveal the connection between MFC and the Innovation Pursuit (iPursuit) algorithm, which was shown to outperform other spectral clustering based methods by a notable margin, especially when the spans of the clusters are close. A novel theoretical study is presented which sheds light on the key performance factors of both algorithms (MFC/iPursuit), and it is shown that both algorithms can be robust to notable intersections between the spans of the clusters. Importantly, in contrast to the theoretical guarantees of other algorithms, which emphasize the distance between the subspaces as the key performance factor, and without making the innovation assumption, it is shown that the performance of MFC/iPursuit mainly depends on the distance between the innovative components of the clusters.
\end{abstract}



\vspace{-0.1in}
\section{Introduction}
\vspace{-0.1in}
When data points lie in a single linear manifold, conventional techniques such as Principal Component Analysis (PCA) can be used to efficiently find the underlying low-dimensional structure~\citep{zhang2014novel,lerman2015robust}. 
However, in many applications, the data points may originate from multiple independent sources, and a union of manifolds can better model the data \citep{vidal2011subspace}.
The subspace clustering problem is to learn these low-dimensional manifolds, when they are linear subspaces, in a completely unsupervised way \citep{heckel2013robust,elhamifar2013sparse,DBLduallll,rahmani2017innovation,peng2016deep,lu2013correlation,feng2014robust,patel2013latent,wang2013provable,li2021provable,wang2016noisy,you2016scalable,ji2017deep,zhang2018cappronet,klys2018learning,menon2020subspace,jiang2018nonconvex,lipor2021subspace}. 


%One of the recently proposed methods for subspace clustering was Innovation Pursuit (iPursuit)   \citep{rahmani2017innovation,rahmani2017subspace} and it was shown that it can notably outperform the former approaches when the clusters are close to each other. %In this paper, we focus on analyzing and understanding  iPursuit  and we reveal its connection with a closed-form Matrix Factorization based Clustering (MFC) method. %Moreover, a randomized framework is presented which facilities the application of MFC/iPursuit to datasets with large number of data points.  

%\textcolor{red}{MFC}

\textbf{Summary of contributions:} 
This paper focuses on analyzing two  subspace clustering algorithms:  Matrix Factorization based Clustering (MFC) and Innovation Pursuit (iPursuit).
First, we reveal the underlying connection between them, and the presented analysis shows why they can notably outperform other spectral clustering based methods in challenging scenarios. The main contributions of this work can be summarized as follows. 

\noindent
$\bullet$ It is shown that iPursuit is equivalent to MFC if its $\ell_1$-norm based cost function is replaced with a quadratic cost function. Importantly, all the presented theoretical results are applicable to both algorithms. 

\noindent
$\bullet$ 
To the best of our knowledge, this paper presents the first comprehensive analysis of the MFC/iPursuit algorithms, and the presented analysis is not based on the restrictive innovation assumption used in \citep{rahmani2017innovation,rahmani2017innovationJ}. 
We establish deterministic and probabilistic sufficient conditions which guarantee that the adjacency matrix computed by MFC/iPursuit satisfies a defined quality requirement. Importantly, it is shown that, in contrast to most other clustering algorithms, whose performance depends on the distance between the subspaces, the performance of MFC/iPursuit mainly depends on the distance between the innovative components of the clusters. Accordingly, even if the spans of the clusters intersect heavily, MFC/iPursuit can still provably satisfy the performance requirement. 



%\noindent
%$\bullet$ The presented deterministic result is simplified via assuming  random models for the distribution of data points and the distribution of the subspaces and several  probabilistic guarantees are established. 


% \vspace{-0.1in}
\textbf{Notation and Definitions:}
%\vspace{-0.1in}
Given a matrix $\bA$, $\| \bA \|$ denotes its spectral norm, $\| \bA \|_F$ denotes its Frobenius norm, and $\| \bA \|_{p,1} = \sum_{i} \| \ba_i \|_p$ where $\ba_i$ denotes the $i^{th}$ column of $\bA$ and $\ba^i$ denotes the $i^{th}$ row of $\bA$. For a vector $\ba$, $\| \ba \|_p$ denotes its $\ell_p$-norm, $\ba(i)$ denotes its $i^{\text{th}}$ element, and $\ba[i:k]$  contains the elements of $\ba$ whose indexes are from $i$ to $k$.   The elements of matrix $\bY = |\bX|$ are equal to the absolute value of the elements of matrix $\bX$.
%The function orth$(\cdot)$ % is defined similar to the function orth$(\cdot)$ in MATLAB, which
%returns an orthonormal basis for the column-space of its matrix argument.
The subspace $\calU^{\perp}$ is the orthogonal complement of $\calU$. $\mathbb{S}^{M_1 - 1}$ indicates the unit $\ell_2$-norm sphere in $\mathbb{R}^{M_1}$. It is assumed that the data matrix $\bD \in \mathbb{R}^{M_1 \times M_2}$ can be represented as $\bD = \bU \Sigma \bV^T$ where $\bU \in \mathbb{R}^{M_1 \times r_d}$ is the matrix of left singular vectors, the diagonal matrix $\Sigma \in \mathbb{R}^{r_d \times r_d}$ contains the non-zero singular values, the columns of $\bV \in \mathbb{R}^{  M_2 \times r_d}$ are equal to the right singular vectors, $r_d$ is the rank of $\bD$, $M_2$ is the number of data points, and $M_1$ is the dimension of the ambient space. The subspace $\calS = \oplus_{i=1}^m \calS_i$ is equal to the direct sum of subspaces $\{\calS_i \}_{i=1}^m$ and $\text{dim}(\calS)$ denotes the dimension of $\calS$. Two adjacency matrices $\bA \in \mathbb{R}^{M_2\times M_2}$ and $\bB\in \mathbb{R}^{M_2\times M_2}$ are said to be equivalent when $\frac{\ba_i}{\| \ba_i\|_1} = \frac{\bb_i}{\| \bb_i\|_1}$ holds for all $1\le i \le M_2$. RHS means right-hand side and LHS means left-hand side. 

\textbf{Distance between subspaces:} Suppose $\bU_1 \in \mathbb{R}^{M_1 \times r}$ and $\bU_2 \in \mathbb{R}^{M_1 \times r}$ are orthonormal bases for $r$-dimensional subspaces $\calS_1$ and $\calS_2$, respectively. Two different notions are used to express the affinity between two subspaces. One measure is $\| \bU_1^T \bU_2 \|$. However, $\| \bU_1^T \bU_2 \|$ is always equal to 1 when $\text{dim}(\calS_1 \cap \calS_2) >0$. The other measure of affinity between two subspaces is $$\| \bU_1^T \bU_2 \|_{\sigma} = \sqrt{\frac{\sum_{i=1}^r \cos^2 \theta_i}{r}}$$ where $\{\theta_i \}_{i=1}^r$ are the principal angles between $\calS_1$ and $\calS_2$ \citep{soltanolkotabi2012geometric}. Note that $\|\bU_1^T \bU_2 \|_{\sigma} =1$ only when $\calS_1 = \calS_2$. %We say that $\calS_1$ and $\calS_2$ are strongly coherent with each other when the value of $\|\bU_1^T \bU_2 \|_{\sigma}$ is close to one.
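
\noindent
As a small illustration of the two measures (the subspaces below are chosen only for the sake of example), let $M_1 = 3$, $r = 2$, and
$$
\bU_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad
\bU_2 = \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 1 \end{bmatrix}, \quad
\bU_1^T \bU_2 = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} .
$$
The two planes share one direction, so the principal angles are $\theta_1 = 0$ and $\theta_2 = \pi/2$. Hence $\| \bU_1^T \bU_2 \| = 1$, which is the same value obtained for two identical subspaces, whereas $\| \bU_1^T \bU_2 \|_{\sigma} = \sqrt{(1 + 0)/2} = 1/\sqrt{2} \approx 0.71$ still reflects that the subspaces differ in one of their two dimensions.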

\vspace{-0.1in}
\subsection{Data Model}
\vspace{-0.1in}
Data Model 1 provides the details of the presumed model along with the definitions of the symbols used. To simplify the exposition and the analysis, it is assumed that the subspaces have equal dimension, the clusters contain equal numbers of data points, and a single subspace $\calS$ describes the intersection between the spans of the clusters. 

\begin{data model}
The data matrix $\bD \in \mathbb{R}^{M_1 \times M_2}$ can be written as $\bD = [\bD_1 \:,\: \bD_2 \:, ...\:, \bD_m] \bT$
where $\bT\in \mathbb{R}^{M_2 \times M_2}$
is an unknown permutation matrix. We define $\calS_i$ as the column space of $\bD_i$ and  $\calS_i \not\subset \calS_j$ and $\calS_j \not\subset \calS_i$ for any $i \neq j$. 
The dimension of all subspaces is equal to $r$ and there are $n$ data points in each cluster, i.e., $\bD_i \in \mathbb{R}^{M_1 \times n}$. The dimension of the intersection between subspaces is equal to $s$, i.e., $\text{dim} \left(\cap_{i=1}^m \calS_i\right) = s$ and we define subspace $\calS = \cap_{i=1}^m \calS_i$. In addition, $\calS_i \cap \calS_j=\calS$ for all $i \neq j$. The orthonormal matrix $\bU_i \in \mathbb{R}^{M_1 \times r}$ is a basis for $\calS_i$ and it can be written as $\bU_i = [\bS \:,\: \dot{\bU}_i]$ where orthonormal matrix $\bS \in \mathbb{R}^{M_1 \times s}$
is a basis for $\calS = \cap_{i=1}^m \calS_i$ and $\dU_i \in \mathbb{R}^{M_1 \times (r-s)}$ is a basis for $\calS_i \cap \calS^{\perp}$. The orthonormal matrix $\dU_i $ represents the component of $\calS_i$ which does not lie in $\calS$ and we call $\dot{\calS}_i = \text{span}(\dU_i) = \calS_i \cap \calS^{\perp}$ the innovative component of $\calS_i$.
% In addition, we define $\dot{\calS_{k_i}} = \text{span}(\dU_{k_i})$. 
Each data point $\bd_i$ which lies in $\calS_{k_i}$ can be represented as 
\begin{eqnarray}
\bd_i = \bS \alpha_i + \dU_{k_i} \beta_i \:,
\label{eq:alpha_beta}
\end{eqnarray}
where $\alpha_i \in \mathbb{R}^{s}$ and $\beta_i \in \mathbb{R}^{r-s}$. 
\end{data model}

In order to  represent the association of  each data point to its corresponding cluster, we define index $k_i$ such that $\bd_i \in \calS_{k_i}$. Matrix $\bD_{-k}$ includes all the columns of $\bD$ except the ones which lie in $\calS_k$. Matrices $\dD_j$ and $\bar{\bD}_j$ are defined as  $\dD_j = \dU_j^T \bD_j$ and $\bar{\bD}_j = \bS^T \bD_j$. 


\begin{algorithm}
\caption{Data Clustering Using iPursuit }
{
\textbf{Input.} The input is data matrix $\bD \in \mathbb{R}^{M_1 \times M_2}$.

\smallbreak
\textbf{1. Project data points on $\mathbb{S}^{M_1 - 1}$.}  Set $\bd_i$ equal to $\bd_i / \| \bd_i \|_2$ for all $1 \le i \le M_2$.

\smallbreak
\textbf{2. Direction search.}
Define $\bC^{*} \in \mathbb{R}^{M_1 \times M_2}$ as the optimal point of
%\begin{eqnarray}
$
\underset{ \bC}{\min} \: \:  \| \bC^T \bD \|_{1} \: \: \text{subject to} \: \: \text{diag}(\bC^T \bD) = \textbf{1} \:.
\label{opt:kolli}
%\end{eqnarray}
$
%where all the elements of $\textbf{1} \in \mathbb{R}^{M_2}$ are equal to 1.
\smallbreak

\textbf{3.} Define adjacency matrix $\bA= \big| \bC^T \bD \big|$.

\textbf{4.} Normalize the $\ell_1$-norm of each row of $\bA$ (i.e., normalize the degree to 1) and apply graph preprocessing steps (e.g., sparsifying the adjacency matrix $\bA$ by keeping a few dominant non-zero elements of each row).

\textbf{5.} Apply spectral clustering  to $\bA + \bA^T$.

\textbf{Output:} The identified clusters.

 }
\end{algorithm}


\vspace{-0.1in}
\section{Related Work}
\vspace{-0.1in}
Numerous approaches for subspace clustering were proposed in prior work including statistical-based approaches \citep{yang2006robust,stat1,stat2,rnc1}, spectral clustering based methods \citep{elhamifar2013sparse,liu2013robust}, the algebraic-geometric approach \citep{vidal2005generalized}, and iterative methods \citep{bradley2000k}.   Much of the recent research work on subspace clustering is focused on spectral clustering \citep{von2007tutorial} based methods \citep{dyer2013greedy,gao2015multi,elhamifar2013sparse,heckel2013robust,liu2013robust,rahmani2017subspace,soltanolkotabi2012geometric,wang2013provable,chen2009spectral,park2014greedy}.


The spectral clustering based algorithms are composed of two main steps and they only differ in the first step. First, an adjacency matrix is constructed via finding a neighborhood set for each data point and in the second step, the spectral graph clustering algorithm \citep{von2007tutorial} is applied to the learned adjacency matrix. 
For instance, Sparse Subspace Clustering (SSC)~\citep{elhamifar2013sparse} uses $\ell_1$-minimization to construct a  sparse adjacency matrix, Low-Rank Representation (LRR) \citep{liu2013robust} uses nuclear norm minimization to find the adjacency matrix, and 
the Thresholding based Subspace Clustering (TSC) method \citep{heckel2013robust} simply uses the inner-product between the data points to construct the adjacency matrix. In contrast to TSC, iPursuit \citep{rahmani2017innovation,rahmani2017subspace} utilizes the directions of innovation to measure the similarity between the data points. 
The Matrix Factorization based Clustering (MFC) method \citep{kanatani2001motion,costeira1998multibody,boult1991factorization} is a closed-form spectral clustering based method which utilizes the right singular vectors of the data to construct the adjacency matrix. 










\begin{algorithm}
\caption{Matrix Factorization based Clustering (MFC)}
{
\textbf{Input.} The input is data matrix $\bD \in \mathbb{R}^{M_1 \times M_2}$.

\smallbreak
\textbf{1. Project data points on $\mathbb{S}^{M_1 - 1}$.}  Set $\bd_i$ equal to $\bd_i / \| \bd_i \|_2$ for all $1 \le i \le M_2$.

\smallbreak
\textbf{2. SVD:} Compute $\bD = \bU \mathbf{\Sigma}\bV^T$ where the columns of $\bV \in \mathbb{R}^{  M_2 \times r_d}$ are equal to the right singular vectors.% and $r_d$ is the rank of $\bD$.


\textbf{3.} Define $\bA = \big| \bV \bV^T \big|$.

\textbf{4.} Similar to Step 4 in Algorithm 1. 

\textbf{5.} Similar to Step 5 in Algorithm 1. 

\textbf{Output:} The identified clusters.

 }
\end{algorithm}




\vspace{-0.1in}
\subsection{A Brief Overview of iPursuit (Algorithm 1)}
\vspace{-0.1in}
Suppose that data matrix $\bD$ follows Data Model 1. If the spans of the clusters satisfy Assumption \ref{asm:innnov}, then we say that the Innovation Assumption holds. 
\begin{assumption}
For each subspace $\calS_i$, we have $\calS_{i} \not\subset \underset{k\neq i}{\oplus} \calS_k$. 
\label{asm:innnov}
\end{assumption}
Define orthonormal matrix $\bP_i$ such that the column-space of $\bP_i$ is equal to $\calP_i = \underset{k\neq i}{\oplus} \calS_k$.
If the innovation assumption holds, then the rank of $(\bI - \bP_i \bP_i^T) \bU_i$ is greater than zero and we define $\Vec{\calS}_i$ as the column-space of $(\bI - \bP_i \bP_i^T) \bU_i$. The geometrical idea behind iPursuit is that a direction in $\Vec{\calS}_i$ is orthogonal to all the clusters except $\calS_i$, and this fact can be used to distinguish $\calS_i$ from the rest of the clusters. Specifically, for each $\bd_i$, \citet{rahmani2017innovation,rahmani2017subspace} proposed to find a direction in $\Vec{\calS}_{k_i}$ (dubbed the direction of innovation corresponding to $\bd_i$) as the optimal point of
\begin{eqnarray}
\underset{ \bc}{\min} \: \:  \| \bc^T \bD \|_1 \quad \text{subject to} \qquad \bc^T \bd_i = 1 \:.
\label{eq:intro_ip_1}
\end{eqnarray}
The motivation behind the design of (\ref{eq:intro_ip_1}) was that
 the direction of innovation corresponding to $\bd_i$  can be computed via looking for a vector which is orthogonal to the maximum number of data points. % while it has a non-zero inner-product value with $\bd_i$ and the $\ell_1$-norm in (\ref{eq:intro_ip_1}) was used as the best convex approximation to the $\ell_0$-norm. 
Although the innovation assumption was used to design iPursuit, it was numerically shown in \citep{rahmani2017subspace,rahmani2017innovation} that it is not essential to the performance of iPursuit.

The authors of \citep{rahmani2017subspace,rahmani2017innovation} presented an analysis of (\ref{eq:intro_ip_1}) which is limited to a two-cluster scenario and relies on the Innovation Assumption to prove that the optimal point of (\ref{eq:intro_ip_1}) lies in $\Vec{\calS}_{k_i}$. In contrast, the presented theoretical study (a) does not require
the innovation assumption, (b) guarantees a completely different requirement, (c) is the first thorough analysis of
MFC, (d) reveals the connection between iPursuit and MFC, and, importantly, (e) shows the importance
of the incoherence between the innovative components.

\vspace{-0.1in}
\section{Analyzing a Spectral Clustering Based Method}
\vspace{-0.1in}
Spectral clustering based algorithms differ in the way that they compute the adjacency matrix.  
Accordingly, we should define proper metrics with which we can determine how accurate/useful the estimated adjacency matrix is. 
The authors of \citep{soltanolkotabi2012geometric} used the number of false connections (any non-zero connection between two nodes/data-points that belong to different clusters) as a metric to assess the estimated adjacency matrix. However, graph clustering algorithms such as spectral clustering can yield an exact clustering of the data even if there is a significant number of false connections in the estimated adjacency matrix, provided that the estimated weights on the true connections are sufficiently stronger than the weights of the false connections. %In addition, some of the subspace clustering algorithms such as iPursuit , LRR \citep{liu2013robust}, and MFS,  do not estimate a sparse adjacency matrix as \citep{soltanolkotabi2012geometric} and one cannot use number of false connections to assess their estimated adjacency matrix. 
Therefore, in this paper,
we use the following criterion to assess the quality of an adjacency matrix and we analyze the subspace clustering algorithms to reveal if/how they satisfy Requirement 1. 

\begin{requirement}
Suppose $\bA \in \mathbb{R}^{M_2 \times M_2}$ is 
the estimated adjacency matrix.
We require all the columns of $\bA$ to satisfy
$$
 \frac{\kappa}{m-1} \| {\ba_i}_{\calI_i^{\perp}} \|_p^p < \| {\ba_i}_{\calI_i} \|_p^p  \:,  
$$
 where  $\calI_i = \{j \:\:\: | \:\: \: k_i = k_j \}$,   $\calI_i^{\perp} = \{j \:\:\: | \:\: \: k_i \neq k_j \}$,  ${k_i} = \argmax_{j} \| \bU_j^T \bd_i \|_2$, and ${\ba_i}_{\calI_i}$ contains the elements of $\ba_i$ whose indexes are in $\calI_i$.  \label{asm:sufficeint}
\end{requirement}



\noindent
The parameter $\kappa$ is chosen greater than 1 and it determines how well the adjacency matrix represents the clustering structure of the data. Evidently, the higher $\kappa$ is, the more challenging it is for a subspace clustering algorithm to satisfy Requirement \ref{asm:sufficeint}. In the following sections, we discuss the role of parameter $p$ and we analyze iPursuit and MFC such that they satisfy Requirement 1 with $p=1$ and $p=2$, respectively.  
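
\noindent
As a numerical illustration with hypothetical values, take $m = 3$, $p = 1$, and a column $\ba_i$ with $\| {\ba_i}_{\calI_i} \|_1 = 0.6$ and $\| {\ba_i}_{\calI_i^{\perp}} \|_1 = 0.4$. With $\kappa = 2$,
$$
\frac{\kappa}{m-1} \| {\ba_i}_{\calI_i^{\perp}} \|_1 = \frac{2}{2} \times 0.4 = 0.4 < 0.6 \:,
$$
so this column satisfies Requirement \ref{asm:sufficeint}, whereas with $\kappa = 4$ the left hand side becomes $0.8 > 0.6$ and the requirement is violated. In general, this column meets the requirement exactly when $\kappa < (m-1) \times 0.6 / 0.4 = 3$. Since there are $m-1$ other clusters, the requirement can be read as asking the weight placed on the true cluster to exceed $\kappa$ times the average weight placed on each of the other clusters.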

\begin{remark}
Even if $\bA$ satisfies Requirement 1 with a large $\kappa$, it does not necessarily mean that spectral clustering yields an exact clustering. Similarly, proving that $\bA$ does not contain any false connection (as in \citep{soltanolkotabi2012geometric}) also does not guarantee exact clustering. However, these measures are useful to assess how clearly the estimated $\bA$ represents the clustering structure. In addition, although Requirement 1 does not guarantee exact clustering by the spectral clustering step, it is very similar to the sufficient condition stated in \citep{ling2020certifying} to guarantee that the spectral clustering algorithm yields the exact clustering. 
Specifically, \citep{ling2020certifying} proves that if 
$$
\max_{i} \| {\ba_i}_{\calI_i^{\perp}} \|_1 < \frac{\min_{k} \gamma_{2}(\calL(\bA_k))}{4}\:,
$$
then the spectral clustering algorithm studied in \citep{ling2020certifying} yields an exact clustering, where $\gamma_{2}(\calL(\bA_k))$ is the second smallest eigenvalue of the graph Laplacian w.r.t. the $k^{\text{th}}$ cluster and $\bA_k \in \mathbb{R}^{n \times n}$.

%We discuss this matter further in the appendix. 
\end{remark}



\vspace{-0.1in}
\section{Theoretical Studies}
\vspace{-0.1in}
This section focuses on analyzing MFC/iPursuit and revealing the key factors in their performance. First, we discuss the underlying connection between iPursuit and MFC,
and this connection is utilized to analyze both algorithms using similar techniques. We refer the reader to \citep{rahmani2022provable} for the proofs of all the presented results. 
 In the following sections, we utilize the parameters defined below.
 \begin{definition}
Suppose $\bD$ follows Data Model 1. We define
$\Delta_{\min} = \min_{j} \{ \underset{\|\bu\| = 1 \atop \bu \in \calS_{j}}{\inf} \:\| \bu^T \bD_{j} \|_p^p \}_{j=1}^{m}$, $
 \dot{\Delta}_{\max} =  \max_{j} \{ \underset{\|\bu\| = 1 \atop \bu \in \mathbb{R}^{r-s}}{\sup} \|  \bu^T \dD_j \|_p^p \}_{j=1}^m$, 
$ \bar{\Delta}_{\max} = \max_{j} \{ \underset{\|\bu\| = 1 \atop \bu \in \mathbb{R}^{s}}{\sup} \|  \bu^T \bar{\bD}_j \|_p^p \}_{j=1}^m$, and $ \phi = \max_{j \neq t} \|\dU_t^T \dU_j \|$. 
In addition, when $y > x$, we define ${\sigma_{l}}(\frac{x}{y}, \delta) = \frac{x - 2\sqrt{x \log\frac{2M_2}{\delta} }}{y + 2\sqrt{ (y-x) \log \frac{2M_2}{\delta}} + 2 \log\frac{2M_2}{\delta} -  2\sqrt{x \log\frac{2M_2}{\delta} }} $ and ${\sigma_{u}}(\frac{x}{y}, \delta) = \frac{x + 2\sqrt{x \log\frac{2M_2}{\delta} } + 2 \log\frac{2M_2}{\delta}}{y + 2\sqrt{ x \log \frac{2M_2}{\delta}} + 2 \log\frac{2M_2}{\delta} -  2\sqrt{(y-x) \log\frac{2M_2}{\delta} }} $.
\end{definition}


The parameters $\Delta_{\min}$, $\dot{\Delta}_{\max}$, and $\bar{\Delta}_{\max}$ are similar to the permeance statistic \citep{lerman2015robust}, which indicates how well the data points are distributed inside the subspaces. 
For instance, when the columns of $\bD_i$ in $\calS_i$ are concentrated around a direction, the value of $\underset{\|\bu\| = 1 \atop \bu \in \calS_{i}}{\inf} \:\| \bu^T \bD_{i} \|_p^p$ is small in comparison to when the data points are uniformly distributed in $\calS_i$. 
Although the permeance statistic appears in the presented results, it does not necessarily mean that iPursuit and MFC require a uniform distribution of data points inside the subspaces; it appears because the sufficient conditions guarantee the performance under worst-case scenarios.  The parameter $\phi$ indicates how close the innovative components $\{\dot{\calS}_i \}_{i=1}^m$ are to each other. %In addition, when $x$ and $y$ are sufficiently large, the value of $\sigma_{l}(\frac{x}{y}, \delta)$ and $\sigma_{u}(\frac{x}{y}, \delta)$ are close to $\frac{x}{y}$.
  \begin{remark}
 It is important to note that $\phi$ only measures the affinity between the innovative components $\{\dot{\calS}_i \}_{i=1}^m$. In other words, even if two subspaces $\calS_i$ and $\calS_j$ heavily intersect such that $\|\bU_i^T \bU_j \|_{\sigma}$ is nearly equal to 1, $\|\dU_i^T \dU_j \|$ could be small if the innovative components are incoherent with each other. In the following results, it is shown that, in contrast to most subspace segmentation methods, whose performance depends on $\max_{j \neq t} \|\bU_t^T \bU_j \|_{\sigma}$, the performance of iPursuit and MFC mainly depends on the distance between the innovative components.
 \end{remark}
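
\noindent
The following toy configuration, which is an assumption made only for illustration, makes the remark concrete. Let $M_1 = r + 1$, $s = r-1$, let the columns of $\bS$ be the first $r-1$ standard basis vectors of $\mathbb{R}^{r+1}$, and let $\dU_1$ and $\dU_2$ be the $r^{\text{th}}$ and $(r+1)^{\text{th}}$ standard basis vectors, respectively. Then $\bU_1^T \bU_2$ is diagonal with $r-1$ ones and a single zero, so
$$
\| \bU_1^T \bU_2 \|_{\sigma} = \sqrt{\frac{r-1}{r}} \:, \qquad \| \dU_1^T \dU_2 \| = 0 \:.
$$
For $r = 10$, the first quantity is approximately $0.95$, i.e., the two subspaces are strongly coherent, while their innovative components are orthogonal.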


\vspace{-0.1in}
\subsection{The  Connection Between iPursuit and MFC}
\vspace{-0.1in}
The cost function of iPursuit (\ref{eq:intro_ip_1}) encourages the optimal direction $\bc_i^{*}$ to be orthogonal to the maximum number of data points.
% and in \citep{rahmani2017innovation,rahmani2017subspace}, it was shown that when $\calS_{k_i}$ carries an innovation with respect to $\underset{j \neq k_i }{\oplus} \calS_j$ (i.e., $\calS_{k_i} \neq \oplus_{j \neq k_i} \calS_j$), then the optimal point of (\ref{eq:main_ip}) is orthogonal to all the data points in $\bD_{-k_{i}}$ given that some sufficient conditions are satisfied.
If the innovation assumption (Assumption \ref{asm:innnov}) holds and $\bc_i^{*} \in \Vec{\calS}_{k_i}$ for all the data points, then $\bA = | \bD^T \bC^{*} |$ 
does not include any false connection. 
 However, in practice the innovation assumption is not essential and $\bA = |\bD^T \bC^{*}|$ can yield an accurate clustering of the data even if $| \bD^T \bC^{*}|$ is not a sparse matrix \citep{rahmani2017subspace,ling2020certifying}. 
 A direct conclusion is that it may not be essential to employ the $\ell_1$-norm in the cost function of (\ref{eq:intro_ip_1}). Accordingly, in this section, we investigate an iPursuit algorithm whose $i^{\text{th}}$ optimal direction is obtained as the optimal point of
\begin{eqnarray}
\underset{ \bc}{\min} \: \:  \| \bc^T \bD \|_2 \quad \text{subject to} \qquad \bc^T \bd_i = 1 \:.
\label{eq:ell2}
\end{eqnarray}
The following lemma shows that the iPursuit algorithm which employs the $\ell_2$-norm to compute the optimal directions is equivalent to MFC. 
\begin{lemma}
Define $\bC^{*}$ as the optimal point of
$$
\underset{ \bC}{\min} \: \:  \| \bD^T \bC \|_{2,1} \quad \text{subject to} \qquad \text{diag}(\bC^T \bD) = \textbf{1} \:,
$$
and define $\bA = \left| \bD^T \bC^{*}  \right|$. Then
$
\bA(i,j) = \frac{|{\bv^i}^T \bv^j|}{\| \bv^i \|_2^2} \:.
$
%where $\bV \in \mathbb{R}^{ M_2 \times r_d}$ is the matrix of right singular vectors of $\bD$.
\label{lm:sade}
\end{lemma}
\noindent
Lemma \ref{lm:sade} shows that iPursuit is equivalent to MFC when the $\ell_2$-norm is employed to compute the optimal vectors (note that the denominator of $\bA(i,j) = \frac{|{\bv^i}^T \bv^j|}{\| \bv^i \|_2^2}$ is the same for all the entries of a row and Step 4 of Algorithm 2 normalizes the degree of the nodes and keeps the dominant entries of each row). 
We leverage this connection between MFC and iPursuit to provide an analysis which is applicable to both algorithms. In the following theoretical results, $p$ appears as a parameter in the sufficient conditions. If $p=1$, the sufficient condition corresponds to iPursuit and if $p=2$, then the sufficient condition corresponds to MFC. 
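
\noindent
The closed form in Lemma \ref{lm:sade} can be motivated by the following short calculation, which is given only as a sketch (the complete proof is in \citep{rahmani2022provable}; the auxiliary vector $\mathbf{z}$ is introduced here purely for illustration). The cost $\| \bD^T \bC \|_{2,1}$ and the constraint are separable over the columns of $\bC$, so the $i^{\text{th}}$ column solves (\ref{eq:ell2}). The component of $\bc$ that is orthogonal to the column space of $\bD$ affects neither $\bc^T \bD$ nor $\bc^T \bd_i$, so we may write $\bc = \bU \Sigma^{-1} \mathbf{z}$ with $\mathbf{z} \in \mathbb{R}^{r_d}$. Using $\bD = \bU \Sigma \bV^T$ and $\bd_i = \bU \Sigma \bv^i$, we obtain
$$
\bc^T \bD = \mathbf{z}^T \bV^T \:, \qquad \| \bc^T \bD \|_2 = \| \mathbf{z} \|_2 \:, \qquad \bc^T \bd_i = \mathbf{z}^T \bv^i \:.
$$
Minimizing $\| \mathbf{z} \|_2$ subject to $\mathbf{z}^T \bv^i = 1$ gives $\mathbf{z}^{*} = \bv^i / \| \bv^i \|_2^2$ by the Cauchy-Schwarz inequality, and therefore
$$
\bD^T \bc_i^{*} = \bV \mathbf{z}^{*} = \frac{\bV \bv^i}{\| \bv^i \|_2^2} \:.
$$
The entry of $| \bD^T \bc_i^{*} |$ associated with $\bd_j$ is thus $\frac{|{\bv^i}^T \bv^j|}{\| \bv^i \|_2^2}$, which differs from the corresponding entry of $|\bV \bV^T|$ in Algorithm 2 only by a scaling that depends on $i$.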


%\begin{remark}
%This paper focuses on studying the underlying connection between iPursuit and MFC. Conducting theoretical comparison between iPursuit and MFC is beyond the scope of this paper, i.e., we do not provide an analysis which shows which algorithm to be chosen in a given scenario. 
%\end{remark}

\vspace{-0.1in}
\subsection{An Analysis for MFC and iPursuit}
\vspace{-0.1in}
\label{sec:analysis1}
 The following theorem provides a sufficient condition to guarantee that Requirement 1 is satisfied. The presented results are applicable to both iPursuit and MFC since it is assumed that $\bA = |\bD^{T} \bC^{*}|$ where 
  $\bC^{*}$ is obtained via solving 
\begin{eqnarray}
\underset{ \bC}{\min} \: \:  \| \bD^T \bC \|_{p,1} \quad \text{subject to} \qquad \text{diag}(\bC^T \bD) = \textbf{1} \:.
\label{eq:main_with_p}
\end{eqnarray}
%and the adjacency matrix is formed as $\bA = |\bD^{T} \bC^{*}|$. 
%In the rest of paper, $p$ is equal to 1 or 2. When $p=1$, the results correspond to iPursuit and when $p=2$, they correspond to MFC. 

\begin{theorem}
Suppose that $\bD$ follows Data Model 1 and  $\bA = |\bD^{T} \bC^{*}|$ where $\bC^{*}$ is the optimal point of (\ref{eq:main_with_p}). If
 \begin{eqnarray}
  \begin{aligned}
 & \min_{i} \frac{\| \beta_i \|_2^p}{\| \bd_i \|_2^p} \: \Delta_{\min} \ge \\
 & {\dot{\Delta}_{\max} \: \kappa \left( \frac{1}{\kappa + (m-1)} +  \frac{m-1}{\kappa + (m-1)} \phi^p \right) } \:,
  \end{aligned}
\label{eq:sfip}
\end{eqnarray}
then $\bA$ satisfies  Requirement 1. 
\label{thm: 1}
\end{theorem}

In contrast to former theoretical results, which require $\max_{j \neq t} \|\bU_t^T \bU_j \|_{\sigma}$ to be sufficiently small, the presented guarantee is concerned with $\max_{j \neq t} \|\dU_t^T \dU_j \|$. Note that $\max_{j \neq t} \|\dU_t^T \dU_j \|$ can stay small even if the subspaces have a high dimension of intersection (i.e., $\|\bU_t^T \bU_j \|_{\sigma}$ is nearly equal to 1). When $m$, the number of clusters, is large, the sufficient condition can be roughly simplified into 
$
\phi^p \le \frac{\Delta_{\min} }{\kappa\: \dot{\Delta}_{\max}} \: \min_{i} \frac{\| \beta_i \|_2^p}{\| \bd_i \|_2^p}\:,
$
which means that the higher the dimension of the intersection, the more separated the innovative components should be. The sufficient condition requires all the data points to have a sufficiently strong projection on the innovative component. %Moreover, Theorem \ref{thm: 1} suggests that if increasing $m$ does not increase $\phi$ and $\bD$ follows Data Model 1, increasing $m$ can help the algorithms to satisfy Requirement 1. Although it might sound counter intuitive, it is a correct prediction and we discuss this fact further in Section \ref{sec:with_innov} and Section \ref{sec:exp_m}. 
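
\noindent
The simplification can be traced as follows (a rough limiting argument with $\kappa$ held fixed, not a formal statement). As $m$ grows, $\frac{1}{\kappa + (m-1)} \to 0$ and $\frac{m-1}{\kappa + (m-1)} \to 1$, so that
$$
\kappa \left( \frac{1}{\kappa + (m-1)} + \frac{m-1}{\kappa + (m-1)} \phi^p \right) \longrightarrow \kappa \: \phi^p \quad \text{as } m \to \infty \:,
$$
and the right hand side of (\ref{eq:sfip}) approaches $\kappa \: \dot{\Delta}_{\max} \: \phi^p$. Dividing both sides by $\kappa \: \dot{\Delta}_{\max}$ gives the rough condition on $\phi^p$ stated above. For example, with the illustrative values $\kappa = 2$ and $m = 20$, the two weights are $1/21 \approx 0.05$ and $19/21 \approx 0.90$. The neglected constant term matters when $\phi^p$ is comparable to $1/m$.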

\vspace{-0.1in}
\subsection{Probabilistic Guarantees}
\vspace{-0.1in}
In this section, we simplify the result presented in Theorem \ref{thm: 1} in two steps.
First, we presume a random model for the distribution of the data points
and in the second step, we consider a random model for the generation of the subspaces. 
%to establish a new  condition which depends only on $r$, $s$, $m$, and $\phi$. Next, we assume a random model for the generation of the subspaces and 
We start with the first step as follows. 
\begin{assumption}
Each matrix $\bD_i \in \mathbb{R}^{M_1 \times n}$ is generated as $\bD_i = \bU_i \bG_i$ where the elements of $\bG_i \in \mathbb{R}^{r \times n}$ are sampled independently from $\calN(0, \frac{1}{\sqrt{r}})$. 
\label{asm:data_random}
\end{assumption}

\noindent
Assumption \ref{asm:data_random} ensures
that $\frac{\bd_i}{\| \bd_i \|_2}$ is 
uniformly distributed on $\mathbb{S}^{M_1 - 1} \cap \calS_{k_i}$.
Note that $\mathbb{E}[\| \bd_i \|_2^2] = 1$ and, to make the analysis easier, we do not normalize the $\ell_2$-norm of the data points in the following theorems.  In this section, we derive the guarantees for $p=2$; similar guarantees for $p=1$ can be established. 
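
\noindent
The normalization can be checked directly, under the reading (an interpretation adopted here because it is the one consistent with $\mathbb{E}[\| \bd_i \|_2^2] = 1$) that the second argument of $\calN(0, \frac{1}{\sqrt{r}})$ is the standard deviation, so that each entry of $\bG_i$ has variance $\frac{1}{r}$. Write $\bd_i = \bU_{k_i} \mathbf{g}_i$, where $\mathbf{g}_i \in \mathbb{R}^{r}$ is the corresponding column of $\bG_{k_i}$. Since $\bU_{k_i}$ is orthonormal,
$$
\mathbb{E}[\| \bd_i \|_2^2] = \mathbb{E}[\| \mathbf{g}_i \|_2^2] = r \times \frac{1}{r} = 1 \:.
$$
Moreover, because $\bU_{k_i} = [\bS \:,\: \dU_{k_i}]$, comparing with (\ref{eq:alpha_beta}) gives $\alpha_i = \mathbf{g}_i[1:s]$ and $\beta_i = \mathbf{g}_i[s+1:r]$, and hence
$$
\mathbb{E}[\| \alpha_i \|_2^2] = \frac{s}{r} \:, \qquad \mathbb{E}[\| \beta_i \|_2^2] = \frac{r-s}{r} \:.
$$
The ratio $\frac{r-s}{r}$ is therefore the expected fraction of the energy of a data point that lies in its innovative component.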

\begin{theorem}
Suppose $\bD$ follows Data Model 1, matrices $\{ \bD_i \}_{i=1}^{m}$ are generated as in Assumption \ref{asm:data_random}, and adjacency matrix $\bA$ is computed as in Theorem \ref{thm: 1} with $p=2$. If
 \begin{eqnarray}
  \begin{aligned}
& \left(\frac{n}{r} - {\eta_{\delta}}_r \right) \sigma_{l}\left( \frac{r-s}{r}, \delta \right) \ge \\
& \quad\quad \kappa \left(\frac{1}{\kappa+ m-1} + \phi^2 \frac{m-1}{\kappa +m -1}\right) \left( \frac{n}{r} + \frac{r-s}{r}{\eta_{\delta}}_{r-s} \right)
   \end{aligned}
 \end{eqnarray}
where ${\eta_{\delta}}_{x} = \max ( \frac{4 {z_{\delta}}_x }{3} \log \frac{2\:x\:m}{\delta} , \sqrt{4 \frac{n(x+3)}{x^2} \log \frac{2 x m}{\delta}} )$ and \textcolor{black}{${z_{\delta}}_x = 1 + 2 \sqrt{\frac{1}{x} \log\frac{2n m}{\delta} } + \frac{2}{x} \log\frac{2 n m}{\delta}$}, then  Requirement \ref{asm:sufficeint} with $p=2$ is satisfied with probability at least $1 - 5\delta$.
\label{thm:random_point}
\end{theorem}
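
\noindent
To see how the condition of Theorem \ref{thm:random_point} behaves when the number of clusters is large, the following back-of-the-envelope simplification may help. It is only a sketch: it holds $\delta$ fixed, treats ${\eta_{\delta}}_r$ and ${\eta_{\delta}}_{r-s}$ as negligible relative to $n/r$ (an assumption, since both depend on $m$ through a logarithm), and uses $\frac{1}{\kappa+m-1} \approx 0$ and $\frac{m-1}{\kappa+m-1} \approx 1$ for $m \gg \kappa$. Under these approximations, the condition reads
\begin{eqnarray*}
\begin{aligned}
& \frac{n}{r} \: \sigma_{l}\left( \frac{r-s}{r}, \delta \right) \ge \kappa \: \phi^2 \: \frac{n}{r} \\
& \quad \Longleftrightarrow \quad \phi^2 \le \frac{1}{\kappa} \: \sigma_{l}\left( \frac{r-s}{r}, \delta \right) \:,
\end{aligned}
\end{eqnarray*}
so the admissible coherency $\phi^2$ between the innovative components is governed by $\kappa$ and by the fraction $\frac{r-s}{r}$ of each subspace that is innovative.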

Theorem \ref{thm:random_point} reveals several interesting points about the requirements of the algorithms. First, it confirms our intuition about the relation between the dimension of the subspaces and the required number of data points: the sufficient condition states that $n/r$ should be sufficiently large to ensure that Requirement 1 is satisfied. When $n/r$ is sufficiently large, $(\frac{n}{r} - {\eta_{\delta}}_r)$ is nearly equal to $n/r$. Therefore, when $m$ is large, the sufficient condition roughly states that $\phi^2$ should be sufficiently smaller than $\frac{1}{\kappa}\frac{r-s}{r}$. In other words, Theorem \ref{thm:random_point} indicates that the higher the dimension of intersection, the more separable the innovative components should be. Next, we further simplify the sufficient condition by assuming a random model for the distribution of the subspaces.  
\begin{theorem}
Suppose $\bD$ and $\bA$ are generated as in Theorem \ref{thm:random_point} and    $\{\dot{\calS}_i \}_{i=1}^m$ and $\calS$ are chosen independently and uniformly at random. If
 \begin{eqnarray*}
  \begin{aligned}
& \left(\frac{n}{r} - {\eta_{\delta}}_r \right) \sigma_{l}\left( \frac{r-s}{r}, \delta \right) \ge\\
 & \kappa \left( \frac{n}{r} + \frac{r-s}{r}{\eta_{\delta}}_{r-s} \right)  \left(\frac{1}{\kappa+ m-1} + \frac{c_{{\delta}} (r-s)^2 }{M_1} \frac{m-1}{ \kappa + m -1} \right) 
   \end{aligned}
 \end{eqnarray*}
then Requirement \ref{asm:sufficeint} is satisfied with probability at least $1 - 6\delta$, where $c_{\delta} = 3 \max \left(1 ,  \sqrt{\frac{8  M_1 \pi }{(M_1 - 1)(r-s)}} , \sqrt{\frac{16 M_1 \log \frac{m r}{\delta} }{(M_1 - 1)(r-s) }} \right)$.% ,  and ${\eta_{\delta}}_x$ was defined in Theorem \ref{thm:random_point}.
\label{thm:full_random}
\end{theorem}
If we simplify the sufficient condition, Theorem \ref{thm:full_random} roughly states that  $M_1$ should be sufficiently larger than $\kappa r (r-s) \sqrt{\log m}$. The main reason is that the subspaces and their innovative components are generated uniformly at random, and the higher the dimension of the ambient space, the less coherent they are in expectation. % \textcolor{blue}{ Specifically, in the proof of Theorem \ref{thm:full_random}, it is shown that an upper-bound for the value of $\phi^2$ scales linearly with $\frac{(r-s)^2}{M_1}$ which shows that when $M_1$ increases, we can expect a lower value for $\phi$ (when the subspaces are generated randomly).}
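
\noindent
The dependence on $M_1$ in Theorem \ref{thm:full_random} can be traced with a short simplification, offered here as a sketch and not as a formal statement. Assume $m \gg \kappa$, assume $n/r$ is large enough that the ${\eta_{\delta}}$ terms are negligible, and read $\sigma_{l}(\frac{r-s}{r}, \delta)$ as approximately $\frac{r-s}{r}$ (all three are simplifying assumptions). The condition of Theorem \ref{thm:full_random} then becomes
\begin{eqnarray*}
\begin{aligned}
\frac{r-s}{r} \ge \kappa \: \frac{c_{\delta} (r-s)^2}{M_1} \quad \Longleftrightarrow \quad M_1 \ge \kappa \: c_{\delta} \: r (r-s) \:,
\end{aligned}
\end{eqnarray*}
where $c_{\delta}$ depends on $m$ only through the factor $\sqrt{\log \frac{m r}{\delta}}$ in the third argument of its maximum.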



\begin{remark}
The main purpose of the presented analysis is to demonstrate the key performance factors of the MFC/iPursuit algorithms and to show why they are notably robust to strong intersection between the spans of the clusters. To go further and use the theoretical results to compare MFC/iPursuit against other subspace clustering algorithms, one would need to analyze the other methods using the same criterion (Requirement 1). Although a complete comparison goes beyond the scope of this paper, Section \ref{sec:compTSC} presents an analysis of the TSC algorithm based on Requirement 1 to show why MFC can strongly outperform TSC even though their computational complexities are not much different. 
\end{remark}

\vspace{-0.1in}
\subsection{With the Innovation Assumption}
\vspace{-0.1in}
\label{sec:with_innov}
The innovation assumption (Assumption \ref{asm:innnov}) is not essential to the performance of MFC/iPursuit and we did not use it in any of the results presented so far. However, the innovation assumption can be utilized to establish \textcolor{black}{stronger} guarantees. 
In this section,  two theorems are presented whose only
difference from Theorem \ref{thm: 1} and Theorem \ref{thm:full_random} is that they assume that Assumption \ref{asm:with_inv} (stated below) holds. 
\begin{assumption}
It is assumed that $\bD$ follows Data Model 1 and  $\text{dim}(\dot{\calS_i} \cap \calP_i) = 0$ where $\calP_i =  \oplus_{k\neq i} \calS_k$.
\label{asm:with_inv}
\end{assumption}
Assumption \ref{asm:with_inv}
ensures that each innovative component  $\dot{\calS}_i$ is independent from the direct sum of all the other subspaces. The following theorem presumes that Assumption \ref{asm:with_inv} holds.
\begin{theorem}
\noindent
Suppose $\bD$ follows Assumption \ref{asm:with_inv},  define $\Vec{\calS_i}$ as the column space of $(\bI - \bP_i \bP_i^T) \bU_i$, define $\vec{\bU}_i$ as a basis for $\Vec{\calS_i}$, and assume $\bA = |\bD^{T} \bC^{*}|$ where $\bC^{*}$ is the optimal point of (\ref{eq:main_with_p}). If 
\begin{eqnarray}
\begin{aligned}
\min_{i}  \frac{\| \beta_i \|_2^p}{\| \bd_i \|_2^p}  \:\min_{i} \| \vec{\bU}_{k_i}^T \dot{\bU}_{k_{i}} \|_{m}^p \ge \frac{\kappa}{\kappa +m-1}   
\frac{\dot{\Delta}_{\max} }{\Delta_{\min}}\:,
  \end{aligned}
  \label{eq:withinnovsuff}
\end{eqnarray}
then  Requirement 1 is satisfied
where $\| \vec{\bU}_{k_i}^T \dot{\bU}_{k_{i}} \|_{m}$ is the minimum singular value of $\vec{\bU}_{k_i}^T \dot{\bU}_{k_{i}}$.
\label{thm:with_inv}
\end{theorem}

The subspace $\Vec{\calS_i}$ was defined as the projection of $\calS_i$ onto $(\oplus_{k \neq i} \calS_k)^{\perp}$, which is equivalent to the projection of $\dot{\calS}_i$ onto $(\oplus_{k \neq i} \calS_k)^{\perp}$. The closer $\dot{\calS}_i$ is to $\Vec{\calS_i}$, the more incoherent $\dot{\calS_i}$ is with the innovative components of the other clusters,
since $\vec{\calS_i}$ is orthogonal to $\oplus_{j\neq i} \dot{\calS_j}$. 
 This is the reason we have $\| \vec{\bU}_{k_i}^T \dot{\bU}_{k_{i}} \|_{m}$ on the LHS of (\ref{eq:withinnovsuff}): the quantity $$\| \vec{\bU}_{k_i}^T \dot{\bU}_{k_{i}} \|_{m} = \underset{\|\bu \|_2 =1}{\min} \| \vec{\bU}_{k_i}^T \dot{\bU}_{k_{i}} \bu \|_2$$ is a measure of coherence between $\dot{\calS}_i$ and $\vec{\calS}_i$. 
 Therefore, similar to Theorem \ref{thm: 1}, Theorem \ref{thm:with_inv} states that the weaker the projection of the data points onto the innovative components, the more distanced the innovative components should be. The major difference between the condition of  Theorem \ref{thm: 1} and that of Theorem \ref{thm:with_inv} is that in (\ref{eq:withinnovsuff}) $m$ plays a stronger role: (\ref{eq:withinnovsuff}) states that increasing $m$ (provided that it does not increase the coherency between the innovative components) can enhance the chance of MFC/iPursuit satisfying Requirement 1. The following theorem \textcolor{black}{provides a more explicit sufficient} condition by assuming the random data model used in Theorem \ref{thm:full_random}.
 
\begin{theorem}
Suppose $\bD$ and $\bA$ are generated as in Theorem \ref{thm:full_random} and assume that $M_1 > s + (r-s)m$. 
If 
 \begin{eqnarray}
  \begin{aligned}
& \sigma_{l} \left( \frac{r-s}{r} ,\delta  \right) \sigma _{l}\left( \frac{\vartheta}{M_1} , \delta \right) \ge \\
& \quad\quad\quad \frac{\kappa}{\kappa + m - 1} \frac{ \frac{n}{r} + \frac{r-s}{r} {\eta_{\delta}}_{r-s}}{\frac{n}{r} - {\eta_{\delta}}_r }
   \end{aligned}
 \end{eqnarray}
where $\vartheta = M_1 - \big(s + (r-s)(m-1)\big)$, then Requirement 1 with $p=2$ is satisfied with probability at least $1 - 6\delta - \epsilon$ where $\epsilon$ is the probability that the rank of $\bD$ is less than $s + (r-s)m$.
\label{thm:with_inv_random}
\end{theorem}
Note that Theorem \ref{thm:with_inv_random} does not need to explicitly presume that Assumption \ref{asm:with_inv} holds because when $M_1 > s + (r-s)m$, Assumption \ref{asm:with_inv} is satisfied with an overwhelming probability \citep{vershynin2010introduction}.  
The sufficient condition roughly states that when $n/r$ is  large enough, $\frac{r-s}{r} \: \frac{\vartheta}{M_1}$ 
should be sufficiently larger than $\frac{\kappa}{\kappa + m}$ to guarantee that the requirement is satisfied with high probability.
The value of $\frac{\vartheta}{M_1}$ increases when $M_1$ increases and it approaches 1 as $r_d/M_1$ decreases.
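
\noindent
For a concrete feel for these quantities, consider the illustrative values $r=10$, $s=8$, and $M_1=400$ together with a hypothetical choice $\kappa = 2$. The computation below reads $\sigma_{l}(x,\delta)$ as approximately $x$ and takes the ratio on the RHS of the condition of Theorem \ref{thm:with_inv_random} to be close to 1 (large $n/r$), so it is a sketch of the trend and not an evaluation of the exact condition:
\begin{eqnarray*}
\begin{aligned}
& m=2: \quad \vartheta = 400 - (8 + 2 \cdot 1) = 390 \:, \\
& \qquad \frac{r-s}{r} \: \frac{\vartheta}{M_1} = 0.2 \times 0.975 = 0.195 \:, \quad \frac{\kappa}{\kappa + m - 1} = \frac{2}{3} \approx 0.667 \:, \\
& m=10: \quad \vartheta = 400 - (8 + 2 \cdot 9) = 374 \:, \\
& \qquad \frac{r-s}{r} \: \frac{\vartheta}{M_1} = 0.2 \times 0.935 = 0.187 \:, \quad \frac{\kappa}{\kappa + m - 1} = \frac{2}{11} \approx 0.182 \:.
\end{aligned}
\end{eqnarray*}
In this sketch, the LHS barely changes with $m$ while the RHS drops by more than a factor of three, which is the mechanism by which a larger $m$ can make the condition easier to meet.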
%(the rank of $\bD$ in Theorem \ref{thm:with_inv_random} is equal to $s + m(r-s)$ with an overwhelming probability). 
%The main reason is that as $M_1$ increases, $\mathbb{E} [\max_{i \neq j} \|\dU_i \dU_j\|]$ decreases and the innovative components are less coherent with each other since they are distributed in a larger ambient space. 
%Note that the rank of $\bD$ in Theorem \ref{thm:with_inv_random} is equal to $s + m(r-s)$ with an overwhelming probability []. Therefore, when $\bD$ is a low rank matrix ($M_1$ is sufficiently large), $\frac{\vartheta}{M_1}$ is nearly equal to one  which means that as $M_1$ increases, the distance between $\dot{\calS}_i$ and $\vec{\calS}_i$ decreases in expedition. The main reason is that as $M_1$ increases, $\mathbb{E} [\max_{i \neq j} \|\dU_i \dU_j\|]$ decreases and the innovative components are less coherent with each other.  
%Therefore, if $M_1/r_d$ is sufficiently large, $n/r$ is large enough, and $\bD$ is generated as in Theorem \ref{thm:with_inv_random}, then $\frac{r-s}{r}$ should be sufficiently larger than $\frac{\kappa}{\kappa + m}$ to ensure that Requirement 1 is satisfied with high probability. 


%Therefore, the high is $M_1$ the more likely is to satisfy the sufficient condition and the reason is that as $M_1$ increases, the coherency between the innovative component of the subspaces decreases (i.e., the coherency between each $\dot{\calS}_i$ and its corresponding $\vec{\calS}_i$ increases). 

Theorem \ref{thm: 1}, Theorem \ref{thm:with_inv}, and Theorem \ref{thm:with_inv_random} indicate that if $\bD$ follows Data Model 1, then the larger the number of clusters, the more likely it is for MFC/iPursuit to satisfy Requirement 1, provided that increasing $m$ does not increase the coherency between $\{\dot{\calS}_i \}_{i=1}^m$. This fact might sound counterintuitive, but it is an accurate prediction. For instance, suppose that $\bD$ is generated as in Theorem \ref{thm:with_inv_random}, the first $n$ columns of $\bD$
lie in $\calS_1$, $n=200$, $r=10$, $s=8$, and $M_1=400$. Define 
$$
\ba_{\calS_1} = \frac{1}{n}\sum_{i=1}^{n} \ba_i $$
where $\ba_i$ is the $i^{th}$ column of $\bA$. Therefore, $\ba_{\calS_1}$ is the average of the first $n$ columns of $\bA$, which correspond to the data points in $\calS_1$. 
Figure \ref{fig:versus_m_matn} shows $\ba_{\calS_1}$ with different values of $m$ for the adjacency matrices computed by MFC and the TSC algorithm \citep{heckel2013robust} which computes $\bA = |\bD^ T \bD|$. 
Ideally, we should observe that the expected value of the elements of $\ba_{\calS_1}[1:n]$ is sufficiently larger than the expected value of the elements of $\ba_{\calS_1}[n:M_2]$. 
One can observe that when $\bA = | \bD^T \bD |$, 
the elements of $\ba_{\calS_1}[1:n]$ are hardly distinguishable from the elements of $\ba_{\calS_1}[n:M_2]$ with both $m=2$ and $m=10$.
 In contrast, when $\bA = | \bD^T \bC^{*} |$ and $m=10$, $\| \ba_{\calS_1}[1:n] \|_2$ is clearly larger than $\frac{1}{m-1} \| \ba_{\calS_1}[n:M_2]\|_2$. The last plot of Figure \ref{fig:versus_m_matn} shows the effect of $m$ on the quality of the computed adjacency matrix more clearly. Define 
parameter ${\kappa}^{'}$ as follows
\begin{eqnarray}
\begin{aligned}
{\kappa}^{'} = \frac{(m-1)\: \left\| \ba_{\calS_1}[1:n] \right\|_2^2 }{\left\| \ba_{\calS_1}[n : M_2] \right\|_2^2 } \:.
\end{aligned}
\label{eq:define_k_pr}
\end{eqnarray}
Parameter $\kappa^{'}$ shows how clearly the adjacency matrix separates the data points in $\calS_1$ from the other clusters. 
The last plot (first from right) shows ${\kappa}^{'} $ versus $m$ for both MFC and TSC. One can observe that ${\kappa}^{'} $ notably increases as $m$ increases when $\bA = |\bD^T \bC^{*}|$, 
which means that the quality of the estimated adjacency matrix  improves as $m$ increases. In sharp contrast, increasing $m$ has no clear positive or negative impact on the adjacency matrix computed by Algorithm 3.  
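
\noindent
To see why the factor $(m-1)$ appears in (\ref{eq:define_k_pr}), consider an idealized vector, introduced here only for illustration: every cluster contains $n$ data points (so that $M_2 = mn$), all entries of $\ba_{\calS_1}[1:n]$ equal some $a > 0$, and all remaining entries equal some $b > 0$ (the symbols $a$ and $b$ are hypothetical). Counting roughly $(m-1)n$ remaining entries,
\begin{eqnarray*}
\begin{aligned}
{\kappa}^{'} = \frac{(m-1)\: n \: a^2}{(m-1)\: n \: b^2} = \frac{a^2}{b^2} \:.
\end{aligned}
\end{eqnarray*}
In this idealization, ${\kappa}^{'}$ does not change with $m$ unless the per-entry contrast $a/b$ changes, so values of ${\kappa}^{'}$ obtained for different $m$ are comparable.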
%The main reason that $\bD^T \bC^{*}$ yields a better adjacency matrix (in terms of the metric defined in Requirement 1) is that as the number of clusters increases, 

It is important to note that the conclusion that the performance of MFC/iPursuit improves as $m$ increases is not a general rule. When $M_1$ is not sufficiently large, the distance between the subspaces (and the distance between their innovative components) decreases as $m$ increases, which degrades the performance of the algorithms.  
Moreover, in Theorem \ref{thm:full_random} and Theorem \ref{thm:with_inv_random}, the coherency between the subspaces decreases as $M_1$ increases because of the presumed model for the generation of the subspaces; it is not a general rule that $\phi$ decreases as $M_1$ increases. 



\begin{algorithm}
\caption{Inner-Product based Subspace Clustering \citep{heckel2013robust} (TSC Algorithm)}
{%\footnotesize
\textbf{Input.} The input is data matrix $\bD \in \mathbb{R}^{M_1 \times M_2}$.

\smallbreak
\textbf{1. Data Preprocessing.}  Normalize the $\ell_2$-norm of the columns of $\bD$, i.e., set $\bd_i$ equal to $\bd_i / \| \bd_i \|_2$ for all $1 \le i \le M_2$.

\textbf{2.} Define $\bA = | \bD^T \bD |$.

\textbf{3.} Similar to Step 4 in Algorithm 1. 

\textbf{4.} Similar to Step 5 in Algorithm 1. 

\textbf{Output:} The identified clusters.

 }
\end{algorithm}

\vspace{-0.1in}
\subsection{Comparison with  the TSC Algorithm}
\label{sec:compTSC}
\vspace{-0.1in}
In this section, we theoretically compare the TSC algorithm against MFC/iPursuit. 
 Both MFC/iPursuit and Algorithm 3 use the inner product as the kernel function to measure the similarity between data points. However, in sharp contrast to Algorithm 3, MFC/iPursuit computes the inner product between the directions of innovation and the data points, as opposed to the inner product between the data points themselves. In \citep{rahmani2017innovationJ,rahmani2017subspace} and in this paper, it is shown that this difference enables MFC/iPursuit to notably outperform TSC in most scenarios. 
To clarify the reason behind this performance difference, we provide a similar analysis for Algorithm 3 and compare the requirements of MFC/iPursuit against those of Algorithm 3. Although the presented theorems only include sufficient conditions (not necessary conditions), their comparison is insightful. 

\begin{theorem}
Suppose $\bD$ follows Data Model 1. If
 \begin{eqnarray}
\begin{aligned}
 & 1 \ge \kappa \max_{i} \left\{ \frac{\| \alpha_i \|_2^p}{\| \bd_i \|_2^p}  \right\}   \frac{\bar{\Delta}_{\max}}{{\Delta}_{\min}}  + \\
 & \qquad \quad\kappa \max_{i} \left\{ \frac{\| \beta_i \|_2^p}{\| \bd_i \|_2^p}  \right\}  \phi^p \frac{\dot{\Delta}_{\max}}{{\Delta}_{\min}} \:,
\end{aligned}
\label{eq:sftc}
\end{eqnarray}
then $\bA = |\bD^T \bD|$ satisfies  Requirement \ref{asm:sufficeint}. 
\label{thm:tsc_deter}
\end{theorem}
There are two terms on the RHS of the sufficient condition, and only the second term is weighted by $\phi$. Even in the best-case scenario where the innovative components are orthogonal to each other, i.e., $\phi = 0$, it may not be possible to satisfy the sufficient condition. For instance, suppose $s/r$ is nearly equal to one and assume that the elements of $\beta_i$ and $\alpha_i$ are sampled independently from $\calN(0,1)$. In this scenario, $\mathbb{E}\left[\frac{\|\alpha_i\|_2^2}{\|\bd_i\|_2^2}\right] = \frac{s}{r} \approx 1$ and it may not be possible to satisfy the sufficient condition even for $\kappa=2$. The main reason is that when $s/r$ is high, the inner-product values between data points in different clusters are high, no matter how well separated the innovative components are.  In sharp contrast to Algorithm 3, MFC/iPursuit utilize the inner product between the optimal directions and the data points to construct the adjacency matrix and, when $s/r$ is high, the optimal directions are strongly incoherent with $\calS$; this feature makes the role of the innovative components notably more significant. In order to make a more explicit comparison, we derive the sufficient condition for Algorithm 3 
while it is assumed that the data is generated as in Theorem \ref{thm:full_random}. The following theorem provides the result. 
\begin{theorem}
Suppose $\bD$ is generated as in Theorem \ref{thm:full_random} and $\bA = | \bD^T \bD |$.  If 
 \begin{eqnarray*}
  \begin{aligned}
&  \left(\frac{n}{r} - {\eta_{\delta}}_r \right) \ge  \kappa \:
\sigma_{u}\left( \frac{s}{r} , \delta \right) 
 \left(\frac{n}{r} + \frac{s}{r} \: {\eta_{\delta}}_s \right) + \\ &  \kappa\: \sigma_{u}\left( \frac{r-s}{r} , \delta \right)  \left(\frac{n}{r} + \frac{r-s}{r} \:{\eta_{\delta}}_{r-s} \right)\left( \frac{c_{\delta} (r-s)^2 }{M_1} \right)\:,
   \end{aligned}
 \end{eqnarray*}
then 
Requirement \ref{asm:sufficeint} with $p=2$ is satisfied with probability at least $1 - 9\delta$, where $c_{\delta}$ was defined in Theorem \ref{thm:full_random} and ${\eta_{\delta}}_x$ was defined in Theorem \ref{thm:random_point}.
\label{thm:sc_random}
\end{theorem}
The first term on the RHS of the sufficient condition of Theorem \ref{thm:sc_random} is the dominant term when $s$ is large. When there is a sufficiently large number of data points in the clusters ($n/r$ is  large enough), the sufficient condition roughly states that $\kappa \: \frac{s}{r}$ should be \textcolor{black}{sufficiently} smaller than $1$. However,  it is not feasible to satisfy this condition in many scenarios. For instance, if we choose $\kappa = 2$, then the sufficient condition can be satisfied only when $s/r < 0.5$. 
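
\noindent
The rough form of the condition of Theorem \ref{thm:sc_random} can be obtained as follows. This is a sketch that assumes $n/r$ is large enough for the ${\eta_{\delta}}$ terms to be negligible and reads $\sigma_{u}(x, \delta)$ as approximately $x$ (our simplification, not part of the theorem). Dividing both sides of the condition by $n/r$ gives
\begin{eqnarray*}
\begin{aligned}
1 \ge \kappa \: \frac{s}{r} + \kappa \: \frac{r-s}{r} \: \frac{c_{\delta} (r-s)^2}{M_1} \:.
\end{aligned}
\end{eqnarray*}
Since the second term is non-negative, $\kappa \frac{s}{r} \le 1$ is needed regardless of $M_1$. For example, with the illustrative values $r=10$ and $s=8$ and a hypothetical choice $\kappa=2$, we have $\kappa \frac{s}{r} = 1.6 > 1$, so this condition cannot be met no matter how incoherent the innovative components are.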

In summary, comparing the sufficient conditions suggests that, in sharp contrast to Algorithm 3, which fails when the spans of the clusters are close, MFC/iPursuit can effectively leverage the innovative components of the clusters: if these innovative components are sufficiently separable ($\phi$ is sufficiently small), MFC/iPursuit might successfully distinguish the clusters. 

\begin{figure*}[h!]
\begin{center}
\mbox{
\includegraphics[width=1.3in]{m2.eps}\hspace{-0.1in}
\includegraphics[width=1.3in]{m10.eps}\hspace{-0.1in}
\includegraphics[width=1.3in]{m2_tsc.eps}\hspace{-0.1in}
\includegraphics[width=1.3in]{m10_tsc.eps}
\hspace{-0.1in}
\includegraphics[width=1.3in]{versus_m.eps}
}
\end{center}
\vspace{-0.15in}
           \caption{The first 4 plots (from the left) show the elements of $\ba_{\calS_1} =  \frac{1}{n}\sum_{i=1}^{n} \ba_i$ for different numbers of clusters for MFC and Algorithm 3. The first $n=200$ data points lie in the first cluster, $r=10$, $s=8$, and $M_1 = 400$. The last plot shows parameter $\kappa^{'}$ defined in (\ref{eq:define_k_pr}) versus $m$. One can observe that in this experiment increasing $m$ improves the quality of the  adjacency matrix computed by MFC. }
    \label{fig:versus_m_matn}
\end{figure*}




\begin{figure*}[h!]
\begin{center}
\mbox{
\includegraphics[width=1.55in]{versus_m_20.eps}\hspace{-0.1in}
\includegraphics[width=1.55in]{versus_m_50.eps}\hspace{-0.1in}
\includegraphics[width=1.55in]{versus_m_300.eps}
\hspace{0.1in}
\includegraphics[width=1.55in]{versus_s.eps}
}
\end{center}
\vspace{-0.15in}
           \caption{First three plots from left: Clustering error versus the number of clusters for different values of $M_1$ where  $r=10$, $s=9$, and $n=100$. First plot from right: This plot demonstrates clustering error versus $s$. In this experiment, $M_1=40$, $r=10$, and $n=100$.}
    \label{fig:versus_m}
\end{figure*}


\vspace{-0.1in}
\section{Numerical Experiments}
\vspace{-0.1in}
This paper does not present a new clustering algorithm; its main focus is to provide a deep understanding and analysis of the MFC/iPursuit algorithms. We refer the reader to \citep{kanatani2001motion,costeira1998multibody,boult1991factorization,vidal2011subspace,rahmani2017innovation,rahmani2017innovationJ} for numerical studies of the MFC/iPursuit algorithms. The focus of the presented experiments is to demonstrate some of the features of the algorithms which were predicted by the presented theoretical studies. 
 For iPursuit, MFC, and TSC, the graph preprocessing step (Step 4 in Algorithm 1) was done as follows: for each column of $\bA$, the 8 largest elements were kept and the rest of the elements were set to zero.   Clustering error  is defined as $\frac{N_e}{M_2}$ where $N_e$ is the total number of misclassified data points. In the appendix, we have included a simple numerical experiment showing that exact clustering can be achieved if Requirement 1 holds even for small values of $\kappa$. 

\vspace{-0.1in}
\subsection{The Dimension of Intersection Between the Subspaces}
\vspace{-0.1in}
In the presented deterministic results (Theorem \ref{thm: 1} and Theorem \ref{thm:with_inv}), we observed that $\frac{\| \beta_i \|_2}{\| \bd_i \|_2}$ is an important factor in the performance of MFC/iPursuit, and in the probabilistic results this factor appeared as $\frac{r-s}{r}$.
The purpose of this experiment is twofold. First, we show that the accuracy of MFC/iPursuit degrades as $s$ increases (since $\frac{r-s}{r}$ decreases). Second, we show that MFC/iPursuit are notably robust against intersection between the spans of the clusters compared to most other methods. The first plot (from right) in Figure \ref{fig:versus_m} shows clustering error versus $s$ where in this experiment $M_1=40$, $r=10$, and $n=100$ (the number of evaluation
runs was 50). One can observe that the accuracy of MFC/iPursuit degrades as $s$ increases. However, both of them notably outperform the other methods when $s$ is high. The main reason is that, as the presented theoretical studies indicated, the performance of MFC/iPursuit mainly depends on the coherency between the innovative components $\{ \dot{\calS}_i \}_{i=1}^m$, while most other algorithms such as TSC require the spans of the clusters $\{ {\calS}_i \}_{i=1}^m$ to be sufficiently incoherent.  

\vspace{-0.1in}
\subsection{Number of Clusters}
\vspace{-0.1in}
\label{sec:exp_m}
In the theoretical results (Theorem \ref{thm: 1} and Theorem \ref{thm:with_inv}), it was shown that the quality of the adjacency matrix computed by MFC/iPursuit might improve when $m$ increases. Specifically, the theoretical results suggested that when data follows Data Model 1 and as long as increasing $m$ does not increase the coherency between $\{\dot{\calS}_{i}\}_{i=1}^{m}$, MFC/iPursuit can yield  a better adjacency matrix (an adjacency matrix with higher $ \min_i \frac{(m-1)\| {\ba_i}_{\calI_i} \|_p^p  }{ \| {\ba_i}_{\calI_i^{\perp}} \|_p^p }$)  if $m$ increases. 
 

The first three plots (from left) in Figure \ref{fig:versus_m} show clustering error versus $m$ for different values of $M_1$ where in this experiment $r=10$, $s=9$, and $n=100$ (the number of evaluation
runs was 50). One can observe that when $M_1 = 300$ and when $M_1=50$, the accuracy of MFC/iPursuit improves as $m$ increases, whereas when $M_1 = 20$, the accuracy degrades. The reason for this observation is that, as the theoretical results indicated, both the number of clusters and the coherency between the innovative components contribute to the performance of the algorithms. When $M_1$ is not sufficiently large, increasing $m$ increases the coherency between the innovative components, which degrades the performance of the algorithms. 

\subsection{Face Clustering}
%\textcolor{blue}{Face clustering is an interesting problem to test the subspace clustering algorithms with. } 
In this experiment, we use the Extended Yale B
dataset, which contains 64 images for each of 38 individuals
in frontal view under different illumination conditions \citep{lee2005acquiring}. In this dataset, since all the images were taken from the same \textcolor{black}{frontal pose}, the
faces corresponding to each subject can be approximated with
a low-dimensional subspace \citep{basri2003lambertian}. 
Thus, the images in this dataset can be modeled as a union of
linear subspaces. In this experiment, we created $\bD$ by vectorizing each image and using each image as a column of $\bD$. To reduce the run-time, we projected the data onto
the span of the first 500 left singular vectors. Define $\bs$ as the vector of the singular values of $\bD$ and define $\hat{\bs} = \frac{\bs}{\max_i \bs(i)}$. In MFC, we set the estimate of $r_d$ equal to the number of elements of $\hat{\bs}$ which are greater than 0.01.
Table \ref{tab:faces_result} shows the clustering error of the clustering algorithms (number of misclassified data points divided by the total number of data points). One can observe that the performances of MFC and iPursuit are close to each other since they employ similar tools to build the adjacency matrix. In addition, they notably outperformed the other approaches, and the main reason is that in this dataset the spans of the clusters are close to each other \citep{vidal2011subspace}. The presented theoretical results indicated that MFC/iPursuit could yield a high-quality adjacency matrix even if the spans of the clusters are close to each other because their performance mainly depends on the incoherency between the innovative components of the clusters. 
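
\noindent
The rank estimate used for MFC in this experiment can be written compactly as
\begin{eqnarray*}
\begin{aligned}
\hat{r}_d = \left| \left\{ i \: : \: \hat{\bs}(i) > 0.01 \right\} \right| \:,
\end{aligned}
\end{eqnarray*}
where $\hat{r}_d$ is our notation for the estimate. As a toy illustration with made-up singular values (not taken from the dataset), $\bs = (50, 10, 1, 0.4, 0.1)$ gives $\hat{\bs} = (1, 0.2, 0.02, 0.008, 0.002)$ and hence $\hat{r}_d = 3$.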

\begin{table}% [h]
\centering
\caption{Clustering error of different algorithms on the Extended Yale B dataset.  }
\begin{tabular}{|c  |c| c | c|c|c| }
\hline
    Algorithm    & iPursuit   &  MFC &  LRR & SSC &  TSC \\
  %  subjects    & DSC   &  SSC &  SSC-P & SCC &  TSC \\
\hline
   Clustering error   & 0.08   & 0.09 & 0.6 & 0.29 & 0.71 \\
\hline
\end{tabular}
\label{tab:faces_result}
\end{table}

\subsection{Requirement 1}
We discussed the fact that Requirement 1 indicates how clearly the estimated adjacency matrix represents the clustering structure of the data and that it is similar to the sufficient condition established in \citep{ling2020certifying}, which guarantees that the spectral clustering algorithm can yield exact clustering.  In this experiment, we assume that $m=4$ and $n=100$, which means $M_2 = 400$. In order to construct $\bA$, we sample each element of $\bA$ from a half-normal distribution and we normalize the elements such that $
 \frac{\kappa}{m-1} \| {\ba_i}_{\calI_i^{\perp}} \|_1 = \| {\ba_i}_{\calI_i} \|_1  \:,  
$ for all $1 \le i \le M_2$. Figure \ref{fig:app} shows the clustering error of the spectral clustering algorithm versus $\kappa$. One can observe that even a small value of $\kappa$ suffices for exact clustering in this experiment.  Although the minimum value of $\kappa$ for which we can guarantee exact clustering depends on the distribution of the elements of $\bA$, the experiment shows that (as the results in \citep{ling2020certifying} suggest) exact clustering can be achieved if the false connections are sufficiently weaker than the true connections. 
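
\noindent
One simple way to impose the normalization used in this experiment is sketched below; which block of each column is rescaled is our assumption, since any rescaling that meets the equality would serve. Starting from a column $\ba_i$ with half-normal entries, the in-cluster block is rescaled as
\begin{eqnarray*}
\begin{aligned}
{\ba_i}_{\calI_i} \leftarrow \frac{\kappa}{m-1} \: \frac{\| {\ba_i}_{\calI_i^{\perp}} \|_1}{\| {\ba_i}_{\calI_i} \|_1} \: {\ba_i}_{\calI_i} \:,
\end{aligned}
\end{eqnarray*}
after which $\frac{(m-1) \| {\ba_i}_{\calI_i} \|_1}{\| {\ba_i}_{\calI_i^{\perp}} \|_1} = \kappa$ holds exactly. With $m=4$, the constant $\frac{\kappa}{m-1}$ equals $\frac{\kappa}{3}$.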

\begin{figure}[h!]
\begin{center}
\mbox{
\includegraphics[width=0.35\textwidth]{app.eps}
}
\end{center}
\vspace{-0.15in}
           \caption{Clustering error versus parameter $\kappa$. }
\label{fig:app}
\end{figure}

\subsection{Subspaces with Different Intersections}
In the presented theoretical studies, we utilized  a single subspace $\calS$ to model the intersection between the span of clusters to derive succinct sufficient conditions. In practice, every pair of clusters could have different intersecting subspaces. 
Define $\bS_t \in \mathbb{R}^{M_1 \times 12}$
as a basis for a 12-dimensional subspace. 
In this experiment, we build the span of each cluster as 
$$\bU_i = [\bS_i, \: \dot{\bU}_i] $$
where the columns of $\bS_i \in \mathbb{R}^{M_1 \times 9}$ are sampled from the columns of $\bS_t$ randomly for each cluster. Therefore, each pair of clusters could have different intersecting subspaces. Figure \ref{fig:app2} shows clustering error versus number of clusters where $M_1 = 60$. One can observe that the MFC algorithm yields accurate clustering of the data even if a single subspace does not model the intersection between all the pairs of subspaces. %The performance of 
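
\noindent
The range of pairwise intersections produced by this construction can be bounded with a short counting argument, sketched here under the assumptions that the 9 columns of each $\bS_i$ are drawn without replacement and that the 12 columns of $\bS_t$ are linearly independent. Let $q_{ij}$ (our notation) denote the number of columns of $\bS_t$ shared by $\bS_i$ and $\bS_j$. Then
\begin{eqnarray*}
\begin{aligned}
9 + 9 - 12 = 6 \le q_{ij} \le 9 \:,
\end{aligned}
\end{eqnarray*}
so the spans of any two clusters share a subspace of dimension at least 6, while different pairs of clusters can share different subspaces.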


\begin{figure}[h!]
\begin{center}
\mbox{
\includegraphics[width=0.35\textwidth]{different_intersections.eps}
}
\end{center}
\vspace{-0.15in}
           \caption{Clustering error versus number of clusters. }
\label{fig:app2}
\end{figure}





\vspace{-0.1in}
\subsection*{Conclusion}
\vspace{-0.1in}
It was shown that iPursuit is equivalent to a closed-form matrix factorization based clustering algorithm if the direction search optimization problem is altered into a quadratic optimization problem. A novel analysis applicable to both algorithms was proposed, which showed that, in contrast to some of the other subspace clustering algorithms whose performance depends on the distance between the spans of the clusters, the performance of MFC/iPursuit mainly depends on the distance between the innovative components of the clusters. %Furthermore, we theoretically compared the TSC algorithm, which is a closed form method,  against MFC/iPursuit to explain why MFC strongly outperforms TSC in most cases while their computation complexities are similar. 


 \bibliography{bibfile}





\newpage
\appendix





\end{document}
