%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz,pgfplots,filecontents} % nice language for creating drawings and diagrams
\usepgfplotslibrary{groupplots}
\usepackage{color}
\input{math.tex}

\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\DeclareMathOperator{\asto}{\xrightarrow{\text{a.s.}}}
\DeclareMathOperator{\toind}{\xrightarrow{\mathcal{D}}}
\DeclareMathOperator{\mat}{Mat}
\DeclareMathOperator{\vect}{vec}
\DeclareMathOperator{\rank}{rank}
\DeclareMathOperator{\EE}{\mathbb{E}}
\DeclareMathOperator{\CPD}{CPD}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator{\var}{\mathbb{V}ar}
\DeclareMathOperator{\concat}{concat}
\DeclareMathOperator{\plog}{polylog}
\DeclareMathOperator{\supp}{sup}
%\usepackage{authblk}
\usepackage{comment}
\usepackage{algorithm}
\usepackage{algpseudocode}
%\usepackage[ruled, lined, linesnumbered, commentsnumbered, longend]{algorithm2e}
%\SetKwInOut{Input}{Input}
%\SetKwInOut{KwOut}{Output}
%\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\newtheorem{remark}[theorem]{Remark}
\usetikzlibrary{patterns}
\usepackage{url}
 \definecolor{mitred}{rgb}{0.78, 0.39, 0.07}
 \usepackage{pifont}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Learning from Low Rank Tensor Data:\\ A Random Tensor Theory Perspective}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<mohamed.seddik@tii.ae>?Subject=Your UAI 2023 paper}{Mohamed El Amine Seddik}{}}
%\author[1]{Mohamed El Amine Seddik}
\author[2]{Malik Tiomoko}
\author[2]{Alexis Decurninge}
\author[1]{Maxim Panov}
\author[3]{Maxime Guillaud}
%\author[3,1]{Further~Coauthor}
% Add affiliations after the authors
\affil[1]{%
    Technology Innovation Institute\\
    PO Box: 9639, Masdar City\\
    Abu Dhabi, UAE
}
\affil[2]{%
    Huawei Technologies France\\
    Paris, France
}
\affil[3]{%
Inria / CITI Laboratory\\
     6 avenue des Arts\\
     69621 Villeurbanne, France
  }
  
  \begin{document}
\maketitle

\begin{abstract}
Under a simplified data model, this paper provides a theoretical analysis of learning from data that have an underlying low-rank tensor structure in both supervised and unsupervised settings. 
For the supervised setting, we analyze a Ridge classifier (with a large regularization parameter) with and without knowledge of the low-rank structure of the data.
Our results quantify analytically the reduction in misclassification error achieved by exploiting the low-rank structure for denoising purposes, as opposed to treating data as mere vectors. 
We further provide a similar analysis in the context of clustering, thereby quantifying the exact performance gap between tensor methods and standard approaches which treat data as simple vectors.
\end{abstract}

\section{Introduction}
The current era of artificial intelligence tackles learning tasks by leveraging millions or even billions of data points.
These data lie in high-dimensional spaces and often come from multiple \textit{modes}, such as multiple modalities, multiple sensors, multiple sources, multiple types, and multiple (space, time, frequency, etc.) domains.
In other words, these data can naturally be seen as tensors, in which vectors and matrices are simply the 1-mode and 2-mode versions.

Tensors are a natural way to store data, and their inner geometric structure is richer than that of their one- and two-dimensional counterparts~\citep{landsberg2012tensors}.
In particular, unlike low-rank matrix factorization, low-rank tensor factorization is essentially unique under mild assumptions when the number of modes is at least three.
Their ubiquity in numerous applications makes them increasingly important~\citep{sun2014tensors}, leading to a growing interest in tensor data analysis in the statistical learning community.

A large part of previous work on tensor theory applied to machine learning problems assumes a low-rank representation of the input data~\citep{anandkumar2014tensor, kadmon2019statistical} and estimates this representation using the CANDECOMP/PARAFAC decomposition (CPD; \citealp{hitchcock1927expression}) as its main ingredient.
Indeed, the low-rank tensor structure is a natural sparsity hypothesis in the modeling of real data seen through high-dimensional inputs~\citep{koldabader2009}.
However, faced with tensor-structured data, a simple and commonly used approach consists in neglecting the tensor structure and reshaping them into a set of vectors, to which a classical machine learning algorithm is then applied.
In this work, we aim to analyze simple machine-learning methods and to quantify their exact theoretical performance when neglecting versus exploiting the low-rank structure, thereby measuring the theoretical gap between tensor methods and their vectorized counterparts.
%In this work, we challenge precisely this point by highlighting the fact that \textit{a considerable gain can be obtained by taking advantage of the low-rank tensor structure of the processed data rather than treating them as mere vectors; furthermore, we develop analytical tools allowing to quantify this gain on a simple statistical data model}.

In the literature, the low-rank tensor structure has been considered for example in tensor regression in a supervised setting~\citep{zhouIEEE2013} or clustering in an unsupervised setting~\citep{sunandli2019}.
The tensor structure has been shown to enhance the performance of learning models as a key ingredient of more complex learning architectures, e.g., for multi-modal data or multi-spectral images~\citep{liang2019learning, chen2020tensor}, or in the design of advanced neural network architectures, where the flattening operation in the fully connected layers of a Convolutional Neural Network is replaced by CPD-based operations~\citep{kossaifi2020tensor}.

On top of the performance gain shown by~\cite{kossaifi2020tensor}, the reduction of the number of parameters needed to describe the learned model is also significant.
Indeed, this reduction is easy to see when the data samples are order-$k$ tensors with, for example, a rank-one underlying structure. 
In this case, if the dimensions of the tensor are $p_1\times \cdots \times p_k$, the dimension of the parameter space can be significantly reduced from $\prod_{j=1}^k p_j$ to $\sum_{j=1}^k p_j$.
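As a quick illustration of this count (taking, purely as an example, $k=3$ and the tensor shape $(p_1, p_2, p_3) = (15, 30, 20)$ used in the experiment of Figure~\ref{fig_mf}), one gets
\begin{align*}
    \prod_{j=1}^3 p_j = 15 \cdot 30 \cdot 20 = 9000, \qquad \sum_{j=1}^3 p_j = 15 + 30 + 20 = 65,
\end{align*}
so that a rank-one parametrization requires more than a hundred times fewer parameters than an unstructured one.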

All this literature motivates the analysis of learning algorithms when processing low-rank tensor-structured data. To do so, we consider a simple framework where data are assumed to be low-rank tensors perturbed by additive noise. The proposed framework directly extends the fundamental setting of binary classification in the vector case~\citep{mignacco2020role,wang2022binary}. Then, based on tools from random tensor theory (recalled in Section 2 of the supplementary material), we characterize the theoretical performance of linear methods (in both supervised and unsupervised settings) with and without incorporating the knowledge of the low-rank structure. We show analytically that incorporating this knowledge considerably improves the performance of the studied methods, in particular when the number of training samples is limited or, equivalently, when data are high-dimensional. Thus, exploiting the structure of the data yields equivalent performance with far fewer samples.

In this work, we limit our attention to a simple, tractable framework where data are generated as rank-one tensors with additive Gaussian noise (see Section~\ref{sec_model}). The main contributions brought by this paper are two-fold:
\begin{enumerate}
  \item We first consider a supervised learning setting where we provide a theoretical analysis of a Ridge classifier with and without incorporating the low-rank tensor structure of the data; see Section~\ref{sec_supervised}. The results extend the known misclassification error of the vector case. Importantly, we show that exploiting the low-rank structure yields a significant improvement in classification performance, which we quantify.

  \item We also consider an unsupervised setting by analyzing a linear clustering approach and a low-rank tensor counterpart (Section~\ref{sec_unsupervised}). Our analysis provides the conditions under which efficient clustering is possible both theoretically and algorithmically. In passing, we precisely quantify the performance gap between linear and tensor-based clustering, thereby demonstrating the superiority of the latter in the considered setting.
\end{enumerate}
%
To the best of our knowledge, few works in the literature have focused on the \textit{exact characterization} of the performance of ML methods when processing tensor data with low-rank structures, even under our considered setting. This paper suggests new directions to fill this gap by leveraging recent advances in random tensor theory (RTT). We demonstrate how RTT allows for the exact characterization of the performance of the considered methods while confirming practical insights about learning from low-rank tensor data. In particular, our results highlight that \textit{it takes fewer training samples to achieve better performance when the low-rank tensor structure of the data is leveraged}.

\paragraph{Notations:} $[n]$ denotes the set $\{1, \ldots, n\}$. Scalars are denoted by lowercase letters as $a,b,c$. Vectors are denoted by bold lowercase letters as $\va,\vb,\vc$. Matrices are denoted by bold uppercase letters as $\mA,\mB,\mC$. Tensors are denoted as $\tA, \tB, \tC$. $T_{i_1,\ldots, i_d}$ denotes the entry $(i_1,\ldots, i_d)$ of the tensor $\tT$. The inner product between two order-$d$ tensors $\tA$ and $\tB$ is denoted $\langle \tA, \tB \rangle = \sum_{i_1,\ldots, i_d} A_{i_1\ldots i_d}B_{i_1\ldots i_d}$. The $\ell_2$-norm of $\tA$ is denoted $\Vert \tA\Vert = \sqrt{ \langle \tA, \tA \rangle }$. For any vectors $\vu_{1},\dots, \vu_{d}$, contractions of a tensor $\tA$ are denoted by $\tA(\vu_{1},\dots,\vu_d)=\sum_{i_1,\ldots, i_d} A_{i_1\ldots i_d}u_{1i_1}\dots u_{di_d}$. The notation $\bigotimes_{i=1}^k\vv_i$ stands for the tensor outer product between the vectors $\vv_1,\ldots, \vv_k$ with $[\bigotimes_{i=1}^k\vv_i]_{i_1\ldots i_k} = \prod_{j=1}^k v_{j i_j}$. $\mat_i(\tT)$ denotes the matrix obtained by unfolding the tensor $\tT$ w.r.t.\@ its $i$-th mode. $\tT\times_i \vu$ denotes the contraction of the tensor $\tT$ on the vector $\vu$ through mode $i$. $Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^\infty e^{-\frac{t^2}{2}}dt$ denotes the Gaussian tail function. $\asto$ stands for almost sure convergence and $\toind$ for convergence in distribution. $\sS^{d-1}$ stands for the unit sphere in $\sR^d$. We refer the reader to the supplementary material for definitions of tensor notations.


\section{Statistical data model}\label{sec_model}
Let us start from the classical prototypical model for the binary classification with the covariate $\vx \in \sR^{p}$ belonging to one of the two classes $\mathcal{C}_1$ or $\mathcal{C}_2$:
\begin{align}\label{eq_simple_model}
    \vx = (-1)^a \vmu + \vz \in \sR^{p}
\end{align}
with $a = 1$ for class $\mathcal{C}_1$ or $a = 2$ for class $\mathcal{C}_2$ (thus, class centroids are $\pm \vmu \in \sR^{p}$), and random noise $\vz \in \sR^{p}$. The optimal estimation procedures and rates in this model have been studied extensively in the literature. Recent work by~\cite{mignacco2020role} and~\cite{wang2022binary} showed that, asymptotically, the optimal misclassification error behaves as $Q\left(\frac{m}{\sigma}\right)$ with $m = \sqrt{\frac{n}{p}}\|\vmu\|^2$ and $\sigma = \sqrt{\frac{n}{p} \|\vmu\|^2 + 1}$, where $n$ is the sample size.

In this work, we aim to extend the fundamental result above to more complex tensor-structured data. Let the observed samples be $n$ independent tensors $\tX_1,\ldots,\tX_n$ each of order $k$ and of dimension $p_1\times \cdots \times p_k$. We denote the dimensions $p = \sum_{j=1}^k p_j$ and $P = \prod_{j=1}^k p_j$. We suppose that the $\tX_i$'s are distributed in two classes $\mathcal{C}_1$ and $\mathcal{C}_2$ (of cardinality $n_1$ and $n_2$ respectively -- that is $n = n_1 + n_2$), such that for $\tX_i\in \mathcal{C}_a$ with $a\in\{1, 2\}$,
%Let the observed samples be $n$ independent tensor-structured data $\tX_1,\ldots,\tX_n$ each of order $k$ and of dimension $p_1\times \cdots \times p_k$. We denote the dimensions $p = \sum_{j=1}^k p_j$ and $P = \prod_{j=1}^k p_j$. We suppose that the $\tX_i$'s are distributed in two classes $\mathcal{C}_1$ and $\mathcal{C}_2$ (of cardinality $n_1$ and $n_2$ respectively -- that is $n = n_1 + n_2$), such that for $\tX_i\in \mathcal{C}_a$ with $a\in\{1, 2\}$,
\begin{align}\label{eq_data_model}
    \tX_i = (-1)^a \bigotimes_{j=1}^k \vmu_j + \tZ_i\in \sR^{p_1\times \cdots \times p_k},
\end{align}
where $\tZ_i$ is a random tensor with i.i.d.\@ standard Gaussian entries, $\vmu_j \in \sR^{p_j}$ for $j\in [k]$ are independent from the $\tZ_i$'s and $\tM = \bigotimes_{j=1}^k \vmu_j$ stands for the outer product between all the $\vmu_j$'s. Here, the rank-1 tensor term represents the informative part of the data, while $\tZ_i$ models corruption by additive noise. In the context of supervised binary classification, we are further given a vector of labels $\vy\in \sR^n$ such that $y_i=-1$ for $\tX_i \in \mathcal{C}_1$ and $y_i=1$ for $\tX_i \in \mathcal{C}_2$. Importantly, the model for the vector case~\eqref{eq_simple_model} is a particular instance of the tensor model~\eqref{eq_data_model} with $k = 1$.

Note that in this formulation, the noise variance is assumed constant, and the difficulty of the classification problem is controlled by the between-class distance $\Vert \tM\Vert$. Specifically, when $\Vert \tM\Vert = 0$ the classification is impossible whereas when $\Vert \tM\Vert$ is very large the classification becomes trivial. 
We also highlight that the classical high dimensional statistical model corresponds to the case $k=1$, and we consider a more general setting by taking any $k\geq 1$.

We denote by $\tX=[\tX_1, \ldots, \tX_n]\in \sR^{p_1\times \cdots \times p_k\times n}$ the observed data tensor obtained by concatenating all the $\tX_i$ along the $(k+1)$-th mode of dimension $n$. In tensor form, $\tX$ can be written as
\begin{align}\label{eq_data_tensor}
    \tX = \tM \otimes \vy + \tZ,
\end{align}
where $\tZ=[\tZ_1, \ldots, \tZ_n]\in \sR^{p_1\times \cdots \times p_k\times n}$. Given the rank-one structure of the tensor mean $\tM$, the outer product $\tM \otimes \vy$ results in a rank-one tensor of order $k+1$. As such, the data tensor $\tX$ is a \textit{rank-one spiked random tensor model} of order $k+1$, where the signal part is $\tM \otimes \vy$ and $\tZ$ corresponds to the noise part. 
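
To make the link with the vectorized view explicit, one can unfold~\eqref{eq_data_tensor} along its last mode. The following is only a restatement of the model in matrix form, assuming that the unfolding $\mat_{k+1}(\cdot)$ and the vectorization $\vect(\cdot)$ use the same ordering of the entries:
\begin{align*}
    \mat_{k+1}(\tX) = \vy \, \vect(\tM)^\top + \mat_{k+1}(\tZ) \in \sR^{n\times P},
\end{align*}
whose $i$-th row is $\vect(\tX_i)^\top = y_i \vect(\tM)^\top + \vect(\tZ_i)^\top$, in agreement with~\eqref{eq_data_model} since $y_i = (-1)^a$ for $\tX_i \in \mathcal{C}_a$. In other words, the unfolded data form a rank-one matrix perturbed by Gaussian noise, and the additional rank-one structure of $\tM$ itself is no longer visible after vectorization.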

\begin{remark}[On the data model]
  Note that the RTT analysis presented below (following
 \citealp{seddik2021random}) extends trivially to a more general (rank-$r$) data model of the form $\sum_{i=1}^r \bigotimes_{j=1}^k \vmu_{j}^{(i)} + \tZ$ as long as the $\vmu_j^{(i)}$'s are orthogonal and $r = \mathcal{O}(1)$. On the other hand, for arbitrary $\vmu_j^{(i)}$'s, the analysis is non-trivial -- see the end of Section~\ref{sec_supervised} and the supplementary material for more details.
\end{remark}

Throughout the following sections, we assume a high-dimensional regime, i.e., the number of training samples $n$ scales linearly with the tensor dimensions $p_j$ while $\Vert \vmu_j \Vert$ remains constant.
\begin{assumption}[Growth rate]\label{ass_growth}
  For all $ j\in [k]$, $\frac{p_j}{n} = \mathcal{O}_n(1)$ and $\Vert \vmu_j \Vert = \mathcal{O}_n(1)$.\footnote{The notation $a = \mathcal{O}_n(1)$ means that $a$ converges to a constant not depending on $n$ as $n\to \infty$.}
\end{assumption}
%
This is a classical assumption in learning theory and random matrix theory~\citep{pennington2017nonlinear, louart2018random, ali2017improved, mai2018random, tiomoko2020large, seddik2021unexpected}, where the feature size scales linearly with the number of samples. Indeed, such scaling coincides with the case $k=1$ in Assumption~\ref{ass_growth}. Moreover, Assumption~\ref{ass_growth} is more realistic from the practical viewpoint in scenarios where a limited number of samples is available, contrary to classical statistical settings which assume that $p_j$ is fixed while $n\to \infty$.

%which yields that $\prod_{j=1}^k p_j$ must scale linearly with $n$ in the supposed case of tensor data. However, for $k\geq 2$, this requirement imposes a large number of training samples $n$ which might be difficult to achieve in practical settings. As such Assumption~\ref{ass_growth} is more realistic from the practical view point when dealing with tensor structured data.


\section{Main results}
\subsection{Supervised Setting}\label{sec_supervised}
Given the training data tensor $\tX$ in~\eqref{eq_data_tensor} and the corresponding labels vector $\vy$, a simple learning approach consists in reshaping $\tX$ into a data matrix $\mX\equiv \mat_{k+1}(\tX)\in \sR^{n\times P}$ with $P=\prod_{j=1}^k p_j$, and then training a Ridge classifier with some regularization parameter $\gamma \geq 0$, i.e.,
\begin{align}
    \min_{ \vw } \Vert \vy - \mX \vw \Vert^2 + \gamma \Vert \vw\Vert^2,
\end{align}
the solution of which can be written explicitly as $\vw^* = \left( \mX^\top \mX + \gamma \mI \right)^{-1} \mX^\top \vy $. Since the two classes corresponding to the data model in \eqref{eq_data_model} are only separable through their means ($-\tM$ and $\tM$) and have the same covariance, we study the Ridge classifier for $\gamma \gg \Vert \mX^\top \mX \Vert $, which we refer to as the \textit{$\infty$-Ridge classifier}\footnote{This classifier is known as the matched filter in some literature and is proven to be optimal for the model in \eqref{eq_data_model} when $k=1$, as stated by \cite{tiomoko2021pca}.}. Therefore, the $\infty$-Ridge classifier consists in projecting the data matrix $\mX$ on the labels $\vy$ as\footnote{The normalization by $\sqrt{np}$ is considered for convenience and does not affect the performance of the considered methods. Moreover, under Assumption~\ref{ass_growth}, the quantities $n$ and $p$ are of the same order, so this normalization is equivalent to the standard normalization by $n$.}
\begin{align}
    \vw = \frac{1}{\sqrt{n p }} \mX^\top \vy,
\end{align}
%{\color{red} change the term "$\infty$-Ridge classifier" and connection with ridge.}
%Given the training data tensor $\tX$ in~\eqref{eq_data_tensor} and the corresponding labels vector $\vy$, a basic learning approach~\citep{tiomoko2021pca} consists in reshaping $\tX$ into a data matrix $\mat_{k+1}(\tX)\in \sR^{n\times P}$ with $P=\prod_{j=1}^k p_j$, and then building a $\infty$-Ridge classifier\footnote{Note that the $\infty$-Ridge classifier corresponds to a classical ridge regression classifier when the regularization parameter is set to $\infty$.} with parameters $\vw \equiv \vect(\tW)\in \sR^P$ ($\tW\in \sR^{p_1\times \cdots \times p_k}$) as\footnote{The normalization by $\sqrt{np}$ is considered for convenience and does not affect the performances of the considered methods. Moreover, under Assumption~\ref{ass_growth} the quantities $n$ and $p$ are of the same order which is equivalent to the standard normalization by $n$.}
where we recall that $p = \sum_{j=1}^k p_j$, for which the decision function (for a new datum $\tilde{\tX}_i \in \mathcal{C}_a$) is given by $f_{\text{R}}(\tilde{\tX}_i) = \langle \vw, \vect(\tilde{\tX}_i)\rangle$. This is equivalent in tensor notations to
\begin{align}\label{eq_weights_tensor}
    f_{\text{R}}(\tilde{\tX}_i) = \langle \tW, \tilde{\tX}_i\rangle \,\mathop{\lessgtr}_{\mathcal{C}_2}^{\mathcal{C}_1}\, 0, \quad \tW \equiv \frac{1}{\sqrt{n p }} \tX \times_{k+1} \vy.
\end{align}
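The link between the Ridge solution $\vw^*$ and this projection can be sketched with a first-order expansion in $1/\gamma$ (a heuristic step, valid when $\gamma$ is much larger than the spectral norm of $\mX^\top \mX$):
\begin{align*}
    \vw^* = \frac{1}{\gamma} \left( \mI + \frac{\mX^\top \mX}{\gamma} \right)^{-1} \mX^\top \vy = \frac{1}{\gamma} \mX^\top \vy + \mathcal{O}\left( \frac{\Vert \mX^\top \mX \Vert \, \Vert \mX^\top \vy \Vert}{\gamma^2} \right).
\end{align*}
Since the decision rule only uses the sign of the decision function, the positive scaling factor ($\frac{1}{\gamma}$ or $\frac{1}{\sqrt{np}}$) plays no role, and only the direction $\mX^\top \vy$ matters, which involves the data solely through their vectorized form.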
%
As such, the $\infty$-Ridge classifier does not leverage the low-rank tensor structure of the underlying data model and treats the data as mere vectors. 
Our first result consists in characterizing the theoretical performance of the $\infty$-Ridge classifier for the data model in~\eqref{eq_data_tensor}:

\begin{theorem}[Performance of the $\infty$-Ridge classifier]\label{prop_gaussian_matched_filter}
  Under Assumption~\ref{ass_growth}, for $\tilde\tX_i \in \mathcal{C}_a$ with $a\in \{1, 2\}$ independent from the training set $\tX$, 
  \begin{align*}
    \frac{1}{\sigma} \left( f_{\text{R}}(\tilde{\tX}_i) - m_a \right) \toind \mathcal{N}(0, 1),
  \end{align*}
  where $m_a = (-1)^a \Vert \tM \Vert^2 \sqrt{\frac{n}{p}}$ and $\sigma = \sqrt{ \frac{n}{p} \Vert \tM \Vert^2 + \frac{P}{ p }}$. 
  Moreover, the misclassification error satisfies, with probability one,
  $\sP \left( (-1)^a f_{\text{R}}(\tilde{\tX}_i) < 0 \mid \tilde\tX_i \in \mathcal{C}_a \right) - Q\left( \frac{ \vert m_a\vert }{\sigma} \right) \to 0$.
\end{theorem}
\begin{proof}
  See supplementary material.
\end{proof}

\begin{figure*}[t!]
  \centering
  %\includegraphics[width=\textwidth]{figs/hist_mf_vs_cp.pdf}
  \input{tikz/mf.tex}
  \input{tikz/cpmf.tex}
  \caption{Theoretical versus empirical histogram of the decision function $f_{\text{R}}(\tilde{\tX}_i)$ for the $\infty$-Ridge classifier as per Theorem~\ref{prop_gaussian_matched_filter} (left) and for the Tensor-Ridge classifier as per Theorem~\ref{prop_gaussian_cp_based_matched_filter} (right). We considered $n=200$ training data ($n_1=n_2=100$) that are tensors of shape $(15, 30, 20)$, distributed as the rank-one tensor model in~\eqref{eq_data_model} with the $\vmu_j$'s being randomly sampled vectors from spheres such that $\Vert \tM \Vert = 3$.}
  \label{fig_mf}
\end{figure*}

Theorem~\ref{prop_gaussian_matched_filter} states that the performance of the $\infty$-Ridge classifier depends solely on $\Vert \tM \Vert$ and the dimension ratios $\frac{n}{p}$ and $\frac{P}{p}$. 
Note that in classical high-dimensional statistics (i.e., $k=1$), the ratios $\frac{n}{p}$ and $\frac{P}{p}$ remain of order one as $n\to \infty$, whereas in the tensor setting with $k\geq 2$, the dimension $P$ grows polynomially in $n$. Therefore, Theorem \ref{prop_gaussian_matched_filter} is more general since it captures the behavior of both regimes.
Moreover, since the data are mean-wise centered as per~\eqref{eq_data_model}, the optimal classification is obtained by taking the sign of the decision function, which is consistent with the theory since the optimal threshold is $\frac{m_1 + m_2}{2}=0$.
Figure~\ref{fig_mf} (left) provides a histogram of the decision function of the $\infty$-Ridge classifier and its theoretical density. 
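
A short computation, using only the expressions of Theorem~\ref{prop_gaussian_matched_filter}, makes this dependence concrete. The argument of the Gaussian tail function simplifies to
\begin{align*}
    \frac{\vert m_a \vert}{\sigma} = \frac{\Vert \tM \Vert^2 \sqrt{\frac{n}{p}}}{\sqrt{\frac{n}{p} \Vert \tM \Vert^2 + \frac{P}{p}}} = \frac{\Vert \tM \Vert^2 \sqrt{n}}{\sqrt{n \Vert \tM \Vert^2 + P}},
\end{align*}
which, for $k=1$ (so that $P = p$), coincides with the ratio $\frac{m}{\sigma}$ recalled in Section~\ref{sec_model}. As an illustrative example, with the parameters of Figure~\ref{fig_mf} ($n = 200$, $P = 15\cdot 30\cdot 20 = 9000$, $\Vert \tM \Vert = 3$),
\begin{align*}
    \frac{\vert m_a \vert}{\sigma} = \frac{9 \sqrt{200}}{\sqrt{1800 + 9000}} = \sqrt{1.5} \approx 1.22, \qquad Q(1.22) \approx 0.11,
\end{align*}
i.e., an asymptotic misclassification error of roughly $11\%$. When $P \gg n \Vert \tM \Vert^2$, the ratio behaves as $\Vert \tM \Vert^2 \sqrt{n / P}$, so for a fixed $\Vert \tM \Vert$ the formula suggests an error close to $\frac12$ unless $n$ is comparable to $P$.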
%Under Assumption~\ref{ass_growth}, the mean $m_a$ remains constant while the variance $\sigma$ increases due to the term $\frac{P}{p}$ as the dimension of data increases. This phenomenon highlights the drawback of flattening the input data and not exploiting the low-rank structure of the mean tensor $\tM$. 

\paragraph{Tensor-based approach:} To improve classification accuracy, one needs to retrieve the rank-one structure of $\tM$ from the data. This can be done by denoising $\tW$, which is a noisy version of $\tM$ (up to scaling), i.e., by replacing it with a low-rank tensor approximation. Precisely, from the definition of $\tW$ in~\eqref{eq_weights_tensor} and $\tX$ in~\eqref{eq_data_tensor},
\begin{align}
    \tW = \sqrt{\frac{n}{p}} \bigotimes_{j=1}^k \vmu_j + \frac{1}{\sqrt{p}} \tilde\tZ,
\end{align}
where $\tilde \tZ = \frac{1}{\sqrt{n}} \tZ \times_{k+1}\vy= \frac{1}{\sqrt{n}} \sum_{i=1}^n y_i\tZ_i$. Since the $\tZ_i$'s are independent with i.i.d.\@ standard Gaussian entries and $y_i \in \{-1, 1\}$, the normalized sum $\tilde \tZ$ is also a random tensor with i.i.d.\@ standard Gaussian entries.
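The decomposition above can be checked in one line, using only $\Vert \vy \Vert^2 = n$ (since $y_i \in \{-1, 1\}$) and the linearity of the contraction along mode $k+1$:
\begin{align*}
    \tW = \frac{1}{\sqrt{np}} \left( \tM \otimes \vy + \tZ \right) \times_{k+1} \vy = \frac{\Vert \vy \Vert^2}{\sqrt{np}} \tM + \frac{1}{\sqrt{np}} \tZ \times_{k+1} \vy = \sqrt{\frac{n}{p}} \tM + \frac{1}{\sqrt{p}} \tilde\tZ.
\end{align*}
Likewise, each entry of $\tilde\tZ$ has zero mean and variance $\frac{1}{n} \sum_{i=1}^n y_i^2 = 1$.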
\begin{remark}[On the data distribution]
  Note that for the above supervised learning setting, the Gaussianity assumption on the $\tZ_i$'s might be relaxed to any distribution with zero mean and unit variance, for which the entries of $\tilde\tZ$ are asymptotically standard Gaussian by the central limit theorem.
\end{remark}

$\tW$ has the form of a \textit{spiked random tensor model}, which has been studied by~\cite{seddik2021random}. In order to extract the hidden rank-one component of $\tW$, we consider its best rank-one approximation, which yields estimates of the mean components $\vmu_j$ (if the classes are separable, i.e., $\Vert \tM\Vert $ is large enough), and then replace the weights $\tW$ in the decision function by this rank-one approximation. Precisely, the best rank-one approximation of $\tW$ is obtained by solving the following problem
\begin{align}\label{eq_MLE}
  (\lambda^*, \{\vu_i^*\}_{i=1}^k) = \argmin_{\lambda\in \sR^+, \vu_i\in \sS^{p_i - 1}} \Vert \tW - \lambda \bigotimes_{i=1}^k \vu_i \Vert_{\text{F}}^2,
\end{align} 
which corresponds to the maximum likelihood estimator (MLE). Computing the above MLE is NP-hard in the worst case~\citep{hillar2013most}. However, it is possible to compute consistent estimates of the rank-one components of $\tW$ in polynomial time, using tensor SVD\footnote{SVD applied to the unfolded tensor.} \citep{arous2021long, seddik2021random} or tensor power iteration (Algorithm~\ref{alg:tensor_power_iteration}) initialized with tensor SVD~\citep{auddy2021estimating}. The latter was shown to yield a more accurate estimate of the rank-one tensor $\tM$, provided that the separation between the class-wise means, measured by $\Vert \tM \Vert$, is larger than $ \mathcal{O}\left( P^{\frac14} / p^{\frac12} \right)$, as proved by~\cite{seddik2021random} and~\cite{auddy2021estimating}.

\begin{algorithm}[t!]
  \caption{Tensor Power Iteration~\citep{anandkumar2014tensor}}\label{alg:tensor_power_iteration}
  \begin{algorithmic}
    \Require An order $k$ tensor $\tW \in \sR^{p_1\times \cdots \times p_k}$ and initialization components $\vu_1^0, \cdots, \vu_k^0$.\\
    \hspace*{-.4cm}\textbf{Output:} Rank-one approximation of $\tW$.
    \State $(\vu_1, \cdots, \vu_k) \leftarrow (\vu_1^0, \cdots, \vu_k^0)$
    \While{not converged}
    \For{$i\in[k]$}
    	\State $ \vu_i \leftarrow \frac{\tW (\vu_1, \ldots, \vu_{i-1}, :, \vu_{i+1}, \ldots, \vu_k) }{\Vert \tW (\vu_1, \ldots, \vu_{i-1}, :, \vu_{i+1}, \ldots, \vu_k) \Vert }$
    \EndFor
    \EndWhile
  \end{algorithmic}
\end{algorithm}
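
For concreteness, the update of Algorithm~\ref{alg:tensor_power_iteration} can be written out in the special case $k=3$ (an illustrative instance, with entries indexed as in the Notations paragraph):
\begin{align*}
    \left[ \tW(:, \vu_2, \vu_3) \right]_{i_1} = \sum_{i_2=1}^{p_2} \sum_{i_3=1}^{p_3} W_{i_1 i_2 i_3} \, u_{2 i_2} u_{3 i_3}, \qquad \vu_1 \leftarrow \frac{\tW(:, \vu_2, \vu_3)}{\Vert \tW(:, \vu_2, \vu_3) \Vert},
\end{align*}
and similarly for $\vu_2$ and $\vu_3$. Once unit-norm components $\vu_1, \ldots, \vu_k$ are available, the scale that minimizes $\Vert \tW - \lambda \bigotimes_{i=1}^k \vu_i \Vert^2$ over $\lambda$ follows from $\Vert \bigotimes_{i=1}^k \vu_i \Vert = 1$:
\begin{align*}
    \lambda = \left\langle \tW, \bigotimes_{i=1}^k \vu_i \right\rangle = \tW(\vu_1, \ldots, \vu_k).
\end{align*}
Right after the update of $\vu_k$, this value equals $\Vert \tW(\vu_1, \ldots, \vu_{k-1}, :) \Vert$ and is therefore non-negative.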

\begin{figure*}[h!]
    \centering
    \input{tikz/errors_20_15_5.tex}
    %\input{tikz/errors_20_15_10_10.tex}
    \input{tikz/errors_10_7_5_15_13.tex}
    \caption{Theoretical versus empirical misclassification error of both the $\infty$-Ridge and Tensor-Ridge classifiers. We considered $n$ training data as order-$k$ tensors of dimensions $p_1\times \cdots\times p_k$ with $k\in\{3, 5\}$ having a rank-one structure as in~\eqref{eq_data_model} with the $\vmu_j$'s being randomly sampled vectors.}
    \label{fig_missclass_error}
\end{figure*}

 
In essence, extracting the rank-one component of $\tW$ acts as a denoising scheme that considerably reduces the variance of the decision function, thereby providing better classification accuracy. Given the above MLE, which we denote $\lambda^* \bigotimes_{j=1}^k \vu_j^* $, the tensor-based $\infty$-Ridge classifier, which we refer to as \textit{Tensor-Ridge} (TR), is defined for a new datum $\tilde \tX_i \in \mathcal C_a$ as 
\begin{align}
    f_{\text{TR}}(\tilde{\tX}_i) = \left\langle \lambda^* \bigotimes_{j=1}^k \vu_j^*, \tilde\tX_i \right\rangle \,\mathop{\lessgtr}_{\mathcal{C}_2}^{\mathcal{C}_1}\, 0.
\end{align}
%
We introduce the following quantities in~\eqref{eq_formulas}, taken from~\cite{seddik2021random}, which describe the behavior of an order-$k$ spiked random tensor model and will be used subsequently. 
\begin{align}\label{eq_formulas}
  \begin{cases}
    f(z, \beta) = z + g(z) - \beta \prod_{i=1}^k q_i (z),\\
    q_i(z) = \sqrt{ 1 - \frac{g_i^2(z)}{c_i} },
  \end{cases}
\end{align}
where $c_i = \lim_{p_i\to \infty} \frac{p_i}{\sum_{j=1}^k p_j}$ and $(g(z), g_i(z))$ are solutions to the following system
\begin{align}
    \begin{cases}
    g(z) = \sum_{i=1}^k g_i(z),\\
    g_i^2(z) - (g(z) + z) g_i(z) - c_i = 0.
    \end{cases}
\end{align}
%
Essentially, it was proved by~\cite{seddik2021random} that the above equations are well defined for $\beta$ greater than some threshold $\beta_s = \mathcal{O}(1)$. This threshold corresponds to the class separability condition on $\Vert \tM \Vert$ above which the MLE in \eqref{eq_MLE} starts to correlate with $\tM$.

Our next result characterizes the theoretical performance of the Tensor-Ridge classifier based on the above random tensor tools.

\begin{theorem}[Performance of the Tensor-Ridge classifier]\label{prop_gaussian_cp_based_matched_filter}
  Under Assumption~\ref{ass_growth}, for $\tilde\tX_i \in \mathcal{C}_a$ with $a\in \{1, 2\}$ independent from the training set $\tX$, 
  \begin{align*}
    \frac{1}{\sigma} \left( f_{\text{TR}}(\tilde{\tX}_i) - m_a \right) \toind \mathcal{N}(0, 1),
  \end{align*}
  where $m_a = (-1)^a \sigma \Vert \tM \Vert \prod_{j=1}^k q_j\left( \sigma \right) $ and $\sigma $ satisfies $f\left(\sigma, \Vert \tM \Vert \sqrt{\frac{n}{p}}\right) = 0$, with $q_j$ and $f$ defined in~\eqref{eq_formulas}. Furthermore, the misclassification error satisfies, with probability one, $\sP \left( (-1)^a f_{\text{TR}}(\tilde\tX_i) < 0 \mid \tilde\tX_i \in \mathcal{C}_a \right) - Q\left( \frac{ \vert m_a\vert }{\sigma} \right) \to 0$.
\end{theorem}
\begin{proof}[Sketch of proof]
  The proof relies on estimating the expectation and the variance of the decision function $f_{\text{TR}}(\tilde\tX_i)$ for some $\tilde\tX_i \in \mathcal{C}_a$ with $a\in \{1, 2\}$ independent from the training set $\tX$. Indeed, one finds that $\EE f_{\text{TR}}(\tilde\tX_i) = \EE\left[ (-1)^a \Vert \tM \Vert \lambda^* \prod_{j=1}^k \langle \frac{\vmu_j}{\Vert \vmu_j\Vert }, \vu_j^* \rangle \right]$, where the quantities $\lambda^*$ and $\langle \frac{\vmu_j}{\Vert \vmu_j\Vert }, \vu_j^* \rangle$ are estimated using~\eqref{eq_formulas}: $\lambda^*\to \sigma$ with $\sigma$ satisfying $f(\sigma, \Vert \tM \Vert \sqrt{\frac{n}{p}}) = 0$, and $\langle \frac{\vmu_j}{\Vert \vmu_j\Vert }, \vu_j^* \rangle \to q_j(\sigma)$. The variance of $f_{\text{TR}}(\tilde\tX_i)$ is computed similarly, and we find $\Var [f_{\text{TR}}(\tilde\tX_i) ] = \sigma^2$. See the supplementary material for the detailed proof.
\end{proof}
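
As a worked reading of Theorem~\ref{prop_gaussian_cp_based_matched_filter}, the argument of $Q$ can be simplified. The following is only a sketch: it assumes that $Q$ denotes the standard Gaussian tail function, $Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^\infty e^{-t^2/2}\, dt$, that $\sigma > 0$, and that the limiting alignments satisfy $q_j(\sigma) \geq 0$. Under these assumptions,
\begin{align*}
    \frac{\vert m_a \vert}{\sigma} = \frac{\sigma \Vert \tM \Vert \prod_{j=1}^k q_j(\sigma)}{\sigma} = \Vert \tM \Vert \prod_{j=1}^k q_j(\sigma),
\end{align*}
so the asymptotic misclassification error reads $Q\left( \Vert \tM \Vert \prod_{j=1}^k q_j(\sigma) \right)$. Since each $q_j(\sigma)$ is the limit of an inner product between two vectors that we take to be of unit norm, it is at most one, and the argument of $Q$ cannot exceed $\Vert \tM \Vert$: the error is driven by the signal strength, discounted by the product of the alignments along the $k$ modes.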

\begin{figure*}[t!]
  \begin{center}
    %\includegraphics[width=\textwidth]{figs/mf_cp_algo.pdf}
    \input{tikz/heatmaps.tex}
  \end{center}
  \caption{Theoretical misclassification error in terms of the signal strength $\Vert \tM \Vert$ and the ratio $p/n$ for order-three tensors of size $(p, p, p)$, for both $\infty$-Ridge and Tensor-Ridge as per Theorems~\ref{prop_gaussian_matched_filter} and~\ref{prop_gaussian_cp_based_matched_filter} respectively. The third plot from the left corresponds to polynomial-time Tensor-Ridge, which is possible for $\Vert \tM\Vert $ larger than $\mathcal{O}(p^{\frac{1}{4}})$, while the last plot corresponds to the oracle classifier, which assumes perfect knowledge of $\tM$.}
    \label{fig_phase_diagram_supervised}
\end{figure*}

\begin{remark}[On the assumptions]
  Theorem~\ref{prop_gaussian_cp_based_matched_filter} requires additional assumptions (e.g., Assumption 3 from~\citep{seddik2021random}). We highlight that this assumption is rather technical and needs the introduction of various notions (e.g., defining the block-wise contracted matrix introduced by~\citep{seddik2021random}). However, note that Assumption 3 therein is always satisfied by the maximum likelihood estimator when the SNR is larger than some $\mathcal{O}(1)$ constant. In our notation the SNR corresponds to the quantity $\Vert \tM\Vert$ which controls the difficulty of the classification problem.
\end{remark}

Theorem~\ref{prop_gaussian_cp_based_matched_filter} states that the performance of the Tensor-Ridge classifier depends solely on $\Vert \tM \Vert$ and the ratio $\frac{p}{n}$, and does not depend on the dimension $P$ as was the case for the $\infty$-Ridge classifier in Theorem~\ref{prop_gaussian_matched_filter}. This highlights that the variance $\sigma^2$ for the Tensor-Ridge classifier remains constant as $n\to \infty$. We can further observe this from Figure~\ref{fig_mf} which shows that Tensor-Ridge yields a lower variance. %, thereby yielding a better classification accuracy for small values of the number of training samples $n$}. 

Figure~\ref{fig_missclass_error} depicts the theoretical versus empirical misclassification error for both methods. In particular, it shows that the Tensor-Ridge classifier yields drastically better performance (close to that of the oracle, which assumes perfect knowledge of $\tM$) when $n$ is small or, alternatively, when the dimension of the data is high. Note that the empirical curves for the Tensor-Ridge classifier are obtained with tensor power iteration initialized with tensor SVD, which converges in polynomial time if $\Vert \tM\Vert $ is larger than $\mathcal{O}\left( P^{\frac14} / p^{\frac12} \right)$, as discussed previously. In particular, the last line in Figure~\ref{fig_missclass_error} highlights this phenomenon: the power iteration does not always converge when the tensor order increases.

%\begin{figure*}[t!]
%    \centering
    %\includegraphics[width=\textwidth]{figs/order31.pdf}
    %\includegraphics[width=\textwidth]{figs/order32.pdf}
    %\includegraphics[width=\textwidth]{figs/order4.pdf}
    %\includegraphics[width=\textwidth]{figs/order5.pdf}
%    \caption{Theoretical versus empirical misclassification error of both $\infty$-Ridge classifier (MF) and CP-based $\infty$-Ridge classifier (CP-MF) classifiers. We considered $n$ training data as $k$-order tensors with $k\in\{3, 4, 5\}$ of dimensions $p_i$'s having a rank-one structure as in~\eqref{eq_data_model} with the $\vmu_j$'s being randomly sampled vectors.}
    %\label{fig_missclass_error_order}
%\end{figure*}

Moreover, Figure~\ref{fig_phase_diagram_supervised} depicts the misclassification error of both methods as the ratio $p/n$ and $\Vert \tM \Vert$ vary. It shows that the Tensor-Ridge classifier performs better for large values of $p/n$ in theory (second plot from the left). More interestingly, the third plot depicts the computationally achievable performance, which corresponds to the algorithmic threshold $\Vert \tM\Vert \geq \mathcal{O}\left( P^{\frac14} / p^{\frac12} \right)$, thereby highlighting the superiority of the tensor-based approach even algorithmically. The last plot corresponds to perfect knowledge of $\tM$ and provides insight into the effect of the noise component in the considered data model.
%This toy example clearly demonstrates that one can benefit from the underlying data structure, if such information is available. We will see that these conclusions also extend to an unsupervised setting, where no labels are provided.



\paragraph{Generalization to higher-rank data:} Our results generalize to a more complex model of the following form. Suppose that the $\tX_i$'s are distributed in two classes $\mathcal{C}_1$ and $\mathcal{C}_2$ (of cardinality $n_1$ and $n_2$ respectively), such that for $\tX_i\in \mathcal{C}_a$ with $a\in \{1, 2\}$, 
\begin{align}\label{eq_general_data_1}
    \tX_i = \sum_{\ell=1}^{r_a} \bigotimes_{j=1}^k \vmu_{j,\ell}^{(a)} + \tZ_i \in \sR^{p_1\times \cdots \times p_k},
\end{align}
where $\tZ_i$ is a random tensor with i.i.d. standard Gaussian entries, and the $\vmu_{j,\ell}^{(a)}\in \sR^{p_j}$ are independent from $\tZ_i$ and satisfy $\langle \vmu_{j,\ell_1}^{(a)}, \vmu_{j,\ell_2}^{(a)} \rangle = \delta_{\ell_1 \ell_2}$. That is, the data tensors $\tX_i$ have a rank-$r_a$ structure with orthogonal components, where $r_a$ is independent of the dimensions $p_j$.

Let us denote $\tM_a = \sum_{\ell=1}^{r_a} \bigotimes_{j=1}^k \vmu_{j,\ell}^{(a)}$ the mean tensor of class $\mathcal{C}_a$. In a supervised setting, it is convenient to center the data by subtracting\footnote{In real scenarios one would first estimate the $\tM_a$'s with their empirical estimates through tensor decomposition.} $\frac12(\tM_1 + \tM_2)$ from each data sample which yields tensors of the form
\begin{align}\label{eq_general_data}
    \tX_i = (-1)^a \left( \tM_1 - \tM_2 \right) + \tZ_i,
\end{align}
where $\tM_1 - \tM_2$ is clearly a low-rank tensor (of rank $r_1 + r_2$) with orthogonal components. Stacking all the data samples $\tX_i$ in a data tensor $\tX\in \sR^{p_1\times \cdots \times p_k \times n}$, the $\infty$-Ridge classifier has a weight tensor of the form
\begin{align}
    \tW = \frac{1}{\sqrt{np}} \tX \times_{k+1} \vy = \sqrt{\frac{n}{p}} \tM + \frac{1}{\sqrt{p}}\tilde \tZ,
\end{align}
where $\tilde \tZ = \frac{1}{\sqrt{n}} \sum_{i=1}^n y_i \tZ_i$ and $\tM = \tM_1 - \tM_2 = \sum_{\ell=1}^{r_1 + r_2} \bigotimes_{j=1}^k \vmu_{j,\ell}$ is a rank-$(r_1 + r_2)$ tensor. Therefore, the Tensor-Ridge classifier for this case relies on a low-rank approximation of $\tW$ of rank $r_1 + r_2$ which can be obtained through standard tensor decomposition methods (e.g., tensor deflation \citep{ge2021understanding}).
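
To see where this decomposition comes from, here is a short sketch. It assumes that the labels are $y_i = (-1)^a$ for $\tX_i \in \mathcal{C}_a$ (so that $y_i^2 = 1$), so that the centered model in~\eqref{eq_general_data} can be written as $\tX_i = y_i \tM + \tZ_i$. Then
\begin{align*}
    \frac{1}{\sqrt{np}} \tX \times_{k+1} \vy
    = \frac{1}{\sqrt{np}} \sum_{i=1}^n y_i \left( y_i \tM + \tZ_i \right)
    = \frac{n}{\sqrt{np}} \tM + \frac{1}{\sqrt{np}} \sum_{i=1}^n y_i \tZ_i
    = \sqrt{\frac{n}{p}} \tM + \frac{1}{\sqrt{p}} \tilde \tZ.
\end{align*}
Treating the labels as deterministic signs, each $y_i \tZ_i$ has i.i.d. standard Gaussian entries by symmetry, and so does the normalized sum $\tilde \tZ$. Under these assumptions, $\tW$ is a low-rank signal of strength scaling as $\sqrt{n/p}$ observed in Gaussian noise, which is the setting in which a low-rank tensor approximation is natural.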
The following theorem characterizes the performance of the Tensor-Ridge classifier in this more general case.
\begin{theorem}[Performance of the Tensor-Ridge classifier for data model in~\eqref{eq_general_data}]
  Under Assumption~\ref{ass_growth}, for $\tilde \tX_i \in \mathcal{C}_a $ with $a \in \{1, 2\}$ independent from the training set $\tX$,
  \begin{align*}
    \frac{1}{\sqrt{\sum_{\ell=1}^{r_1 + r_2} \sigma_\ell^2}} \left( f_{\text{TR}}(\tilde \tX_i) - m_a \right) \toind \mathcal{N}(0, 1),
  \end{align*}
  where $m_a = (-1)^a \sum_{\ell=1}^{r_1 + r_2} \sigma_\ell \mu_\ell \prod_{j=1}^k q_j(\sigma_\ell )$, $\mu_\ell = \Vert \bigotimes_{j=1}^k \vmu_{j,\ell} \Vert$, $\sigma_\ell$ satisfies $f(\sigma_\ell, \mu_\ell \sqrt{\frac{n}{p}}) = 0$, and $q_j$ and $f$ are defined in~\eqref{eq_formulas}. Furthermore, with probability one, the misclassification error satisfies $\sP \left( (-1)^a f_{\text{TR}}(\tilde\tX_i) < 0 \mid \tilde\tX_i \in \mathcal{C}_a \right) - Q\left( \frac{ \vert m_a\vert }{\sqrt{\sum_{\ell=1}^{r_1 + r_2} \sigma_\ell^2}} \right) \to 0$.
\end{theorem}
\begin{proof}
  The proof strategy is the same as for Theorem~\ref{prop_gaussian_cp_based_matched_filter}.
\end{proof}


\subsection{Unsupervised Setting}\label{sec_unsupervised}
In a setting where only $n$ training samples $\tX_1,\ldots, \tX_n$ are provided without their corresponding labels, one would rely on unsupervised learning to cluster them into classes. Given the data model in~\eqref{eq_data_tensor},
%without loss of generality, we further assume that the data are ordered following their class order, 
a simple unsupervised learning approach~\citep{ng2002spectral} consists in unfolding $\tX$ into an $n\times P$ matrix as
\begin{align}
    \mX = \mat_{k+1}(\tX) = \vy\vect(\tM)^\top + \mat_{k+1}(\tZ),
\end{align}
then estimating the labels $\vy$ through the dominant eigenvector of the Gram matrix $\mX \mX^\top$ denoted $\hat{\vy}$, which coincides with the dominant left singular vector of $\mX$. The theoretical performance of this \textit{linear spectral method} is given by the following theorem.
\begin{theorem}[Performance of linear spectral clustering]\label{prop_linear_clustering}
  Let $\hat{\vy}$ be the left singular vector of $\mX$ corresponding to its largest singular value. The estimated class for the datum $\tX_i$ is given as $\hat{\mathcal{C}}_i = \sign(\hat{y}_i)$. Then under Assumption~\ref{ass_growth},
  \begin{align*}
    \frac{1}{\sigma}\left( \sqrt{n} \hat{y}_i - \alpha y_i \right)\toind \mathcal{N}(0, 1),
  \end{align*}
  where $\alpha = \kappa\left(\Vert \tM \Vert \sqrt{ \frac{n}{P + n}}, \frac{n}{P+n} \right)^{-1}$, $\sigma = \sqrt{1 - \alpha^2}$ and $\kappa(\beta, c) = \beta \sqrt{\frac{ \beta^{2} \left(\beta^{2} + 1\right) - c \left(c - 1\right) }{ (\beta^4 + c(c-1)) \left( \beta^{2} + 1 - c \right)}}$ defined for $\beta > (c(1-c))^{\frac14}$. Furthermore, the misclassification error is given with probability one by $Q \left( \frac{\alpha}{ \sqrt{1 - \alpha^2} } \right)$.
\end{theorem}

\begin{proof}
  See supplementary material.
\end{proof}
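
The error expression in Theorem~\ref{prop_linear_clustering} can be read off from the Gaussian limit. As a sketch, assume labels $y_i \in \{-1, +1\}$, $\alpha \in (0, 1)$, that the global sign ambiguity of $\hat{\vy}$ has been resolved so that it correlates positively with $\vy$, and that $Q$ denotes the standard Gaussian tail function. Writing $\sqrt{n}\, \hat{y}_i \approx \alpha y_i + \sigma g$ with $g \sim \mathcal{N}(0, 1)$, the datum $\tX_i$ is misclassified when $y_i \hat{y}_i < 0$, hence
\begin{align*}
    \sP\left( y_i \hat{y}_i < 0 \right)
    \approx \sP\left( \alpha + \sigma y_i g < 0 \right)
    = \sP\left( g < -\frac{\alpha}{\sigma} \right)
    = Q\left( \frac{\alpha}{\sqrt{1 - \alpha^2}} \right),
\end{align*}
where the second step uses $y_i^2 = 1$ and the fact that $y_i g$ and $g$ have the same distribution, and the last step uses $\sigma = \sqrt{1 - \alpha^2}$. In this reading, the error equals $\frac12$ when $\alpha = 0$ (no correlation with the labels) and vanishes as $\alpha \to 1$.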

\begin{figure}[t!]
    \centering
      \begin{center}
    %\includegraphics[width=.49\textwidth]{figs/clustering.pdf}
    \input{tikz/clustering.tex}
  \end{center}
  \caption{Left: the $2D$ projection space obtained by linear clustering. Right: the $2D$ projection space obtained by Tensor-based clustering through a rank-two CP decomposition of $\tX$. We considered $k=2$ and $n_1 = n_2 = 500$, and the data are tensors $\tX_i$ of shape $(15, 30, 20)$ generated according to the model in~\eqref{eq_data_model} with $\Vert \tM\Vert = 3$. The ellipses correspond to the theoretical means and fluctuations according to Theorems~\ref{prop_linear_clustering} and~\ref{prop_CP_clustering} respectively.} 
    \label{fig_clustering}
\end{figure}

Theorem~\ref{prop_linear_clustering} states that each entry of the estimated left singular vector corresponding to the largest singular value of $\mX$ is asymptotically a Gaussian random variable, whose mean and variance depend on $\Vert \tM\Vert$ and the ratio $c = \frac{n}{P+n}$. 
Essentially, in order to obtain a non-zero correlation between $\hat\vy$ and $\vy$, the signal strength $\Vert \tM \Vert $ must be greater than $\frac{\sqrt[4]{c(1-c)}}{\sqrt{c}}$. However, under Assumption~\ref{ass_growth}, the ratio $\frac{n}{P+n} \to 0$ as $n\to \infty$, thereby yielding a high misclassification error. Indeed, Figure~\ref{fig_clustering} (left) depicts the $2D$ projection space spanned by the eigenvectors of $\mX\mX^\top$ associated with its two largest eigenvalues, along with its theoretical mean and fluctuations as per Theorem~\ref{prop_linear_clustering}. Note that the eigenvector associated with the second largest eigenvalue of $\mX\mX^\top$ is not informative about the classes. In fact, its entries have zero mean and variance $1/n$, which is a classical result from random matrix theory \citep{o2016eigenvectors}.
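
The threshold of linear clustering can be made explicit in terms of $P$ and $n$. With $\beta = \Vert \tM \Vert \sqrt{c}$ and $c = \frac{n}{P+n}$, the condition $\beta > (c(1-c))^{\frac14}$ of Theorem~\ref{prop_linear_clustering} reads
\begin{align*}
    \Vert \tM \Vert > \frac{(c(1-c))^{\frac14}}{c^{\frac12}} = \left( \frac{1-c}{c} \right)^{\frac14} = \left( \frac{P}{n} \right)^{\frac14},
\end{align*}
since $1 - c = \frac{P}{P+n}$. As a purely illustrative check, plugging in the setting of Figure~\ref{fig_clustering}, namely $P = 15 \times 30 \times 20 = 9000$ and $n = n_1 + n_2 = 1000$, gives $(P/n)^{\frac14} = 9^{\frac14} = \sqrt{3} \approx 1.73$, which is below the value $\Vert \tM \Vert = 3$ used there. This substitutes finite sizes into an asymptotic statement and should therefore be read as a rough guide rather than a sharp prediction.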

In contrast, extracting the rank-one structure of the data tensor allows us to improve the classification performance. Indeed, given the data model in~\eqref{eq_data_tensor}, computing a rank-$1$ approximation of $\tX$ and extracting the corresponding $(k+1)$-th mode component yields a better estimate of the label vector $\vy$. More precisely, the following theorem characterizes the performance of \textit{Tensor-based clustering}.

\begin{theorem}[Performance of Tensor-based clustering]\label{prop_CP_clustering}
  Let $\hat{\vy}$ be the $(k+1)$-th mode component of the rank-$1$ tensor approximation of $\tX$. The estimated class for the datum $\tX_i$ is given as $\hat{\mathcal{C}}_i = \sign(\hat{y}_i)$. Then under Assumption~\ref{ass_growth},
  \begin{align*}
    \frac{1}{\sigma}\left( \sqrt{n} \hat{y}_i - \alpha y_i \right)\toind \mathcal{N}(0, 1),
  \end{align*}
  where $\alpha = q_{k+1}\left( \lambda^* \right)$, $\sigma = \sqrt{1 - \alpha^2}$ with $q_{k+1}(\cdot)$ defined by~\eqref{eq_formulas} for a tensor of order $k+1$ and $\lambda^*$ is the unique solution to $f\left(\lambda^*, \Vert \tM \Vert \sqrt{\frac{n}{p+n}}\right) = 0$. Furthermore, the misclassification error is given with probability one by $Q \left( \frac{\alpha}{ \sqrt{1 - \alpha^2} } \right)$.
\end{theorem}
\begin{proof}
  See supplementary material.
\end{proof}

\begin{remark}
    Generalizing the unsupervised setting to the data model in~\eqref{eq_general_data_1} is more challenging, since the data tensor $\tX$ in this case does not follow a CP decomposition but rather a block-term decomposition~\citep{rontogiannis2021block}, which is harder to analyze theoretically and is therefore left for future investigation.
\end{remark}

As for the linear clustering approach, the label vector $\hat{\vy}$ estimated with tensor clustering has Gaussian entries centered on the labels $\vy$ scaled by a factor $\alpha$, with fluctuations that also depend on $\alpha$. However, the clustering performance now depends on $\Vert \tM\Vert$ and the ratio $\frac{n}{p+n}$, so the clustering performance stays the same as $n$ and $p$ increase at the same rate. Figure~\ref{fig_clustering} (right) depicts the $2D$ projection space obtained by a rank-two CP decomposition of $\tX$ with its theoretical mean and fluctuations as per Theorem~\ref{prop_CP_clustering}. From Figure~\ref{fig_clustering} we clearly see that tensor-based clustering yields lower variance than the classical linear approach, thereby allowing better clustering performance. This improvement is to be expected given the knowledge of the underlying rank-one structure, but our results allow the exact characterization of the performance gap between both methods.
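
To give a sense of scale for the two ratios, consider the setting of Figure~\ref{fig_clustering}, assuming that $p$ denotes the sum of the mode dimensions and $P$ their product, so that $p = 15 + 30 + 20 = 65$, $P = 15 \times 30 \times 20 = 9000$ and $n = 1000$. Then
\begin{align*}
    \frac{n}{P+n} = \frac{1000}{10000} = 0.1,
    \qquad
    \frac{n}{p+n} = \frac{1000}{1065} \approx 0.94,
\end{align*}
so that, with $\Vert \tM \Vert = 3$, the rescaled signal strengths are $\Vert \tM \Vert \sqrt{\frac{n}{P+n}} \approx 0.95$ for the linear method and $\Vert \tM \Vert \sqrt{\frac{n}{p+n}} \approx 2.91$ for the tensor method. These two quantities enter different functions ($\kappa$ in Theorem~\ref{prop_linear_clustering} and $f$, $q_{k+1}$ in Theorem~\ref{prop_CP_clustering}) and are thus not directly comparable; the computation only illustrates that unfolding dilutes the signal through the much larger dimension $P$.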

\begin{figure}[t!]
    \centering
    %\includegraphics[width=.49\textwidth]{figs/clustering_error.pdf}
    \input{tikz/phase_transition_clustering.tex}
    \caption{Theoretical misclassification errors in terms of the signal strength $\Vert \tM \Vert$ for both linear and Tensor-based clustering as per Theorems~\ref{prop_linear_clustering} and~\ref{prop_CP_clustering} respectively. We considered $n_1=n_2=100$ and data are tensors of shape $(15, 30, 10)$.}
    \label{fig_clustering_error}
\end{figure}

To illustrate the comparison between linear and Tensor-based clustering, we depict the misclassification errors of both methods in terms of $\Vert \tM \Vert$ in Figure~\ref{fig_clustering_error}. Essentially, in order to have a correlation between $\hat{\vy}$ and $\vy$, the signal strength $\Vert \tM \Vert$ must be greater than some $\mathcal{O}(1)$ threshold in theory. However, in order to estimate the label signal in practice in polynomial time, $\Vert \tM \Vert$ must be greater than $ \mathcal{O} \left( (P\times n)^{\frac14} / (p+n)^{\frac12} \right) $, which coincides with the phase transition of linear clustering from Theorem~\ref{prop_linear_clustering}. Importantly, Figure~\ref{fig_clustering_error} depicts three different regions: (i) impossible: it is information-theoretically impossible to recover the clusters or even detect them, in the sense that the output of any clustering method is provably independent of the true classes; (ii) NP-hard: no polynomial-time algorithm can recover the label signal; and (iii) possible: recovery is possible in polynomial time (e.g., using tensor power iteration initialized with tensor SVD, as discussed previously). Figure~\ref{fig_clustering_error} clearly highlights the benefit of Tensor-based clustering over linear clustering if the data has an underlying low-rank structure. Notably, the performance of the different approaches is accurately estimated by Theorems~\ref{prop_linear_clustering} and~\ref{prop_CP_clustering}.


\section{Concluding remarks}
This paper has presented a theoretical analysis of learning from tensor data that have a hidden low-rank structure. 
Both analytical and empirical assessments suggest that a considerable performance gain can be achieved by exploiting such low-rank tensor structure when few training samples are available, and this gain is accurately quantified for the considered statistical model in~\eqref{eq_data_model}. 

As such, the paper explicitly demonstrates the application of \textit{random tensor theory} to evaluate the performance of simple learning methods (such as the considered Tensor-Ridge classifier), whose behavior had not so far been theoretically understood.
This paves the way for more systematic theoretical analysis and improvement of sophisticated machine learning algorithms when dealing with tensor-structured data. In particular, our present analysis can be extended to the understanding of the CP-regressor \citep{zhouIEEE2013}, which essentially consists of a Ridge regressor with a low-rank tensor prior and is better suited to low-rank tensor data with covariance structure.

% References
\bibliography{uai2023}

\end{document}
