% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{xr-hyper}
\makeatletter
\newcommand*{\addFileDependency}[1]{% argument=file name and extension
  \typeout{(#1)}% latexmk will find this if $recorder=0 (however, in that case, it will ignore #1 if it is a .aux or .pdf file etc and it exists! if it doesn't exist, it will appear in the list of dependents regardless)
  \@addtofilelist{#1}% if you want it to appear in \listfiles, not really necessary and latexmk doesn't use this
  \IfFileExists{#1}{}{\typeout{No file #1.}}% latexmk will find this message if #1 doesn't exist (yet)
}
\makeatother
\newcommand*{\myexternaldocument}[1]{%
    \externaldocument{#1}%
    \addFileDependency{#1.tex}%
    \addFileDependency{#1.aux}%
}
\myexternaldocument{ling_484-supp}

\usepackage{float}
\usepackage{amssymb,amsmath,bm}
\newtheorem{prop}{Proposition}
\newtheorem{corollary}{Corollary}
\newtheorem{proof}{Proof}
\newtheorem{remark}{Remark}
\newtheorem{definition}{Definition}
\DeclareMathOperator{\arcsinh}{arcsinh}

\usepackage{subfig}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Dimension Reduction for High-dimensional Small Counts with KL Divergence}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:Yurong Ling <yurong.ling.16@ucl.ac.uk>?Subject=Your UAI 2021 paper}{Yurong Ling}{}} % Lead author
\author[1]{Jing-Hao Xue}
% Add affiliations after the authors
\affil[1]{%
    Department of Statistical Science\\
    University College London\\
    London, UK
}

  \begin{document}
\maketitle

\begin{abstract}
   Dimension reduction for high-dimensional count data with a large proportion of zeros is an important task in various applications. As a large number of dimension reduction methods rely on a proximity measure, we develop a dissimilarity measure, based on the Kullback-Leibler divergence, that is well suited for small counts. We compare the proposed measure with other widely used dissimilarity measures and show that the proposed one has superior discriminative ability when applied to high-dimensional count data having an excess of zeros. Extensive empirical results, on both simulated and publicly available real-world datasets that contain many zeros, demonstrate that the proposed dissimilarity measure can improve a wide range of dimension reduction methods.
\end{abstract}

\section{Introduction}\label{sec:intro}
High-dimensional count data, especially those with a large proportion of zeros, are omnipresent in various fields, such as ecology and genomics~\citep{VST_problem,Townes2019,Svensson2020}.
Dimension reduction (DR) techniques are used to extract useful information from high-dimensional count data, by eliminating noisy/uninformative dimensions of the data.
Owing to the mean-variance dependency that is often observed in count data, it is inappropriate to apply to such data standard DR methods that are optimal under the normality assumption, such as principal component analysis (PCA)~\citep{pca,pca1,ppca} and the Gaussian process latent variable model (GPLVM)~\citep{gplvm}.

Hence, to perform DR on count data, a number of specific strategies/methods have been proposed. A common strategy is to first apply a variance-stabilizing transformation (VST) to the data~\citep{VST3,VST}, aiming to make the data more Gaussian-like, and then feed the transformed data into standard DR approaches. 
The transformation function is specifically chosen to remove the mean-variance dependency. Popular transformation functions include the square root, logarithm, and inverse hyperbolic sine functions. 
Despite the widespread use of VSTs, they are only guaranteed to work well with large counts~\citep{VST3,VST} and cannot reasonably be expected to stabilize the variance of small counts containing a large fraction of zeros~\citep{VST_problem_1,VST_problem}. 
Rather than making count data more normally distributed, several approaches directly model the original data. Assuming that count data follow exponential family distributions, PCA variants maximize the likelihood of the observed data to obtain the low-dimensional representation~\citep{GPCA,BGPCA,SEPCA,SPPCA}. Under the same distributional assumption, a robust estimator of the covariance matrix has been derived, and the data of reduced dimension are obtained from the eigendecomposition of this estimator~\citep{ePCA}. 
Nonnegative matrix factorization (NMF) acquires the low-dimensional representation by factorizing the count data matrix into two nonnegative matrices of low rank~\citep{NMF_bregman,NMF_beta,NMF_alpha_beta}. Despite the popularity of NMF and PCA variants, it is unclear whether they perform well on count data having an excess of zeros.

Unlike the aforementioned works, we focus on developing measures that can reliably quantify the pairwise dissimilarity for small count data, motivated by the importance of the proximity matrix in common DR frameworks. Specifically, many DR approaches seek to preserve properties of a proximity matrix of high-dimensional data when reducing the dimension of the data. Examples of such approaches include PCA with the Euclidean distance matrix and the Gram matrix~\citep{kent1979multivariate,gplvm}, GPLVM with the Gram matrix~\citep{gplvm}, multidimensional scaling (MDS) with an input dissimilarity matrix~\citep{mds}, and t-distributed stochastic neighbour embedding (tSNE) with the matrix of the Gaussian kernels~\citep{tsne}.
Therefore, a proximity measure that properly quantifies the dissimilarity between small-count data points could benefit a wide range of DR methods. 

The two core contributions of this paper can be summarized as follows. First, we develop two dissimilarity measures for small count data based on the Kullback-Leibler (KL) divergence~\citep{kullback} and the assumption that the data follow either Poisson or negative binomial (NB) distributions. 
We take both Poisson and NB distributions into account, as it is common to model count data with these two types of distributions~\citep{zeileis2008regression,bayesianpoisson,nb1,Townes2019,Kim2020}.
Furthermore, to reliably calculate the KL divergence, we propose to use empirical Bayes estimators to estimate the distributional parameters.
Secondly, we propose an index to evaluate the discriminative abilities of different dissimilarity measures and show that the measure developed with the NB assumption has superior discriminative ability compared with other widely used dissimilarity measures for high-dimensional small counts, in terms of their statistical behaviours.
Moreover, consistent with our statistical investigation, the experimental results, on both real and simulated count data, also demonstrate that the measure obtained with the NB assumption is superior to other measures when handling small counts. 

The rest of this paper is organized as follows. First, we present standard transformations for count data in Section~\ref{sec:VSTs}.
We then derive two new dissimilarity measures with the KL divergence and the empirical Bayes estimators in Section~\ref{sec:KL}.
Secondly, we propose an index which evaluates the discriminative ability of a dissimilarity measure (Section~\ref{sec:index}) and compare different dissimilarity measures according to the proposed index.
It is shown that, when applied to small counts, the Euclidean distance of the transformed data exhibits better discriminative ability than the original Euclidean distance, although the corresponding VST is unable to stabilize the variances (Section~\ref{sec:compare_VSTs}).
More importantly, the measure obtained with the NB assumption is expected to perform the best when used for separating different distributions of small count data (Section~\ref{sec:compare_VSTs_KL}). 
Lastly, we present the experimental results of representative DR methods with different measures on both real and simulated datasets (Section~\ref{sec:experiments}). 

\section{Dissimilarity measures for count data}
\label{sec:VSTs_KL}
In this section, we first present widely used VSTs for count data and then derive two new dissimilarity measures with the KL divergence and the empirical Bayes estimators for small counts. 

\subsection{Variance-stabilizing transformations (VSTs)}
\label{sec:VSTs}
A VST is a transformation applied to data so that the variance of the transformed data is independent of their mean. Most VSTs for count data are developed by assuming that the data follow either a Poisson or an NB distribution~\citep{VST3,VST}.
Let $y$ be the raw counts. The square root transformation
\begin{equation}
    g_r(y) = \sqrt{y+\frac{3}{4}}
\end{equation}
is a popular technique for stabilizing the variance of a Poisson random variable. For an NB random variable $y$ which counts the number of successes and has the PMF $\binom{y+r-1}{y} (1-p)^r p^{y}$, where $p$ is the probability of success and $r$ represents the number of failures, a prevalent transformation is
\begin{equation}
    g_{\mathrm{asin}}(y) =\arcsinh\sqrt{\frac{y+\frac{3}{8}}{r - \frac{3}{4}}},
\end{equation}
where $\arcsinh$ is the inverse hyperbolic sine function.
Since $g_{\mathrm{asin}}(y)$ requires approximate knowledge of $r$, which in some cases cannot be estimated well enough, a simpler logarithm transformation with a pseudocount of $1$ is preferred in practice; it is given by
\begin{equation}
    g_{\mathrm{log}}(y) = \mathrm{log}(y+1).
\end{equation}
As mentioned before, these transformations fail to stabilize the variance of small counts. Thus, there is no guarantee that the Euclidean distances of the data transformed from raw counts by these VSTs perform well on small counts. 
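
To illustrate why the stabilization breaks down for small counts, consider the following sketch, which assumes $y \sim \mathrm{Pois}(\lambda)$ with $\lambda \rightarrow 0$ and a transformation $g$ of at most polynomial growth (both assumptions are ours, made only for illustration). Since $\Pr(y=0) = 1-\lambda+O(\lambda^2)$, $\Pr(y=1) = \lambda+O(\lambda^2)$, and $\Pr(y\geq 2) = O(\lambda^2)$, a first-order expansion gives
\begin{equation*}
    \mathrm{Var}\left[g(y)\right] = \lambda\left[g(1)-g(0)\right]^2 + O(\lambda^2).
\end{equation*}
For $g_r$, we have $\left[g_r(1)-g_r(0)\right]^2 = \left(\sqrt{7/4}-\sqrt{3/4}\right)^2 \approx 0.21$, so the variance of the transformed data is roughly $0.21\lambda$ for small $\lambda$. It therefore still scales with the mean, instead of settling at the constant $\frac{1}{4}$ that the square root transformation approaches for large $\lambda$.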

\subsection{Two new dissimilarity measures developed with KL divergence}
\label{sec:KL}
The KL divergence is a statistical measure of how one probability distribution is different from a second, reference probability distribution~\citep{kullback}. For discrete probability distributions $P$ and $Q$ defined on the same probability space $\mathbf{Z}$, the KL divergence is defined as
$
    D_{\mathrm{KL}}(P \mid Q) = \sum_{z \in \mathbf{Z}} P(z) \mathrm{log}\frac{P(z)}{Q(z)}.
$
The KL divergence for continuous random variables can be defined similarly by replacing the sum with the integral. 
For a pair of univariate normal distributions, $P:\mathcal{N}(\mu_{x}, \sigma^2)$ and $Q:\mathcal{N}(\mu_{y}, \sigma^2)$, we have 
$D_{\mathrm{KL}}
     \left [ \mathcal{N}(\mu_{x}, \sigma^2) \mid \mathcal{N}(\mu_{y}, \sigma^2) \right ]
= \frac{(\mu_{x} - \mu_{y})^2}{2\sigma^2}$.
The squared Euclidean distance $D_{E}^2$ between two vectors $\mathbf{x}=[x_1,\ldots,x_p]^T, \mathbf{y}=[y_1,\ldots,y_p]^T \in \mathbb{R}^p$ equals, up to the constant factor $\frac{1}{2\sigma^2}$, the sum across dimensions of the KL divergences between two univariate normal distributions whose means are estimated by the maximum likelihood estimators (MLEs). This equivalence is shown by the following equation:
\begin{align*}
\begin{split}
    &\sum_{i=1}^{p}\hat{D}_{\mathrm{KL}}
    \left [ \mathcal{N}(\mu_{ix}, \sigma^2) \mid \mathcal{N}(\mu_{iy}, \sigma^2)\right ]
     = \sum_{i=1}^{p}\frac{(\hat{\mu}_{ix} - \hat{\mu}_{iy})^2}{2\sigma^2} \\
    & = \sum_{i=1}^{p}\frac{(x_i - y_i)^2}{2\sigma^2},
\end{split}
\end{align*}
where $x_i$ and $y_i$ are the MLEs of mean parameters of the normal distributions on the $i$-th dimension of $\mathbf{x}$ and $\mathbf{y}$, respectively, when there is only one realisation observed for each distribution. 

Motivated by the equivalence between $D_{E}^2$ and the KL divergence, we propose to quantify the pairwise dissimilarity for count data with the KL divergence.
To calculate the KL divergence, the distribution type and the corresponding parameter values must be specified. 
For the distribution type, we assume that the observed data follow either Poisson or NB distributions, which are commonly used for modelling count data. 
Regarding the parameter estimation, a straightforward estimator is the MLE. However, using the MLE incurs a numerical problem in practice. 
To clarify this problem, we first derive two dissimilarity measures with the MLEs for Poisson and NB distributions, respectively.  
Suppose $x_i$ and $y_i$ follow $\mathrm{Pois}(\lambda_{ix})$ and $\mathrm{Pois}(\lambda_{iy})$, respectively. The respective MLEs of $\lambda_{ix}$ and $\lambda_{iy}$ are $x_i$ and $y_i$. The KL divergence between $\mathbf{x}$ and $\mathbf{y}$ with these MLEs is thus
\begin{equation}\label{eq:kl_mle_poi}
\begin{split}
    &\sum_{i=1}^{p}\hat{D}_{KL}
    \left [ \mathrm{Pois}(\lambda_{ix})
    \mid
    \mathrm{Pois}(\lambda_{iy})\right ] 
     = \sum_{i=1}^{p} \left[y_i - x_i \right.\\
    &+\left. x_i\log \frac{x_i}{y_i} \right].
\end{split}
\end{equation}
Analogously, we suppose $x_i$ and $y_i$ follow NB$(r,p_{ix})$ and NB$(r,p_{iy})$, respectively, with known $r$. 
Note that there are multiple definitions of the NB distribution and we use the following one: the PMF of NB$(r,p_{ix})$ for $x_i$ is $\binom{x_i+r-1}{x_i} (1-p_{ix})^r p_{ix}^{x_i}$ and similarly for NB$(r,p_{iy})$,
%that of NB$(r,p_{iy})$ for $y_i$ is $\binom{y_i+r-1}{y_i} (1-p_{iy})^r p_{iy}^{y_{i}}$, 
where $p_{ix}$ and $p_{iy}$ are the probabilities of success, $r$ represents the number of failures, and $x_i$ and $y_i$ count the numbers of successes.
The respective MLEs of $p_{ix}$ and $p_{iy}$ are $\frac{x_i}{x_i+r}$ and $\frac{y_i}{y_i+r}$.
%$\mathrm{log}\frac{x_i}{x_i+r}$ and $\mathrm{log}\frac{y_i}{y_i+r}$. 
The KL divergence between $\mathbf{x}$ and $\mathbf{y}$ with the MLEs is given by
\begin{equation}\label{eq:kl_mle_nb}
\begin{split}
    &\sum_{i=1}^{p}\hat{D}_{KL}
    \left [ \mathrm{NB}(r,p_{ix})
    \mid
    \mathrm{NB}(r,p_{iy})\right ] \\
    & = \sum_{i=1}^{p}
     \left[ r\mathrm{log}\frac{y_i+r}{x_i+r} + x_i \mathrm{log}\frac{x_i(y_i+r)}{y_i(x_i+r)} \right].
\end{split}
\end{equation}
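For completeness, Equation~(\ref{eq:kl_mle_nb}) can be recovered from the following short calculation, sketched here under the PMF convention stated above. For NB distributions with a common $r$, the binomial coefficients cancel in the log-ratio of the PMFs, so that
\begin{align*}
    &D_{\mathrm{KL}}\left[\mathrm{NB}(r,p_{ix}) \mid \mathrm{NB}(r,p_{iy})\right] \\
    &= r\log\frac{1-p_{ix}}{1-p_{iy}} + \frac{r p_{ix}}{1-p_{ix}}\log\frac{p_{ix}}{p_{iy}},
\end{align*}
where $\frac{r p_{ix}}{1-p_{ix}}$ is the mean of $\mathrm{NB}(r,p_{ix})$. Substituting the MLEs gives $1-\hat{p}_{ix} = \frac{r}{x_i+r}$ and $\frac{r\hat{p}_{ix}}{1-\hat{p}_{ix}} = x_i$ (and likewise for $y_i$), which yields the summand in Equation~(\ref{eq:kl_mle_nb}).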
The dissimilarity measures presented in Equation~(\ref{eq:kl_mle_poi}) and Equation~(\ref{eq:kl_mle_nb}) both involve logarithm terms whose arguments are zero or undefined whenever $x_i = 0$ or $y_i = 0$, and thus zeros in count data cause a numerical problem.
Further, since the MLEs are close to the true parameter values only if the number of observations is sufficiently large, the MLEs calculated from a single observation are unreliable, and so are $\hat{D}_{KL}
    \left [ \mathrm{Pois}(\lambda_{ix})
    \mid
    \mathrm{Pois}(\lambda_{iy})\right ]$
and $\hat{D}_{KL}
    \left [ \mathrm{NB}(r,p_{ix})
    \mid
    \mathrm{NB}(r,p_{iy})\right ]$.

To address these issues, we propose to use the empirical Bayes estimators rather than the MLEs.
The conjugate priors of Poisson and NB distributions are employed for estimating the parameters ($\lambda_{ix},\ \lambda_{iy}, \ p_{ix},\ p_{iy}$). In addition, the hyperparameters of these priors are learned from the data themselves, sidestepping to some degree the difficulty of specifying proper priors.  
Concretely, for the Poisson means ($\lambda_{ix},\ \lambda_{iy}$), we specify a Gamma prior distribution $G(m_i, 1)$, where the shape parameter $m_i$ is the mean value of the $i$-th dimension across all data points and the second parameter is the scale parameter. For the probability parameters of NB distributions ($p_{ix},\ p_{iy}$), we specify a Beta prior distribution $B(m_i, r)$. 
Note that the mean value can be thought of as an additional observation.
With these priors, the posterior distribution of $\lambda_{ix}$ is $G(m_i+x_i, \frac{1}{2})$ and that of $\lambda_{iy}$ is $G(m_i+y_i, \frac{1}{2})$. 
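These Gamma posteriors follow from the standard conjugate update, which we sketch for convenience (the symbol $\pi$ for a posterior density is our notation). Multiplying the prior kernel by the Poisson likelihood of the single observation $x_i$ gives
\begin{align*}
    \pi(\lambda_{ix} \mid x_i) &\propto \lambda_{ix}^{m_i-1}e^{-\lambda_{ix}} \cdot \lambda_{ix}^{x_i}e^{-\lambda_{ix}} \\
    &= \lambda_{ix}^{m_i+x_i-1}e^{-2\lambda_{ix}},
\end{align*}
which is the kernel of a Gamma distribution with shape $m_i+x_i$ and scale $\frac{1}{2}$; the case of $\lambda_{iy}$ is identical with $y_i$ in place of $x_i$.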
The posterior means, which are $\frac{m_i+x_i}{2}$ and $\frac{m_i+y_i}{2}$, respectively, are used as the estimated distributional parameters. Analogously, we obtain the posterior mean $\frac{m_i+x_i}{m_i+x_i+2r}$ from the posterior distribution $B(m_i+x_i, 2r)$ and $\frac{m_i+y_i}{m_i+y_i+2r}$ from $B(m_i+y_i, 2r)$, as the estimated distributional parameters for NB distributions. Now we obtain the KL divergence between $\mathbf{x}$ and $\mathbf{y}$ with the Bayes estimators (posterior means): 
\begin{equation}\label{eq:d_p_and_nb}
\begin{split}
    &\hat{D}_{KL}^{Bayes}
    \left [ \mathrm{Pois}(\lambda_{ix})
    \mid
    \mathrm{Pois}(\lambda_{iy})\right ] \\
    & = \frac{1}{2}
     \sum_{i=1}^{p} \left[ y_i - x_i + (x_i+m_i) \log \frac{x_i+m_i}{y_i+m_i} \right],
    \\
    &\hat{D}_{KL}^{Bayes}
    \left [ \mathrm{NB}(r,p_{ix})
    \mid
    \mathrm{NB}(r,p_{iy})\right ]\\
    &=  \sum_{i=1}^{p}
   \left[ r\mathrm{log}\frac{y_i+m_i+2r}{x_i+m_i+2r} \right.\\
   &+ \left.\frac{x_i+m_i}{2} \mathrm{log}\frac{(x_i+m_i)(y_i+m_i+2r)}{(y_i+m_i)(x_i+m_i+2r)}
    \right].
\end{split}
\end{equation}
The logarithm terms in Equation~(\ref{eq:d_p_and_nb}) are well defined for $m_i>0$, which is easily satisfied in practice as only meaningless features in the form of all zeros have $m_i=0$.
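As a small numerical illustration, with values chosen by us purely for exposition and natural logarithms assumed, take a single dimension with $x_i = 2$, $y_i = 0$, and $m_i = 0.5$. The corresponding summand of Equation~(\ref{eq:kl_mle_poi}) contains $x_i\log\frac{x_i}{y_i} = 2\log\frac{2}{0}$, which is infinite, whereas the Poisson summand of Equation~(\ref{eq:d_p_and_nb}) remains finite:
\begin{align*}
    \frac{1}{2}\left[0 - 2 + 2.5\log\frac{2.5}{0.5}\right]
    &= \frac{1}{2}\left[-2 + 2.5\log 5\right] \\
    &\approx 1.01.
\end{align*}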
Owing to the asymmetry of the KL divergence, we propose to use 
\begin{equation}
\begin{split}
   &D_{P}^2 = 
   \hat{D}_{KL}^{Bayes}
   \left [ \mathrm{Pois}(\lambda_{ix})
   \mid
   \mathrm{Pois}(\lambda_{iy})\right ] \\
   &+
   \hat{D}_{KL}^{Bayes}
   \left [ \mathrm{Pois}(\lambda_{iy})
   \mid
   \mathrm{Pois}(\lambda_{ix})\right ] \\
    &D_{NB}^2 = 
    \hat{D}_{KL}^{Bayes}
    \left [ \mathrm{NB}(r,p_{ix})
    \mid
    \mathrm{NB}(r,p_{iy})\right ]\\
    &+
   \hat{D}_{KL}^{Bayes}
   \left [ \mathrm{NB}(r,p_{iy})
   \mid
   \mathrm{NB}(r,p_{ix})\right ] 
   \end{split}
\end{equation}
to measure the pairwise dissimilarity for count data. 
Table~\ref{tb:sum_measures} lists the dissimilarity measures that we take into account in this paper. Note that for $D_P$ and $D_{NB}$ we ignore their multiplicative constant $\frac{1}{2}$ for conciseness. 
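
The closed forms of $D_P$ and $D_{NB}$ in Table~\ref{tb:sum_measures} follow from Equation~(\ref{eq:d_p_and_nb}) by a short calculation, sketched here as a reading aid. Adding the Poisson divergence to its counterpart with $x_i$ and $y_i$ exchanged cancels the linear terms $y_i - x_i$ and leaves
\begin{align*}
    D_P^2 &= \frac{1}{2}\sum_{i=1}^{p}\left[(x_i+m_i) - (y_i+m_i)\right]\log\frac{x_i+m_i}{y_i+m_i} \\
    &= \frac{1}{2}\sum_{i=1}^{p}(x_i-y_i)\left[\log(x_i+m_i)-\log(y_i+m_i)\right].
\end{align*}
For the NB case, the terms $r\log\frac{y_i+m_i+2r}{x_i+m_i+2r}$ and $r\log\frac{x_i+m_i+2r}{y_i+m_i+2r}$ cancel, and the remaining terms combine in the same way to give
\begin{align*}
    D_{NB}^2 = \frac{1}{2}\sum_{i=1}^{p}&(x_i-y_i)\left[\log\frac{x_i+m_i}{x_i+m_i+2r}\right.\\
    &\left.-\log\frac{y_i+m_i}{y_i+m_i+2r}\right].
\end{align*}
Dropping the factor $\frac{1}{2}$ and taking square roots gives the table entries.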

\begin{table}[t]
\centering
\caption{Dissimilarity measures and their equations.}\label{tb:sum_measures}
\resizebox{1\linewidth}{!}
{\begin{tabular}{ll}
\hline
Measure                     & Equation                                    \\ \hline
$D_E(\mathbf{x},\mathbf{y})$              
& $\left [ \sum_{i=1}^{p} (x_i  -y_i)^2 \right ]^{\frac{1}{2}}$                     \\ 
\hline
$D_r(\mathbf{x},\mathbf{y})$              
& $\left [ \sum_{i=1}^{p} (g_r(x_i)  -g_r(y_i))^2\right ]^{\frac{1}{2}}$         \\ 

$D_{\mathrm{asin}}(\mathbf{x},\mathbf{y})$         
& $\left [ \sum_{i=1}^{p} (g_\mathrm{asin}(x_i)  -g_\mathrm{asin}(y_i))^2\right ]^{\frac{1}{2}}$    \\ 

$D_{\mathrm{log}}(\mathbf{x},\mathbf{y})$ 
& $\left [ \sum_{i=1}^{p} (g_{\mathrm{log}}(x_i)  -g_{\mathrm{log}}(y_i))^2\right ]^{\frac{1}{2}}$                            \\
\hline
$D_{P}(\mathbf{x},\mathbf{y})$                                   
& $\left [ \sum_{i=1}^{p} (\mathrm{log}(x_i +m_i) - \mathrm{log}(y_i+m_i))(x_i - y_i) \right ]^{\frac{1}{2}}$                     \\ 

$D_{NB}(\mathbf{x},\mathbf{y})$                                  
&$\left [ \sum_{i=1}^{p} \left(\mathrm{log}\frac{x_i+m_i}{x_i+m_i+2r}  - \mathrm{log}\frac{y_i+m_i}{y_i+m_i+2r}\right)(x_i - y_i) \right ]^{\frac{1}{2}}$ \\ \hline
\end{tabular}}
\end{table}



\section{Comparison of dissimilarity measures for high-dimensional small counts}
\label{sec:comparison_index}
In this section, we compare different measures listed in Table~\ref{tb:sum_measures}, according to their abilities to distinguish distributions that tend to produce small counts. First, we propose an index to quantify the discriminative abilities of different dissimilarity measures. Then, based on the proposed index, we investigate and compare the statistical behaviours of different measures when the dimension is high and the count data consist of many zeros.

\subsection{Evaluation index}
\label{sec:index}
The main practical goal of DR is to eliminate noisy or uninformative dimensions of high-dimensional data and assist downstream classification/clustering algorithms in uncovering meaningful classes/clusters in the data. 
Different classes/clusters of count data can be characterized by different distributions, and thus a dissimilarity measure that distinguishes these distributions well could benefit the downstream analysis of the data when integrated into standard DR approaches. 
In this subsection, we propose an index to evaluate how well a dissimilarity measure separates those data points generated from different distributions and groups those from the same distribution.
The definition of the proposed index is given in Definition~\ref{def:R}.

\begin{definition}\label{def:R}
Suppose there are two count data distributions, denoted by $F_{\mathbf{x}}$ and $F_{\mathbf{y}}$, respectively. Let $S_X = \left \{ \mathbf{x}_1,\dots, \mathbf{x}_{n_x} \right \}$ be the set of samples generated from $F_{\mathbf{x}}$ and $S_Y = \left \{ \mathbf{y}_1,\dots, \mathbf{y}_{n_y} \right \}$ the set of samples from $F_{\mathbf{y}}$.
For a given dissimilarity measure $D(\cdot,\cdot)$, the proposed index $R \left(F_{\mathbf{x}},F_{\mathbf{y}}\right)$ is defined as
\begin{equation}\label{eq:R}
    \frac{\sum\limits_{\mathbf{x} \in S_X, \mathbf{y} \in S_Y } D^2(\mathbf{x},\mathbf{y})/(n_x n_y)} {\sum\limits_{\mathbf{x}_i,\mathbf{x}_j \in S_X,
    \mathbf{x}_i \neq \mathbf{x}_j} \!\frac{D^2(\mathbf{x}_j,\mathbf{x}_i)}{(n_x-1)n_x}
    \!+\! \sum\limits_{\mathbf{y}_i,\mathbf{y}_j \in S_Y,
    \mathbf{y}_i \neq \mathbf{y}_j} \!\frac{D^2(\mathbf{y}_j,\mathbf{y}_i)}{(n_y-1)n_y}}.
\end{equation}
\end{definition}
Note that here we consider only two distributions for simplicity, and to facilitate the following analysis we use the squared dissimilarity function. In the following, the subscript $*$ of $R_*\left(F_{\mathbf{x}},F_{\mathbf{y}}\right)$ will be that of the corresponding dissimilarity measure.
The mathematical objectives of many DR approaches are to preserve the global or local proximity of high-dimensional data, and the proposed index suits them in that $R\left(F_{\mathbf{x}},F_{\mathbf{y}}\right)$ assesses simultaneously the variation between data points from the same distribution (local proximity) and the separation between data points from different distributions (global proximity).
$R\left(F_{\mathbf{x}},F_{\mathbf{y}}\right)>1$ implies that using the corresponding dissimilarity measure $D(\cdot,\cdot)$ makes the separation between $F_{\mathbf{x}}$ and $F_{\mathbf{y}}$ greater than the within-distribution variation.
By construction, a higher value of $R\left(F_{\mathbf{x}},F_{\mathbf{y}}\right)$ would tend to indicate more powerful and robust discriminative ability of the corresponding measure in the presence of noisy dimensions, which possibly reduce the between-distribution separation and increase the within-distribution variation.

Before we dive into the comparison of measures using $R\left(F_{\mathbf{x}},F_{\mathbf{y}}\right)$, the statistical behaviour of $R\left(F_{\mathbf{x}},F_{\mathbf{y}}\right)$ in the high-dimensional space should be clarified. 
Proposition~\ref{prop:1} presents the behaviour of $R\left(F_{\mathbf{x}},F_{\mathbf{y}}\right)$ for the dissimilarity functions in a generic form: $D^2(\mathbf{x}, \mathbf{y})=\sum_{i=1}^{p}D^2(x_i,y_i) = \sum_{i=1}^{p}\left [ f(x_i) - f(y_i) \right ] \left [ g(x_i) - g(y_i) \right ]$, which covers all the measures presented in Table~\ref{tb:sum_measures}.
Proposition~\ref{prop:1} shows that $R\left(F_{\mathbf{x}},F_{\mathbf{y}}\right)$ converges in probability to a constant as the dimension $p$ grows, irrespective of the number of samples drawn from each distribution.
The covariance of two increasing functions ($f$,$g$) of a random variable is positive~\citep{covariance}, and thus the constant which $R\left(F_{\mathbf{x}},F_{\mathbf{y}}\right)$ converges to would be greater than $\frac{1}{2}$ iff $\left [\mathrm{E}f(x) - \mathrm{E}f(y) \right ] \left [ \mathrm{E}g(x) - \mathrm{E}g(y) \right ]>0$, which is readily satisfied in practice.
The convergence still holds under some mild conditions, such as with dependent dimensions and non-identical distributions.

\begin{prop}\label{prop:1}
Suppose points in $S_X \cup S_Y$ are independent, and the coordinates of each $\mathbf{x} \in S_X$ and each $\mathbf{y} \in S_Y$ are independently drawn from $1$-dimensional non-degenerate data distributions $F_x$ and $F_y$, respectively. 
For $D^2(\mathbf{x}, \mathbf{y})=\sum_{i=1}^{p}D^2(x_i,y_i) = \sum_{i=1}^{p}\left [ f(x_i) - f(y_i) \right ] \left [ g(x_i) - g(y_i) \right ]$ with $x_i \sim F_x$, $y_i \sim F_y$, where $f(\cdot)$ and $g(\cdot)$ are predetermined functions, if $ \mathbb{E}[D^2(x,y)]$, $ \mathbb{E}[D^2(x,\tilde{x})]$ and $ \mathbb{E}[D^2(y,\tilde{y})]$ exist for independent samples $\tilde{x},x \sim F_x$, $\tilde{y},y \sim F_y$, we have
\begin{align*}
\begin{split}
     &{R}_{D}\left(F_x, F_y\right) \overset{prob}{\rightarrow}
     \frac{1}{2} 
     + \frac{1}{2} \frac{\left [\mathrm{E}f(x) - \mathrm{E}f(y) \right ] \left [ \mathrm{E}g(x) - \mathrm{E}g(y) \right ]}
    {\mathrm{Cov}\left (f(x), g(x) \right ) 
    +\mathrm{Cov}\left (f(y), g(y) \right )},
\end{split}
\end{align*}
\end{prop}
where $\overset{prob}{\rightarrow}$ denotes the convergence in probability as the dimension $p$ goes to infinity.
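
The limiting constant can be understood through the following heuristic sketch, which is not a substitute for the full proof. After dividing the numerator and the denominator of Equation~(\ref{eq:R}) by $p$, each term $D^2(\mathbf{x},\mathbf{y})/p$ is an average of $p$ independent and identically distributed summands, so the law of large numbers suggests that it approaches the corresponding expectation. For independent $x \sim F_x$ and $y \sim F_y$, expanding the product gives
\begin{align*}
    \mathbb{E}[D^2(x,y)] &= \mathrm{Cov}\left(f(x),g(x)\right) + \mathrm{Cov}\left(f(y),g(y)\right) \\
    &\quad + \left[\mathrm{E}f(x)-\mathrm{E}f(y)\right]\left[\mathrm{E}g(x)-\mathrm{E}g(y)\right],
\end{align*}
while $\mathbb{E}[D^2(x,\tilde{x})] = 2\mathrm{Cov}\left(f(x),g(x)\right)$ and $\mathbb{E}[D^2(y,\tilde{y})] = 2\mathrm{Cov}\left(f(y),g(y)\right)$. The ratio of the first expectation to the sum of the latter two is the stated limit.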



\subsection{Compare the Euclidean distances with and without VSTs}
\label{sec:compare_VSTs}
In the following, we compare $D_E$ of the original data with the Euclidean distances of the transformed data, according to their behaviours when dealing with small counts in the high-dimensional space; that is, we compare them in terms of the respective constants that their $R\left(F_{x},F_{y}\right)$'s converge to as the dimension goes to infinity.
We first examine the discriminative ability of $D_E$ when count data are small. Corollary~\ref{cro:euc} provides the necessary and sufficient condition for $R_E \left(F_x, F_y\right)\overset{prob}{\rightarrow} c_E>1$. This condition implies that for any pair of Poisson distributions that generate small counts with mean values less than $1$, we obtain $c_E<1$; that is, $D_E$ cannot distinguish the two distributions well. Therefore, $D_E$ is expected to perform poorly when handling small counts. An example showing that $D_E$ is unable to distinguish two Poisson distributions with different patterns of small counts is provided in Section~\ref{secs:example} of Supplementary Material. 

\begin{corollary}\label{cro:euc}
With the same assumptions and notation as those in Proposition~\ref{prop:1}, for $D(\cdot, \cdot) = D_E(\cdot, \cdot)$, we have
\begin{enumerate}
    \item $R_E\left(F_x, F_y\right) \overset{prob}{\rightarrow} c_E \geq \frac{1}{2}$ for some constant $c_E$. The equality holds iff $\mathbb{E}(x) = \mathbb{E}(y)$ for $x \sim F_x$, $y \sim F_y$. 
    \item $c_E> 1$ iff $\left [ \mathbb{E}(x) \!-\!  \mathbb{E}(y) \right ]^2 > \mathrm{Var}\left ( x \right ) \!+\! \mathrm{Var}\left ( y \right )$.
\end{enumerate}
\end{corollary}
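
As a concrete instance of Corollary~\ref{cro:euc}, with notation and parameter values chosen by us for illustration, let $F_x = \mathrm{Pois}(\lambda_x)$ and $F_y = \mathrm{Pois}(\lambda_y)$. Taking $f$ and $g$ in Proposition~\ref{prop:1} to be the identity and using $\mathrm{Var}(x) = \lambda_x$ and $\mathrm{Var}(y) = \lambda_y$, the limit becomes
\begin{equation*}
    c_E = \frac{1}{2} + \frac{(\lambda_x-\lambda_y)^2}{2(\lambda_x+\lambda_y)}.
\end{equation*}
For $\lambda_x = 0.1$ and $\lambda_y = 0.9$, this gives $c_E = \frac{1}{2} + \frac{0.64}{2} = 0.82 < 1$, even though the two means differ by a factor of nine.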


We then investigate the behaviours of $R\left(F_{x},F_{y}\right)$'s of the Euclidean distances based on VSTs when either $F_{x}$ or $F_{y}$ generates small counts. Without loss of generality, we assume $F_{x}$ produces small counts and $F_{y}$ is an arbitrary distribution. 
Suppose there is a VST characterized by an increasing transformation function $g(\cdot)$, such that $g(y)\geq 0$ for $y\geq0$. Note that $g(\cdot)$ covers $g_r(y)$, $g_{\mathrm{asin}}(y)$, and $g_{\mathrm{log}}(y)$.
Let $D_E$ of the data transformed from raw counts by $g(\cdot)$ be $D_g$ and the corresponding index $R_g\left(F_{x},F_{y}\right)$.
Corollary~\ref{cro:VSTs} provides the difference between $c_g$ and $c_E$ when the proportion of zeros of each data point in $S_X$ moves toward $1$.
It shows that, as $\mu_x$ approaches $0$, $D_g$ is better suited for distinguishing data points than $D_E$ iff
$\frac{\left [g(0) - \mathrm{E}g(y) \right ]^2}{\mathrm{Var}\left[ g(y)\right]}
-\frac{\mathrm{E}^2\left (y \right )}
{\mathrm{Var}\left (y \right )} >0 $.


\begin{corollary}\label{cro:VSTs}
Suppose that $x$ and $y$ are non-negative random variables. Let the expectation of $F_x$ be $\mu_x$. With the same assumptions and notation as those in Proposition~\ref{prop:1}, we have  
\begin{align*}
    \lim_{\mu_x \rightarrow 0}\left(c_g - c_E\right)
    =
    \frac{1}{2}
     \left[\frac{\left [g(0) - \mathrm{E}g(y) \right ]^2}{\mathrm{Var}\left[ g(y)\right]}
     -\frac{\mathrm{E}^2\left (y \right )}
     {\mathrm{Var}\left (y \right )} \right],
\end{align*}
where $c_g$ and $c_E$ are the constants that $R_g\left(F_{x},F_{y}\right)$ and $R_E\left(F_{x},F_{y}\right)$ approach, respectively, as the dimension $p$ goes to infinity.
\end{corollary}

To illustrate the advantages of applying VSTs to small counts, we obtain values of $\lim\limits_{\mu_x \rightarrow 0}c_g= \frac{1}{2} + \frac{1}{2}\frac{\left [g(0) - \mathrm{E}g(y) \right ]^2}{\mathrm{Var}\left[ g(y)\right]}$ for Poisson distributions $F_y$'s with different mean values and transformation functions by numerical computation. 
Figure~\ref{fig:c_g} supplies the numerical results and shows that $g_r$, $g_{\mathrm{log}}$, and $g_{\mathrm{asin}}$ always result in a $c_g$ that is no less than $c_E$.
In particular, it is observed from Figure~\ref{fig:c_g}(a) that $c_{g}$'s exceed $1$ when the Poisson mean is higher than $0.8$, indicating that the corresponding measures distinguish better between data points with large proportions of zeros compared with $D_E$. Note that we assign a large value to $r$ in $c_{g_{\mathrm{asin}}}$ ($r = 1000$) since $\text{NB}(r,p)$ with a fixed mean approaches a Poisson distribution as $r$ goes to infinity.
The above analysis suggests that, although the VSTs are unable to stabilize the variances of small count data, they improve the discriminative ability over $D_E$. 
Proofs of Proposition~\ref{prop:1} and Corollary~\ref{cro:VSTs} are provided in Section~\ref{secs:proofs} of Supplementary Material.
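
To make the baseline in this comparison explicit, note that taking $g$ to be the identity in the limit of $c_g$ above gives $\lim\limits_{\mu_x \rightarrow 0} c_E = \frac{1}{2} + \frac{1}{2}\frac{\mathrm{E}^2\left(y\right)}{\mathrm{Var}\left(y\right)}$. As a worked instance (our own illustration), for $F_y = \mathrm{Pois}(\lambda)$ we have $\mathrm{E}(y) = \mathrm{Var}(y) = \lambda$, so
\begin{equation*}
    \lim_{\mu_x \rightarrow 0} c_E = \frac{1+\lambda}{2},
\end{equation*}
which exceeds $1$ only when $\lambda > 1$.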


\begin{figure}[t]
    \centering
    \subfloat[]{\includegraphics[width=0.492\linewidth]{figures/c_g.pdf}} ~
    \subfloat[]{\includegraphics[width=0.48\linewidth]{figures/c_g_1.pdf}}
    \caption{$c_g$ for different Poisson distributions and different transforms.}
    \label{fig:c_g}
\end{figure}



\subsection{Comparison of the two proposed measures with other dissimilarity measures}
\label{sec:compare_VSTs_KL}
In the following, we compare the proposed measures ($D_P$, $D_{NB}$) with the other dissimilarity measures in terms of the proposed index $R \left(F_{x},F_{y}\right)$ computed in the high-dimensional space. 
Note that the estimate $\hat{R}(F_x,F_y)$ is close to the constant that $R(F_x,F_y)$ approaches as long as the dimension is high enough. 
Taking a pair of distributions ($F_x$, $F_y$) and a pair of measures $(D_{NB}, D_{E})$ as an example, we regard $D_{NB}$ as superior to $D_{E}$ for distinguishing between $F_x$ and $F_y$ if $\hat{R}_{NB}(F_x,F_y) > \hat{R}_{E}(F_x,F_y)$. 
Further, to thoroughly evaluate their discriminative abilities for a specific distribution type, we compare their performance in terms of the fraction of parameter configurations for which $\hat{R}_{NB}(F_x,F_y)>\hat{R}_{E}(F_x,F_y)$. A fraction greater than $0.5$ suggests that $D_{NB}$ is better suited than $D_{E}$ for distinguishing this distribution type, and vice versa. 
The distribution types ($F_x$, $F_y$) taken into account are the widely used Poisson and NB distributions. 
$F_x$ and $F_y$ are of the same distribution type but with different parameter configurations.
It is worth mentioning that $R_{P}(F_x, F_y)$ and $R_{NB}(F_x, F_y)$ are the same when data are Poisson-distributed, because $D_{NB}$ approaches $D_{P}$ when the dispersion parameter $r$ goes to infinity. We thus exclude $D_{NB}$ from the comparison for Poisson-distributed data. 
More details on the simulations and the calculation of different measures are presented in Section~\ref{secs:sim_details} of the Supplementary Material.
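
The fraction used in this comparison can be written out explicitly. The notation below is introduced only for illustration: $K$ denotes the number of simulated parameter configurations and $(F_x^{(k)}, F_y^{(k)})$ the pair of distributions in the $k$-th configuration. For the pair of measures $(D_{NB}, D_E)$, a plausible formalization is
\begin{align*}
    \mathrm{frac}(D_{NB}, D_E)
    = \frac{1}{K}\sum_{k=1}^{K}
    \mathbf{1}\left\{ \hat{R}_{NB}\left(F_x^{(k)},F_y^{(k)}\right) > \hat{R}_{E}\left(F_x^{(k)},F_y^{(k)}\right) \right\},
\end{align*}
where $\mathbf{1}\{\cdot\}$ is the indicator function. Because the inequality is strict, $\mathrm{frac}(D_{NB}, D_E) + \mathrm{frac}(D_{E}, D_{NB}) \le 1$, with equality when no configuration produces a tie.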

\begin{table}[t]
\centering
\caption[Fraction that $\hat{R}(F_x,F_y)$ of a measure is greater than that of another measure for Poisson distributions.]{Fraction that $\hat{R}(F_x,F_y)$ of a measure is greater than that of another measure for Poisson distributions. The value in entry $(i,j)$ represents the fraction that $\hat{R}$ of the measure on the $i$-th row is greater than that of the measure on the $j$-th column. Top two measures are shown in bold.}
\label{tb:frac_poisson}
\resizebox{0.9\linewidth}{!}
{
\begin{tabular}{llllll | l}

\hline
Measures 
& $D_E$  & $D_{r}$  & $D_{\mathrm{asin}}$
& $D_{\mathrm{log}}$ & $D_{P}$ & Ave\\
\hline
$D_E$ &- &0.060 & 0.060 & 0.060 & 0.040  &0.055\\
$D_{r}$&0.940 &- &0.010 &0.080 &0.080 &0.295 \\
$D_{\mathrm{asin}}$
&0.940 &0.940 &- &0.010 &0.010 &0.520 \\
$D_{\mathrm{log}}$
&0.940  &0.900 &0.900 &- &0.520 &\textbf{0.815}\\
$D_{P}$
&0.960 &0.920 &0.900 &0.480 &- &\textbf{0.815} \\

\hline
\end{tabular}}
\end{table}


\begin{table}[t]
\centering
\caption{Fraction that $\hat{R}(F_x,F_y)$ of a measure is greater than that of another measure for NB distributions.}
\label{tb:frac_nb}
\resizebox{0.9\linewidth}{!}
{
\begin{tabular}{lllllll | l}

\hline
Measures 
& $D_E$  & $D_{r}$  & $D_{\mathrm{asin}}$
& $D_{\mathrm{log}}$ & $D_{P}$ &$D_{NB}$ & Ave\\
\hline

$D_E$ &- &0.240 &0.280 &0.242 &0.056 &0.050 &0.174\\
$D_{r}$
&0.760 &- &0.534 &0.498 &0.058 &0.054 &0.381
\\
$D_{\mathrm{asin}}$
&0.720 &0.466 &- &0.352 &0.106 &0.052 &0.339
\\
$D_{\mathrm{log}}$
&0.758 &0.502 &0.648  &- &0.112 &0.056 &0.415
\\
$D_{P}$
&0.944 &0.942 &0.894 &0.888 &- &0.070 &\textbf{0.748}
\\
$D_{NB}$
&0.950 &0.946 &0.948 &0.944 &0.930 &- &\textbf{0.944}    
\\ 

\hline
\end{tabular}}
\end{table}

From the simulations, we obtain $\hat{R}(F_x,F_y)$'s for a wide range of parameter configurations and different distribution types. As mentioned before, the measures are compared in terms of these $\hat{R}(F_x,F_y)$'s. Table~\ref{tb:frac_poisson} and Table~\ref{tb:frac_nb} report the comparisons of the measures when data follow Poisson and NB distributions, respectively. For small count data following Poisson distributions, $D_{P}$/$D_{NB}$ performs as well as $D_{\text{log}}$. 
Furthermore, $D_{NB}$ is superior to the other measures when count data are negative-binomially distributed. 
$D_{E}$, as anticipated, performs much worse than the other measures. 
The simulation results show that $D_{NB}$ is better than the other measures when distinguishing Poisson and NB distributions, and thus we expect that $D_{NB}$ outperforms the others when integrated into standard DR methods. 
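
The Ave column of Table~\ref{tb:frac_nb} can be reproduced from the row entries. The reading used here, that Ave is the plain mean of the off-diagonal entries in a row, is inferred from the reported numbers rather than stated in the captions. For the $D_{NB}$ and $D_{P}$ rows, respectively,
\begin{align*}
    \frac{0.950 + 0.946 + 0.948 + 0.944 + 0.930}{5} &= 0.9436 \approx 0.944, \\
    \frac{0.944 + 0.942 + 0.894 + 0.888 + 0.070}{5} &= 0.7476 \approx 0.748.
\end{align*}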

Although it is shown that $D_{NB}$ is superior to the other measures and $D_{P}$ is the second-best measure, we find that the calculation of $m_i$ affects their discriminative performance. 
Specifically, if the value of $m_i$ is closer to the average of the expected values of the two distributions ($F_x$, $F_y$), $D_{P}$ and $D_{NB}$ would have a better discriminative ability. 
We reason that the mean of the expected values can be regarded as a typical value that is informative for parameter estimation when used in the priors and is thus beneficial for the calculation of the KL divergence. 

\begin{table}[t]
\centering
\caption{Real scRNA-seq datasets used in this paper\label{tb:count_data}.}
\resizebox{1\linewidth}{!}{
\begin{tabular}{lrrrl}
\hline
Dataset  & \#clusters & \#cells & \#genes & prop of zeros \\
\hline
sc-CEL-seq2~\citep{Tian2019}  &3  &274  &22060   &0.678 \\
sc-CEL-seq2-5cl-p1~\citep{Tian2019} &5  &297  &15564  &0.608  \\  
sc-CEL-seq2-5cl-p2~\citep{Tian2019} &5  &307  &14078  &0.598 \\ 
sc-CEL-seq2-5cl-p3~\citep{Tian2019} &5  &305  &13426  &0.643  \\ 
Zheng8eq~\citep{Zheng}          &8  &3994 &13301  &0.957 \\
\hline
\end{tabular}}
\end{table}

\begin{table}[t]
\centering
\caption{Simulated scRNA-seq datasets used in this paper\label{tb:sim_data}.}
\resizebox{1\linewidth}{0.065\linewidth}{
\begin{tabular}{llrrrl}
\hline
Dataset & \#clusters & \#cells & \#genes & prop of zeros & corresponding real dataset\\
\hline
sim-Zheng8eq &8 &3994 &13770 & 0.969 & Zheng8eq~\citep{Zheng}\\

sim-manno-vm &5 &1977 & 19416 & 0.899 &manno-ventral-midbrain~\citep{manno}\\

sim-manno-ESCs &5&1715 &19459 & 0.834 &manno-ESCs~\citep{manno}\\
\hline
\end{tabular}}
\end{table}



\section{Experimental results}
\label{sec:experiments}
In this section, we present experimental results of representative DR methods with different dissimilarity measures on both real and simulated high-dimensional count datasets with large fractions of zeros. In addition, we compare the generalised PCAs (GPCAs)~\citep{GPCA} and NMF~\citep{NMF_bregman} with the proposed measure $D_{NB}$ in Section~\ref{secs:gpcas_nmf} of the Supplementary Material. 

\subsection{Datasets}
The high-dimensional count data considered in this paper are single-cell RNA sequencing (scRNA-seq) data with unique molecular identifiers (UMIs).
scRNA-seq data offer a unique opportunity to investigate the stochastic heterogeneity of complex tissues at a near-genome-wide scale~\citep{gku555,shapiro2013single,molecularcell}. 
scRNA-seq data with UMIs are often modelled by NB or Poisson distributions~\citep{Townes2019,Kim2020,Svensson2020} and exhibit large proportions of zero counts. 
We run experiments on both real and simulated scRNA-seq datasets. These datasets contain large proportions of zeros, ranging from $0.6$ to $0.97$. 
The characteristics of the real scRNA-seq datasets used in this paper are summarized in Table~\ref{tb:count_data}. 
Cluster labels provided with the real scRNA-seq datasets correspond to different cell types: labels for the datasets obtained from \citet{Tian2019} are assigned according to cancer cell lines, and those for the Zheng8eq dataset are based on the types of purified peripheral blood mononuclear cells. All the cluster labels reported in these datasets are defined independently of gene expression profiles and can be used as the ground-truth labels.
We simulate three additional scRNA-seq datasets using the R package Splatter~\citep{splatter}, with most of the parameters learned from real datasets except for the differential expression factors, which determine the difference between groups of cells, and the number of clusters. Information on the simulated datasets and the corresponding real datasets used for the simulations is summarized in Table~\ref{tb:sim_data}.
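
As a rough illustration of how small the counts behind these zero proportions are, consider the simplifying assumption, made here only for illustration, that all entries of a dataset are independent draws from a single Poisson distribution with mean $\lambda$. Writing $\pi_0$ for the observed proportion of zeros (a symbol introduced for this sketch), we have
\begin{align*}
    \Pr(y = 0) = e^{-\lambda} = \pi_0
    \quad\Longrightarrow\quad
    \lambda = -\log \pi_0 .
\end{align*}
The smallest and largest proportions in Tables~\ref{tb:count_data} and~\ref{tb:sim_data}, $\pi_0 = 0.598$ and $\pi_0 = 0.969$, would then correspond to $\lambda \approx 0.514$ and $\lambda \approx 0.031$, respectively. Real scRNA-seq data are heterogeneous across genes and cells and are often overdispersed, so these values indicate only the order of magnitude of the counts.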


\subsection{Evaluation}
\textbf{Representative DR methods.} We compare different measures with three representative DR methods: PCA, GPLVM, and tSNE. 
GPLVM and PCA seek to retain the global structure of data by preserving the pairwise proximity for all pairs of data points, while tSNE predominantly preserves the local structure with the pairwise proximity amongst neighbouring data points. 
Therefore, the DR results of PCA/GPLVM and those of tSNE are complementary.
The proposed measures and $D_E$'s of the data transformed by the VSTs are compared based on their performance when integrated into the DR approaches. 
Note that $r$ in $D_{NB}$ and $D_{\mathrm{asin}}$ is set to the common NB dispersion parameter estimated by the R package edgeR~\citep{edgeR}. As we treat mean values of features as pseudo observations when deriving the proposed measures, we replace each value $x$ in the data matrix with $\frac{x+m}{2}$, where $m$ is the mean value of the corresponding feature column, when estimating $r$.
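
A short calculation shows what this replacement does to a feature column before $r$ is estimated. The notation $x_1,\dots,x_n$ for the entries of one column, and the reading of $m$ as their sample mean $m = \frac{1}{n}\sum_{i} x_i$, are assumptions made for this sketch. Writing $\tilde{x}_i = \frac{x_i + m}{2}$, we obtain
\begin{align*}
    \frac{1}{n}\sum_{i=1}^{n} \tilde{x}_i &= \frac{m + m}{2} = m, \\
    \frac{1}{n}\sum_{i=1}^{n} \left(\tilde{x}_i - m\right)^2 &= \frac{1}{4}\cdot\frac{1}{n}\sum_{i=1}^{n} \left(x_i - m\right)^2 .
\end{align*}
That is, the column mean is unchanged, the empirical variance shrinks by a factor of four, and a zero count is mapped to $m/2$.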

\begin{figure}[t]
    \centering
     {\includegraphics[width=1\linewidth]{figures/GPLVM_VIS/sim_manno_vm_GPLVM_vis_VSTs.pdf}} 
    \caption{Visualization of the sim-manno-vm dataset obtained by GPLVMs with different measures. Different clusters are shown in different colours.}
    \label{fig:gplvm_vis_sim-manno-vm}
\end{figure}

\begin{figure}[t]
    \centering
     {\includegraphics[width=1\linewidth]{figures/PCA_VIS/sc_Celseq2_5cl_p1_pca_vis_VSTs.pdf}} 
    \caption{Visualization of the sc-CEL-seq2-5cl-p1 dataset obtained by PCA with different measures.}
    \label{fig:pca_vis_sc-CEL-seq2-5cl-p1}
\end{figure}

\textbf{Visualization.} As visualization is an important application of DR, we evaluate the DR methods with different dissimilarity measures by visually inspecting their DR results in a two-dimensional (2D) space.
A good visualization should exhibit well-separated groups of data. 

\textbf{Clustering.} Apart from visualization in a 2D space, DR techniques can also be used for improving clustering of high-dimensional data in real-world applications. For instance, high-dimensional scRNA-seq data are often projected into a low-dimensional space whose dimension could be greater than $2$, and clustering methods, such as $k$-means and hierarchical clustering, are performed on the dimension-reduced data to improve clustering~\citep{SUN2019,petegrosso2020machine}. Furthermore, it has been shown that applying $k$-means clustering in a PCA subspace can significantly improve clustering accuracy~\citep{ding2004k}. 
As clustering is an important downstream task of DR, we also evaluate the DR approaches with different measures based on the clustering performance in the dimension-reduced space. 

The $k$-means algorithm~\citep{kmeans,kmeans1} is used for inferring the cluster labels of data in the space of reduced dimension. The number of clusters in the $k$-means algorithm is set to the ground-truth number of clusters. The clustering performance is assessed in terms of the adjusted Rand index (ARI)~\citep{ari} between the cluster labels from the original publication/simulation and the inferred ones. The higher the ARI, the better the performance.
Normally, tSNE and GPLVM map high-dimensional data to a 2D space, and thus we consider only 2D projections when evaluating their clustering performance. 
Since the results of the tSNE algorithm can vary between runs, we repeat the procedure of first performing tSNE and then applying $k$-means $10$ times on each dataset for a more reliable comparison.
More experimental details are provided in Section~\ref{secs:evaluation} of the Supplementary Material.
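
For reference, the ARI can be written in its standard contingency-table form; the notation here is introduced for this sketch only. Let $n_{ij}$ be the number of data points with ground-truth label $i$ and inferred label $j$, let $a_i = \sum_j n_{ij}$ and $b_j = \sum_i n_{ij}$ be the row and column sums, and let $n$ be the total number of data points. Then
\begin{align*}
    \mathrm{ARI}
    = \frac{\sum_{i,j}\binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\right]\Big/\binom{n}{2}}
    {\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\right]\Big/\binom{n}{2}} .
\end{align*}
The ARI equals $1$ when the two labelings agree up to a permutation of the labels, and its expected value is $0$ for random labelings with the same cluster sizes.
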
\begin{figure*}[t]
    \centering
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/GPLVM_KMEANS/sc_Celseq2_5cl_p1_GPLVM_kmeans_VSTs.pdf}} ~
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/GPLVM_KMEANS/sc_Celseq2_5cl_p2_GPLVM_kmeans_VSTs.pdf}} 
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/GPLVM_KMEANS/sc_Celseq2_5cl_p3_GPLVM_kmeans_VSTs.pdf}} 
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/GPLVM_KMEANS/sce_sc_CELseq2_qc_GPLVM_kmeans_VSTs.pdf}} \\
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/GPLVM_KMEANS/full_Zhengmix8eq_GPLVM_kmeans_VSTs.pdf}} 
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/GPLVM_KMEANS/sim_Zheng8eq_GPLVM_kmeans_VSTs.pdf}}
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/GPLVM_KMEANS/sim_manno_GPLVM_kmeans_VSTs.pdf}}~
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/GPLVM_KMEANS/sim_manno_vm_GPLVM_kmeans_VSTs.pdf}} 
    \caption{ARI of $k$-means with GPLVMs and different dissimilarity measures on the following datasets: (a) sc-CEL-seq2-5cl-p1, (b) sc-CEL-seq2-5cl-p2, 
    (c) sc-CEL-seq2-5cl-p3, (d) sc-CEL-seq2, 
    (e) Zheng8eq, (f) sim-Zheng8eq, 
    (g) sim-manno-ESCs, and (h) sim-manno-vm.}
    \label{fig:Kmeans_gplvm}
\end{figure*}




\begin{figure*}[t]
    \centering
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/tSNE_KMEANS/sc_Celseq2_5cl_p1_tSNE_kmeans_VSTs.pdf}}
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/tSNE_KMEANS/sc_Celseq2_5cl_p2_tSNE_kmeans_VSTs.pdf}} 
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/tSNE_KMEANS/sc_Celseq2_5cl_p3_tSNE_kmeans_VSTs.pdf}} 
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/tSNE_KMEANS/sce_sc_CELseq2_qc_tSNE_kmeans_VSTs.pdf}} \\
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/tSNE_KMEANS/full_Zhengmix8eq_tSNE_kmeans_VSTs.pdf}} 
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/tSNE_KMEANS/sim_Zheng8eq_tSNE_kmeans_VSTs.pdf}} 
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/tSNE_KMEANS/sim_manno_tSNE_kmeans_VSTs.pdf}} 
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/tSNE_KMEANS/sim_manno_vm_tSNE_kmeans_VSTs.pdf}} 
    \caption{ARI of $k$-means with tSNE and different dissimilarity measures on the following datasets: (a) sc-CEL-seq2-5cl-p1, (b) sc-CEL-seq2-5cl-p2, 
    (c) sc-CEL-seq2-5cl-p3, (d) sc-CEL-seq2, 
    (e) Zheng8eq, (f) sim-Zheng8eq, 
    (g) sim-manno-ESCs, and (h) sim-manno-vm.}
    \label{fig:Kmeans_tsne}
\end{figure*}

\begin{figure*}[t]
    \centering
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/PCA_KMEANS/sc_Celseq2_5cl_p1_VSTs_kmeans.pdf}}
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/PCA_KMEANS/sc_Celseq2_5cl_p2_VSTs_kmeans.pdf}} 
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/PCA_KMEANS/sc_Celseq2_5cl_p3_VSTs_kmeans.pdf}} 
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/PCA_KMEANS/sce_sc_CELseq2_qc_VSTs_kmeans.pdf}} \\
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/PCA_KMEANS/full_Zhengmix8eq_VSTs_kmeans.pdf}} 
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/PCA_KMEANS/sim_Zheng8eq_VSTs_kmeans.pdf}} 
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/PCA_KMEANS/sim_manno_VSTs_kmeans.pdf}} 
     \subfloat[]{\includegraphics[width=0.24\linewidth]{figures/PCA_KMEANS/sim_manno_vm_VSTs_kmeans.pdf}} 
    \caption{ARI of $k$-means with PCA and different dissimilarity measures on the following datasets: (a) sc-CEL-seq2-5cl-p1, (b) sc-CEL-seq2-5cl-p2, 
    (c) sc-CEL-seq2-5cl-p3, (d) sc-CEL-seq2, 
    (e) Zheng8eq, (f) sim-Zheng8eq, 
    (g) sim-manno-ESCs, and (h) sim-manno-vm.}
    \label{fig:Kmeans_pca}
\end{figure*}



\subsection{Visualization}
In this subsection, we examine whether the application of $D_{NB}$ can produce better visualization. 

First, we visualize the dimension-reduced data obtained by GPLVMs with different measures in  
Figure~\ref{fig:gplvm_vis_sim-manno-vm} and Figures~\ref{fig:gplvm_vis_sc-CEL-seq2-5cl-p1}--\ref{fig:gplvm_vis_sim-manno-ESCs} of the Supplementary Material. 
It is observed that GPLVMs with $D_{NB}$, $D_{r}$, $D_{\mathrm{asin}}$, and $D_{\mathrm{log}}$ perform equally well on most datasets, the exceptions being the sim-manno-ESCs and sim-manno-vm datasets. For the sim-manno-ESCs dataset (Figure~\ref{fig:gplvm_vis_sim-manno-ESCs} of the Supplementary Material), GPLVMs with $D_P$ and $D_{NB}$ display well-grouped data in the 2D space while those with the other measures fail to do so. Furthermore, only the GPLVM with $D_{NB}$ can distinguish different groups of data points on the sim-manno-vm dataset (Figure~\ref{fig:gplvm_vis_sim-manno-vm}).

Second, we compare the visualization results obtained by tSNE with different measures in Figures~\ref{fig:tsne_vis_sc-CEL-seq2-5cl-p1}--\ref{fig:tsne_vis_sim-manno-vm} of the Supplementary Material. The tSNE algorithms with $D_{NB}$, $D_{r}$, $D_{\mathrm{asin}}$, and $D_{\mathrm{log}}$ produce well-distinguished groups of data on most datasets, the exceptions being the Zheng8eq dataset (Figure~\ref{fig:tsne_vis_Zheng8eq}), the sim-manno-ESCs dataset (Figure~\ref{fig:tsne_vis_sim-manno-ESCs}), and the sim-manno-vm dataset (Figure~\ref{fig:tsne_vis_sim-manno-vm}), where all the measures fail to recognize the groups of data.

Third, by comparing PCA results with different measures in the 2D space (Figure~\ref{fig:pca_vis_sc-CEL-seq2-5cl-p1} and Figures~\ref{fig:pca_vis_sc-CEL-seq2-5cl-p2}--\ref{fig:pca_vis_sim-manno-vm} of the Supplementary Material), we find that the PCA algorithms with $D_{NB}$, $D_{\mathrm{asin}}$, and $D_{\mathrm{log}}$ produce more distinguished groups of data on the sc-CEL-seq2-5cl-p1 dataset (Figure~\ref{fig:pca_vis_sc-CEL-seq2-5cl-p1}) and the sc-CEL-seq2-5cl-p2 dataset (Figure~\ref{fig:pca_vis_sc-CEL-seq2-5cl-p2}). For the sc-CEL-seq2-5cl-p3 dataset (Figure~\ref{fig:pca_vis_sc-CEL-seq2-5cl-p3}), the PCA with $D_{\mathrm{log}}$ presents three distinguished groups while those with the other measures display only two groups. The PCA with $D_{NB}$ and $D_{\mathrm{asin}}$ can separate the groups in the sc-CEL-seq2 dataset (Figure~\ref{fig:pca_vis_sc-CEL-seq2}). The PCA algorithms with the proposed measures and VSTs perform equally well on the Zheng8eq dataset (Figure~\ref{fig:pca_vis_Zheng8eq}) and the sim-Zheng8eq dataset (Figure~\ref{fig:pca_vis_sim-Zheng8eq}). For the sim-manno-ESCs dataset (Figure~\ref{fig:pca_vis_sim-manno-ESCs}), applying $D_r$, $D_{P}$, and $D_{NB}$ leads to more distinguished groups in the data. Furthermore, it is observed in Figure~\ref{fig:pca_vis_sim-manno-vm} that only the PCA with $D_{NB}$ can separate groups of data to some degree on the sim-manno-vm dataset while those with the other measures fail to do so.

It is found that the GPLVM/PCA with $D_{NB}$ often presents distinguished groups of data in the 2D space while the tSNE with $D_{NB}$ fails to do so on some datasets. This difference may be due to the characteristics of the DR methods rather than the dissimilarity measures themselves. As mentioned before, GPLVM/PCA aims to preserve the global structure of data and tSNE predominantly preserves the local structure. Thus, tSNE may fail to preserve the global structure (inter-cluster proximity) due to its preference for the local proximity. In such cases, visualizing clusters with tSNE would not work well. Although \citet{art-tsne} suggest using informative initialization or multi-scale similarities to improve the preservation of the global structure, we find that these strategies do not result in better-distinguished clusters in the 2D space for the Zheng8eq, sim-manno-ESCs, and sim-manno-vm datasets. 
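
The local preference of tSNE can be seen from its standard input similarities. The display below is a sketch: it uses a generic pairwise dissimilarity $d_{ij}$ between data points $i$ and $j$ and a per-point bandwidth $\sigma_i$, both introduced here for illustration, and it is an assumption that a measure such as $D_{NB}$ enters tSNE in place of the Euclidean distance in this way.
\begin{align*}
    p_{j \mid i}
    = \frac{\exp\left(-d_{ij}^2 / 2\sigma_i^2\right)}
    {\sum_{k \neq i}\exp\left(-d_{ik}^2 / 2\sigma_i^2\right)} .
\end{align*}
Since $p_{j \mid i}$ decays exponentially in $d_{ij}^2$, pairs that are far apart relative to $\sigma_i$ receive nearly zero weight. The embedding is therefore driven mostly by the nearest neighbours of each point, and inter-cluster distances are only weakly constrained.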

To sum up, the GPLVM and PCA with $D_{NB}$ can produce better visualization results than those with the other measures, while the tSNE with $D_{NB}$ may not distinguish clusters well due to its preference for the local structure of data. 



\subsection{Clustering results}
In the following, we compare different measures in terms of their clustering performance in the dimension-reduced spaces. The $k$-means clustering results with GPLVMs and different dissimilarity measures are shown in Figure~\ref{fig:Kmeans_gplvm}. It is observed that the GPLVM with $D_{NB}$ performs consistently well on most datasets except the sc-CEL-seq2-5cl-p2 dataset. Furthermore, the GPLVM with $D_{NB}$ obtains much higher ARI values than those with the measures based on VSTs on the sim-manno-ESCs and sim-manno-vm datasets.
To sum up, $D_{NB}$ outperforms the other measures when integrated into GPLVM.

Figure~\ref{fig:Kmeans_tsne} presents the $k$-means clustering results with tSNE and different measures. The tSNE with $D_{NB}$ performs comparably well on some datasets but not on the Zheng8eq, sim-manno-ESCs, and sim-manno-vm datasets. The clustering performance of the tSNE with $D_{NB}$ is consistent with its visualization results. 
As we discussed before, its relatively weak performance on the Zheng8eq, sim-manno-ESCs, and sim-manno-vm datasets may be due to tSNE's preference for the preservation of the local structure of data. 
Furthermore, the clustering performance of tSNE suggests that clustering on the outputs of DR techniques must be done with caution. DR approaches, such as the non-linear tSNE, may fail to preserve clusters and thus adversely affect the cluster analysis. 

It is observed in Figure~\ref{fig:Kmeans_pca} that $D_{NB}$ is superior to the other measures when combined with PCA, irrespective of the number of dimensions, on most datasets except for the sc-CEL-seq2-5cl-p3, Zheng8eq, and sim-Zheng8eq datasets. On these three datasets, $D_{NB}$ outperforms most measures when the dimension is greater than five. 


In summary, consistent with the visualization results, the clustering performance obtained by the representative GPLVM/PCA with $D_{NB}$ is superior to that obtained with the other measures. 

\section{Conclusion}
This paper investigates how to perform DR for high-dimensional small count data. We propose a dissimilarity measure $D_{NB}$ that is well-suited for count data with many zeros. 
The statistical behaviours of different dissimilarity measures when the dimension is high enough are investigated in terms of a proposed index. It is found that the proposed measure $D_{NB}$ is superior to the other measures in the sense that it distinguishes data points from different distributions better. Consistent with the statistical comparison, the experimental results demonstrate that $D_{NB}$ enhances a variety of standard DR approaches. 



\begin{acknowledgements} 
The authors thank the anonymous reviewers for helpful discussions and suggestions.
\end{acknowledgements}

\bibliography{ling_484}

\end{document}
