%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams


%----------------------------------------
% my packages and commands

\usepackage{amsmath}
\usepackage{graphicx}
%\usepackage[title]{appendix}
\usepackage{bm}
\usepackage{amsthm}
\usepackage{amssymb}
\usepackage{bbm}
\usepackage{verbatim}
% \usepackage{subfigure}
%\usepackage{subfig}
\usepackage{afterpage}
\usepackage{etoolbox}
%\usepackage{footbib}
\usepackage{float}
\usepackage{rotating}
\usepackage[inline]{enumitem}
\usepackage{diagbox}

\usepackage{multirow}
\usepackage{caption}
\usepackage[skip=0pt]{subcaption}
\newdimen\figrasterwd
\figrasterwd\textwidth

\newtheorem{lemma}{Lemma}
\DeclareMathOperator{\erf}{erf}
\usepackage{algorithmic}
\usepackage{algorithm}% http://ctan.org/pkg/algorithm
\setlength{\marginparwidth}{2cm}
%\usepackage[colorinlistoftodos]{todonotes}
\newcommand {\myvec}[1] {{\mbox{\boldmath $#1$}}}
\newcommand{\myx}{\myvec{x}}
\newcommand{\myX}{\myvec{X}}
\newcommand{\myy}{\myvec{y}}
\newcommand{\myY}{\myvec{Y}}
\newcommand{\myZ}{\myvec{Z}}
\newcommand{\myz}{\myvec{z}}
\newcommand{\myL}{\myvec{L}}
\newcommand{\myth}{\myvec{\theta}}
\newcommand{\mys}{\myvec{s}}
\newcommand{\tils}{\tilde{S}}
\newcommand{\prob}{\mathbb{P}}
\newcommand{\reals}{\mathbb{R}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\M}{\mathcal M}
%\usepackage[colorinlistoftodos]{todonotes}
\newcommand{\rev}[1]{{\color{black}{#1}}}
\newcommand{\del}[1]{{\color{red}{#1}}}
\newcommand{\snncomment}[1]{\todo{SNN: #1}}
\newcommand{\exval}{\mathbb{E}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\usepackage{hyperref}

\usepackage{titling}
\renewcommand\maketitlehooka{\null\mbox{}\vfill}
\renewcommand\maketitlehookd{\vfill\null}
%----------------------------------------






%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Multi-modal Differentiable Unsupervised Feature Selection}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<junchen.yang@yale.edu>?Subject=mmDUFS}{Junchen Yang}{}}
\author[2]{Ofir Lindenbaum}
\author[1,4,5]{Yuval Kluger}
\author[3]{Ariel Jaffe}
% Add affiliations after the authors
\affil[1]{%
    Interdepartmental Program in Computational Biology and Bioinformatics\\ Yale University\\
    New Haven, CT, USA
}
\affil[2]{%
    Faculty of Engineering, Bar-Ilan University, Israel
}
\affil[3]{%
    Department of Statistics and Data Science, Hebrew University of Jerusalem, Israel
  }
  \affil[4]{%
Applied Math Program, Yale University, New
Haven, CT, USA
}
\affil[5]{%
Department of Pathology, School of Medicine, Yale University, New Haven, CT, USA
  }
\begin{document}
\maketitle

\begin{abstract}
Multi-modal high-throughput biological data presents a great scientific opportunity and a significant computational challenge. In multi-modal measurements, every sample is observed simultaneously by two or more sets of sensors. In such settings, many observed variables in both modalities are often nuisance variables that carry no information about the phenomenon of interest. Here, we propose a multi-modal unsupervised feature selection framework that identifies informative variables based on coupled high-dimensional measurements. Our method is designed to identify features associated with two types of latent low-dimensional structures: (i) shared structures that govern the observations in both modalities, and (ii) differential structures that appear in only one modality. To that end, we propose two Laplacian-based scoring operators. We incorporate the scores with differentiable gates that mask nuisance features and enhance the accuracy of the structure captured by the graph Laplacian. The performance of the new scheme is illustrated using synthetic and real datasets, including an extended biological application to single-cell multi-omics.
\end{abstract}


\section{Introduction}\label{sec:intro}
In an effort to study biological systems, researchers are developing cutting-edge techniques that measure up to tens of thousands of variables at single-cell resolution. In recent years, research into the interplay between complex biological processes has inspired the development of multi-modal technologies that enable the simultaneous collection of measurements from two or more sets of sensors. Examples of such multi-modal technologies include SHARE-seq \citep{ma2020chromatin}, DBiT-seq \citep{liu2020high}, and CITE-seq \citep{stoeckius2017simultaneous}, which have provided biological insights and advancements in applications such as transcription factor characterization \citep{joung2023transcription}, cell type identification in the human hippocampus \citep{xiao2022spatially}, and immune cell profiling \citep{leblay2020cite}.

Multi-modal learning is a powerful tool widely used across multiple disciplines to extract latent information from high-dimensional measurements \citep{sun2013survey,yan2021deep}. Humans use complementary senses when attempting to ``estimate'' spoken words or sentences \citep{raij2000audiovisual}. For example, lip movements can help us distinguish between two syllables that sound similar. The same intuition has inspired statisticians and machine learning researchers to develop learning techniques that exploit information captured simultaneously by complementary measurement devices.

The applicability of multi-modal datasets in multiple domains has motivated the development of computational approaches tailored to multi-modal settings. Algorithms such as Contrastive Language–Image Pre-training
(CLIP) \citep{radford2021learning} and AudioCLIP \citep{guzhov2022audioclip} have pushed the performance boundaries of machine learning for image, text, and audio analysis and synthesis. The multi-modal data fusion task dates back to \citet{CCA1}, who proposed the celebrated Canonical Correlation Analysis (CCA). CCA has many extensions \citep{DCCA,lindenbaum2022lsparse} and applications in diverse scientific domains \citep{cca_bio,cca_fault}. Despite their tremendous success, both classical and advanced multi-modal schemes are often
unsuitable for analyzing biological data. The large number of nuisance variables, which frequently exceeds the number of measurements, causes correlation-based methods to overfit.

%lend themselves useless in biological systems. This is because high throughput biological data contains many nuisance variables, which may exceed the number of measurements, thus leading correlations-based methods to overfit.


To attenuate the influence of nuisance or noisy features, several authors have proposed unsupervised feature selection (UFS) schemes \citep{solorio2020review}.
UFS seeks small subsets of informative variables in order to improve downstream analysis tasks, such as clustering or manifold learning. Empirical results demonstrate that informative features are often smooth and reflect some latent structure \citep{degeest2018smoothness}. In practice, the smoothness of features can be evaluated based on how slowly they vary with respect to a graph \citep{he2005laplacian}. Follow-up works exploited this idea to identify informative features \citep{zhao2012spectral,shaham2021deep}. An alternative paradigm for UFS seeks subsets of features that can be used to reconstruct the entire data effectively \citep{balin2019concrete}.
%Emerging technologies in biology for measuring multimodal data include multi-omics data, such as scRNA-seq and ATAC-seq measured simultaneously. This multi-modal data integration can help understand the interplay between different biological systems \citep{subramanian2020multi,hu2021integration}. 


%However, there are still several challenges ahead of such applications. One challenge is properly combining the data from each modality to perform the joint analysis with different goals (e.g., discover shared underlying structures or structures that are only specific to each modality). In addition, certain modalities of single-cell multi-omic technologies might suffer from high-level noise, where the informative markers are often masked by the noisy features. Attenuating the influence of noisy or nuisance features is key for successfully performing joint data analysis of multiome data.
While most fusion methods focus on extracting information shared between modalities,
we propose a multi-modal UFS framework to identify
features associated with structures that appear in both modalities, as well as
features associated with \textit{modality-specific} structures that appear in only one modality.
%modality-specific latent structures, that appear Specifically, 
%we propose a multi-modal UFS framework to identify informative features that capture \textit{shared} or \textit{modality-specific} structures in high-dimensional datasets.
To capture the shared structure, we construct a symmetric shared graph Laplacian operator that enhances the shared geometry across modalities. We further propose differential graph operators that capture smooth structures that are not shared with the other modality. To perform multi-modal feature selection, we incorporate differentiable gates \citep{yamada2020feature} with the \textit{shared} and \textit{modality-specific} graph Laplacian scoring functions. This leads to a differentiable UFS scheme that attenuates the influence of nuisance features during training and computes a more accurate Laplacian matrix \citep{lindenbaum2021differentiable}.



Our contributions are fourfold: (i) we develop \textit{shared} and \textit{modality-specific} Laplacian scoring operators; (ii) we motivate our operators using a product of manifolds model; (iii) we develop and implement a differentiable framework for multi-modal UFS; and (iv) we evaluate the merits and limitations of our approach with synthetic and real data and compare it to existing schemes.



%To overcome these challenges, we propose XXX, a computational method that can learn both the shared underlying structure between modalities and the modality-specific structures for single-cell multi-omic data while simultaneously removing noisy features that are irrelevant to these structures. In this prospectus, we will first introduce the preliminaries, related works, and our proposed model mvDUFS in Section \ref{sec:prem_method}, then we will evaluate the performance of mvDUFS on synthetic examples in Section \ref{sec:simulation}. Lastly, XXX.

\section{Problem setting and preliminaries}
\label{sec:prem_method}
We are given two data matrices %sets of $n$ observations $$
%captured by two arrays of sensors. Let  
$\myX \in 
\R^{n\times d},\myY \in \R^{n\times m}$ whose rows contain $n$ observations captured simultaneously in two modalities. The two sets of observations can come from, for example, two arrays of sensors or cameras with different viewing angles. We are interested in processing modalities with bijective correspondences, meaning that there is a registration between the observations in both modalities, so that the $i$-th rows of $\myX$ and $\myY$ describe the same sample.

%from the first  modality and $\myvec{Y} \in \mathbb{R}^{n\times m}$ be the data matrix from the second modality, where $n$ is the number of samples that have one-to-one correspondence between the two modalities, and $d$ and $m$ are the number of features in each modality.
Though the observations are high-dimensional, we assume that a small number of parameters govern the physical processes that underlie the data. These parameters can be continuous, as in a developmental process, or discrete, for example when the observations are separated into distinct clusters.
However, the latent structure in both modalities may not be identical. For example, the two sets of observations may be generated by sets of sensors with different resolutions or sensitivities.
For illustration, consider the observations shown in Fig. \ref{fig:workflow} (left). Both modalities follow a very similar tree structure.
The bottom tree, however, has an additional bifurcation point that does not appear in the upper tree (green points).

Thus, we assume the latent parameters in each modality can be partitioned into two components. The first,  denoted $\myvec{\theta}_s$, captures the structures shared by both modalities. 
The second, denoted $\myvec{\theta}_x$ for modality $\myX$, and $\myvec{\theta}_y$ for modality $\myY$, captures the modality-specific structures that only appear in one set of observations. For example, the additional branch in the bottom tree (modality $\myY$) in Fig. \ref{fig:workflow} is governed by a parameter in $\myvec{\theta}_{y}$. 
Thus, the observations $\myX$ and $\myY$ are nonlinear transformations of $\myvec{\theta}_{s},\myvec{\theta}_{x}$ and $\myvec{\theta}_{s},\myvec{\theta}_{y}$, respectively.
%We further assume that noisy features () such that the underlying structures are distorted or blurred.%For instance, in Fig. \ref{fig:workflow}, a common bifurcated tree is shared by both modality $\myvec{X}$ and $\myvec{Y}$, and we denote this tree as $\theta_s$.%To conclude, we denote $\myvec{X} = f(\theta_{s},\theta_{x})$ and $\myvec{Y} = g(\theta_{s},\theta_{y})$ where $f$ and $g$ are two nonlinear functions specific to each modality.

Many biological data modalities are high-dimensional and contain noisy features, which hinder the discovery of the underlying shared or modality-specific structures. Here, our goal is to identify groups of features associated with the shared structures $\myvec{\theta}_{s}$ (e.g., the groups of features that are smooth with respect to the shared bifurcated tree in Fig. \ref{fig:workflow}) and groups of features associated with the modality-specific structures $\myvec{\theta}_{x}$ and $\myvec{\theta}_{y}$ (e.g., the features that are smooth with respect to the additional branch of modality $\myvec{Y}$ in Fig. \ref{fig:workflow}). To achieve this goal, we compute two graphs, one for each modality. We use a spectral method to uncover the shared and modality-specific structures and apply a feature selection method to detect variables relevant to these structures. To better understand our approach, we first introduce some preliminaries about graph representation in Sec. \ref{sec:graph_th}, and discuss related work on feature selection in Sec. \ref{sec:dufs}.


%lastly introduce our proposed method in Section \ref{sec:mvdufs} for the multi modality settings.

%then introduce two graph operators to uncover the shared and different structures in Section \ref{sec:joint_op}. 

\begin{figure}[htb!] 

    \centering
    
    \includegraphics[width=0.45\textwidth]{Figs/workflow_v4}
    
    \caption{Overview of the goal: discovering features associated with shared and modality-specific latent structures}%
    
    \label{fig:workflow}
\end{figure}


\subsection{The graph Laplacian and Laplacian score}
\label{sec:graph_th}

A common assumption when analyzing high-dimensional datasets is that their latent, underlying structure can be approximated by a low-dimensional manifold \citep{linderman2019fast,peterfreund2020local}. Methods for manifold learning are often based on graphs that capture the affinities between data points. Let $\myvec{x}^{(i)},\myvec{y}^{(i)} $ denote the $i$-th observation in the $\myX$ and $\myY$ modalities, and let $\myvec{K}_x,  \myvec{K}_y$ be, respectively, their affinity matrices, whose elements are computed by the following Gaussian kernel functions,
\begin{align*}
    {(\myvec{K}_{x})}_{i,j} &= \exp{\Big(-\frac{\|\myvec{x}^{(i)} - \myvec{x}^{(j)}\|^{2} }{2\sigma_x^2}\Big)}, \\
    {(\myvec{K}_{y})}_{i,j} &= \exp{\Big(-\frac{\|\myvec{y}^{(i)} - \myvec{y}^{(j)}\|^{2} }{2\sigma_y^2}\Big)},
\end{align*}
where $\sigma_x,\sigma_y$ are user-defined bandwidths that control the decay of each Gaussian kernel. Intuitively, the affinities decay exponentially with the distances between samples, thus capturing the local neighborhood structure in the high-dimensional space.

We compute the normalized Laplacian matrix by $\myvec{L}_{x} = \myvec{D}^{-\frac{1}{2}}_{x}\myvec{K}_{x}\myvec{D}^{-\frac{1}{2}}_{x}$, where $\myvec{D}_{x}$ is a diagonal matrix of row sums of $\myvec{K}_x$. Similarly, $\myvec{L}_y$ is computed for modality $\myvec{Y}$. An important property of the Laplacian matrix is that its eigenvectors corresponding to large eigenvalues reflect the underlying geometry of the data. The Laplacian eigenvectors are used for  many applications, including data embeddings \citep{belkin2003laplacian}, clustering \citep{von2007tutorial}, and feature selection \citep{he2005laplacian}. For the latter, a popular metric for unsupervised identification of informative features is the Laplacian Score (LS) \citep{he2005laplacian}, %computed by
%The LS measures the smoothness of each feature with respect to a graph-based representation of the data. 
%By building on the graph Laplacian's ability to capture the geometry of the data, the LS can be defined as:
\begin{equation}
    \myvec{f}^{T}\myvec{L}_{x}\myvec{f} = \sum_{i=1}^{n}\lambda_{i} (\myvec{f}^T\myvec{u}_i)^2,
    \label{eq:LS}
\end{equation}
where $\myvec{L}_{x} = \sum_{i=1}^{n}\lambda_{i} \myvec{u}_i \myvec{u}_i^T$ is the eigendecomposition of $\myvec{L}_{x}$ and $\myvec{f}$ is the normalized feature vector. Intuitively, when $\myvec{f}$ varies slowly with respect to the underlying structure of $\myvec{L}_{x}$, it will have a significant component projected onto the subspace of its top eigenvectors, and a higher score. %  and the corresponding Laplacian Score will be higher, and these features will be regarded as informative features. 
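
As a brief sanity check on \eqref{eq:LS}, consider the following sketch. It assumes, for illustration only, that $\myvec{f}$ has unit Euclidean norm, and it introduces the coefficients $c_i = \myvec{u}_i^T\myvec{f}$ and the all-ones vector $\mathbf{1}$, which are not notation used elsewhere in the text. Since the eigenvectors form an orthonormal basis, $\sum_{i=1}^{n} c_i^2 = 1$, so the score is a weighted average of the eigenvalues,
\begin{equation*}
    \myvec{f}^{T}\myvec{L}_{x}\myvec{f} = \sum_{i=1}^{n}\lambda_{i} c_i^2, \qquad \min_i \lambda_i \;\le\; \myvec{f}^{T}\myvec{L}_{x}\myvec{f} \;\le\; \max_i \lambda_i .
\end{equation*}
The Gaussian kernel matrix $\myvec{K}_x$ is positive semi-definite, hence so is $\myvec{L}_{x} = \myvec{D}^{-\frac{1}{2}}_{x}\myvec{K}_{x}\myvec{D}^{-\frac{1}{2}}_{x}$, and all $\lambda_i \ge 0$. Moreover, $\myvec{L}_{x}$ is similar to the row-stochastic matrix $\myvec{D}^{-1}_{x}\myvec{K}_{x}$, so its largest eigenvalue equals $1$, attained by the eigenvector $\myvec{D}^{\frac{1}{2}}_{x}\mathbf{1}$ because $\myvec{L}_{x}\myvec{D}^{\frac{1}{2}}_{x}\mathbf{1} = \myvec{D}^{-\frac{1}{2}}_{x}\myvec{K}_{x}\mathbf{1} = \myvec{D}^{\frac{1}{2}}_{x}\mathbf{1}$. Under these assumptions the score lies in $[0,1]$, and it approaches $1$ only when most of the mass $c_i^2$ sits on eigenvectors whose eigenvalues are close to $1$.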



\subsection{Differentiable Unsupervised Feature Selection}
\label{sec:dufs}

%In the previous Section, we proposed operator $\myvec{P}_{\text{joint}}$ to learn the shared structures between modalities, and operator $\myvec{Q}_x$ and $\myvec{Q}_y$ to learn the difference.

 


A key limitation of the Laplacian score stems from the underlying assumption that the Laplacian matrix $\myvec{L}_{x}$ accurately reflects the latent structure of the data. This assumption, however,  may not be valid in the presence of many noisy features. In such cases the top eigenvectors of $\myvec{L}_{x}$ may be heavily influenced by noise and would not capture the underlying structure accurately. A recent work \citep{lindenbaum2021differentiable} addresses this problem by developing Differentiable Unsupervised Feature Selection (DUFS), a framework that estimates the Laplacian matrix while simultaneously selecting informative features using Laplacian scores.
Specifically, DUFS computes a binary vector $\myvec{s} \in \{0,1\}^{d}$ that indicates which features are kept ($s_j = 1$) and which features are not ($s_j = 0$). Let $\Delta(\myvec{s})$ denote a diagonal matrix with $\myvec{s}$ on the diagonal. At each iteration of DUFS, the Laplacian is computed based on 
%is  multiplied by the original data matrix $\myvec{X}$ to get a cleaner data matrix $\myvec{\tilde{X}}$, i.e.,  $\myvec{\tilde{X}} = \myvec{X} \odot \myvec{S}$, where $\odot$ represents the Hadamard product (element-wise multiplication) and $\myvec{S} \in \mathbb{R}^{n\times d}$ contains $n$ copies of $\myvec{s}$. A cleaner Laplacian matrix $\myvec{L}_{\tilde{x}}$ is computed based on the gated input $\myvec{\tilde{X}}$, and then $\myvec{S}$ is further optimized, so on and so forth. This process is automated by optimizing the following loss function via gradient-descent:
$\myvec{\tilde{X}} = \myvec{X}\Delta(\myvec{s})$, while simultaneously updating $\myvec{s}$ by minimizing the following loss function,

\begin{equation}
     \mathcal{L} = -\frac{1}{n} \text{Tr}[\myvec{\tilde{X}}^T \myvec{L}_{\tilde{x}} \myvec{\tilde{X}}] + \lambda \|\myvec{s} \|_0,
    \label{eq:DUFS_loss1}
\end{equation}
where $\text{Tr}[\cdot]$ denotes the matrix trace. The first term is the negative sum of Laplacian Scores across all features, normalized by the number of samples $n$ in a training batch, so that minimizing it favors features with high scores. The second term is an $\ell_0$ regularizer that promotes sparsity in the selected features, with $\lambda$ being a tunable parameter that controls the sparsity level. The output of DUFS is a small list of selected features and the Laplacian matrix $\myvec{L}_{\tilde{x}}$ learned from them. %will be learned, and a small number of features that lead to the highest Laplacian Scores $\myvec{L}_{\tilde{x}}$ are selected by $\myvec{s}$.

\label{sec:stg}
However, the discrete nature of the $\ell_0$ regularizer %the standard discrete indicator vector $\myvec{s} \in \{0,1\}^D$ will make objective in Eq. 
makes the objective in \eqref{eq:DUFS_loss1} non-differentiable, and thus makes finding the optimal vector $\myvec{s}$ intractable. Following \citet{yamada2020feature}, one can relax the $\ell_0$ norm to a probabilistic differentiable counterpart by replacing the binary indicator vector $\myvec{s}$ with a relaxed Bernoulli vector $\myvec{z}$. Specifically, $\myvec{z}$ is a continuous Gaussian reparametrization of the discrete random variables, termed Stochastic Gates.
%which is parameterized by the Bernoulli parameters $\myvec{\pi} \in [0,1]^{d}$. %Although the objective is now differentiable by using REINFORCE \citep{reinforce} or REBAR \citep{tucker2017rebar}, these methods still suffer from high variance and require many Monte Carlo samples. 
%The indicator vector adopted by DUFS is termed Stochastic Gates \citep{yamada2020feature}. Instead of $\myvec{s}$ or $\myvec{\tilde{s}}$, Stochastic Gates learns the indicator vector $\myvec{z}$, which is a continuous Gaussian reparametrization of the discrete random variables.
The gate $z_i$ of each feature $i$ is defined as
\begin{equation}
    z_i = \max(0,\min(1,0.5+ \mu_i + \epsilon_i)), \quad \epsilon_i \sim \mathcal{N}(0 ,\sigma^2),
\label{eq:z_stg}
\end{equation}
where $\mu_i$ is a learnable parameter, and $\sigma$ is fixed throughout training. % As shown in \citep{yamada2020feature}, this reparameterization can reduce the variance of the gradient estimates and has shown improved feature selection performance \citep{lindenbaum2021differentiable}.
Replacing $\myvec{s}$ with $\myvec{z}$, so that $\myvec{\tilde{X}} = \myvec{X}\Delta(\myvec{z})$, the loss function in Eq. \eqref{eq:DUFS_loss1} becomes the final DUFS objective:
\begin{equation}
    \mathcal{L} = -\frac{1}{n} \text{Tr}[\myvec{\tilde{X}}^T \myvec{L}_{\tilde{x}} \myvec{\tilde{X}}] + \lambda \|\myvec{z} \|_0.
    \label{eq:DUFS_loss2}
\end{equation}
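To make the regularizer in \eqref{eq:DUFS_loss2} concrete, the following sketch assumes that, because $\myvec{z}$ is random, the penalty is evaluated in expectation, in the spirit of the stochastic-gates construction of \citet{yamada2020feature}. The symbol $\Phi$, denoting the standard normal cumulative distribution function, is introduced here for illustration. Using the definition of $z_i$ in \eqref{eq:z_stg}, a gate is nonzero exactly when $0.5 + \mu_i + \epsilon_i > 0$, so
\begin{equation*}
    \exval \|\myvec{z}\|_0 = \sum_{i=1}^{d} \prob(z_i > 0) = \sum_{i=1}^{d} \prob(\epsilon_i > -\mu_i - 0.5) = \sum_{i=1}^{d} \Phi\Big(\frac{\mu_i + 0.5}{\sigma}\Big).
\end{equation*}
This expression is smooth in the learnable parameters $\mu_i$, so under this assumption the full objective can be minimized with gradient-based methods: pushing $\mu_i$ toward large negative values drives the corresponding term toward zero and effectively closes the gate of feature $i$.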


\section{Method}
\label{sec:method}
We now derive our approach for unsupervised feature selection in multi-modal settings. 
Our method is designed to capture two types of features: (i) features associated with latent structures that are \textit{shared} between the two modalities, and (ii) features associated with \textit{differential latent structures} that appear in only one modality.
 In Sec. \ref{sec:joint_op} and \ref{sec:diff_op}, we derive two operators designed to capture shared and differential structures, respectively. 
To motivate our approach and illustrate the difference between shared and differential structures, we specifically address two examples: (i) shared and differential clusters and (ii) product of manifolds. We use the proposed operators in Sec. \ref{sec:mvdufs} to derive mmDUFS. 

%Our algorithm generalizes DUFS by introducing two new 
%with the new joint and differential operators, where the joint operator is designed to capture the common structures and the differential operator is designed to capture the modality-specific structures.  We  define these operators in the next section.  

%Section \ref{sec:dufs} and \ref{sec:stg} introduced a single-modal feature selection framework, DUFS, which learns the underlying graph by removing noisy features quantified using Laplacian Scores. 

\subsection{The shared structure operator}
\label{sec:joint_op}
%\everypar{\looseness=-2}
%In this section, we develop an operator that captures latent structures that appear in two modalities. 
To motivate our approach,  %method to capture features associated with latent structures that appear in two modalities, denoted $\myvec{X}$ and $\myvec{Y}$.    
let us consider the artificial example illustrated in Fig. \ref{fig:vis_joint_op}. %The left panel shows the observations for the two modalities. 
The lower figure in the left panel shows the observations in modality $\myY$, which contains samples from a mixture of three distinct Gaussians. The upper figure shows modality $\myX$, where one of the three clusters is partitioned again into three (less distinct) clusters.% Let $\myvec{L}_y, \myvec{L}_x$ denote the Laplacian of modalities $\myY, \myX$, respectively. 
%We denote by 
%, and let $V_y,V_x$ be two matrices that contain, as columns, their leading eigenvectors. 
\begin{figure*}[!htb] %
    \centering
    
    \includegraphics[width=0.8\textwidth]{Figs/illustration_v1}
    
    \caption{Visualization of the eigenvectors and the affinity matrix  of the proposed operators on an artificial cluster example. Left: Visualization of the clusters. Middle: Leading eigenvectors of $\myvec{L}_x$ and $\myvec{L}_y$.  Right: Affinity matrices of the proposed shared graph operator (top) and the differential graph operator (bottom) with/without the presence of noisy features.}
    %from the operators with/without noisy features.}%
    \label{fig:vis_joint_op}
\end{figure*}

It is instructive to study the \textit{ideal setting} where we make the following assumptions: (i) The largest distance between two nodes within a cluster, denoted $d_{\text{within}}$, is much smaller than the
smallest distance between pairs of nodes in two different clusters, denoted $d_{\text{between}}$. (ii) The bandwidths $\sigma_x,\sigma_y$ are chosen such that $d_{\text{within}} \ll \sigma_x,\sigma_y \ll d_{\text{between}}$. In this setting, the three Gaussians constitute three main clusters, with no connections between pairs of nodes of different clusters and similar weights between pairs of nodes within clusters.
Thus, the leading eigenvectors of $\myvec{L}_y$ span the subspace of the three \textit{indicator vectors}, that is, vectors that contain the square root of the degree of each node within a cluster and zeros outside the cluster; see \citet{von2007tutorial} and the illustration in
Fig. \ref{fig:vis_joint_op}. The matrix $\myvec{L}_x$ has two extra significant eigenvectors that span the separation of the third cluster, which appears only in $\myX$.
We denote by $\myvec{V}_s$ a matrix that contains the indicator vectors of the three partitions that appear in $\myX$ and $\myY$ and by $\myvec{V}_x$ a matrix that contains the partitions that appear only in $\myX$.
%We denote by $V_s$ a matrix that contains, as columns, orthogonal vectors that span the subspace that captures the separation of the data into the main three clusters, which appear both in $X$ and $Y$. In addition, we denote by $V_x$ a matrix that contains as columns the separation of the third cluster that appears only in modality $X$. Note that the columns of $V_x$ are orthogonal to the ones of $V_s$. 
Since there is no modality-specific structure in modality $\myY$, in our ideal setting the two Laplacian matrices $\myvec{L}_x,\myvec{L}_y$ can be approximated by
\begin{equation}
    \myvec{L}_x \approx \myvec{V}_s  \myvec{V}_{s}^{T} +  \myvec{V}_x  \myvec{V}_{x}^{T}, \qquad
    \myvec{L}_y \approx \myvec{V}_s  \myvec{V}_{s}^{T}. \label{eq:Laplacian_rewrite}
\end{equation}

%[$\myvec{V}_{s}$, $\myvec{V}_{x}$] be the span of the eigenvectors in $\myvec{X}$, [$\myvec{V}_{s}$, $\myvec{V}_{y}$] as the span of the eigenvectors in $\myvec{Y}$, where $\myvec{V}_{s} \in \mathbb{R}^{n\times l_s}$ are the eigenvectors that capture the shared structure between $\myvec{X}$ and $\myvec{Y}$, and $\myvec{V}_{x} \in \mathbb{R}^{n\times l_x}$ and $\myvec{V}_{y} \in \mathbb{R}^{n\times l_y}$ are the eigenvectors that capture the modality-specfic structures in each modality. We assume that the modality-specific structures are orthogonal, i.e., $\myvec{V}_{x}^T\myvec{V}_{y} = \myvec{0} \in \mathbb{R}^{l_x\times l_y} $. 
To capture \textit{shared} latent structures we compute the operator $\myvec{P}_{\text{shared}}$,
\begin{equation}
    \myvec{P}_{\text{shared}} = \myvec{L}_x \myvec{L}_y + \myvec{L}_y \myvec{L}_x.
    \label{eq:composite_op}
\end{equation}
%In Fig. \ref{fig:vis_joint_op} we simulate a synthetic cluster example where there is a  tri-cluster structure shared between modalities. Additionally, the red cluster in modality 2 can be further split into three small clusters in modality 1. We can visualize the eigenvectors 
For the cluster setting, the orthogonality between the matrices $\myvec{V}_s,\myvec{V}_x$ implies $
     \myvec{P}_{\text{shared}} \approx 2\myvec{V}_s \myvec{V}_{s}^{T}.
$
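To spell out this step, assume, as a sketch consistent with the ideal setting, that each indicator vector is normalized to unit norm, so that $\myvec{V}_{s}^{T}\myvec{V}_s = \myvec{I}$ and $\myvec{V}_{x}^{T}\myvec{V}_s = \myvec{0}$. Substituting \eqref{eq:Laplacian_rewrite} into the first term of \eqref{eq:composite_op} gives
\begin{align*}
    \myvec{L}_x \myvec{L}_y &\approx \big(\myvec{V}_s \myvec{V}_{s}^{T} + \myvec{V}_x \myvec{V}_{x}^{T}\big)\myvec{V}_s \myvec{V}_{s}^{T} \\
    &= \myvec{V}_s \big(\myvec{V}_{s}^{T}\myvec{V}_s\big) \myvec{V}_{s}^{T} + \myvec{V}_x \big(\myvec{V}_{x}^{T}\myvec{V}_s\big) \myvec{V}_{s}^{T} = \myvec{V}_s \myvec{V}_{s}^{T},
\end{align*}
and transposing both sides yields the same approximation for $\myvec{L}_y \myvec{L}_x$. The two terms therefore add up to $2\myvec{V}_s \myvec{V}_{s}^{T}$, while the modality-specific component $\myvec{V}_x \myvec{V}_{x}^{T}$ is annihilated by the product.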
%In this case, the product discards any eigenvectors indicative of structures that are modality specific, and we remain only with structures shared between the modalities. 
%To illustrate that $\myvec{P}_{\text{joint}}$ can capture the shared structure, we simulate a synthetic cluster example as shown in Fig. \ref{fig:vis_joint_op}, where there is a tri-cluster structure shared between modalities (blue, green, and the red). Additionally, the red cluster in modality $\myvec{Y}$ can be further split into three smaller clusters (orange, yellow, and purple) in modality $\myvec{X}$. 
%We compute $\myvec{L}_{x}$ and $\myvec{L}_{y}$ and eigendecompose each Laplacian to get the top eigenvectors, which are plotted 
%in the middle of Fig. \ref{fig:vis_joint_op}, and the values inside each eigenvector are colored by their cluster labels. We can see that 
%modality $\myvec{X}$ has $4$ significant eigenvectors where the first $2$ eigenvectors ($\myvec{V}_s$) capture the shared tri-cluster structure and the next $2$ eigenvectors ($\myvec{V}_x$) capture the modality-specific small clusters, whereas modality $\myvec{Y}$ has $2$ significant eigenvectors ($\myvec{V}_s$) that
%capture the shared tri-cluster structure. As a result, we can see that the shared tri-cluster structure is captured by the constructed joint operator as shown in the affinity matrix on the right of Fig. \ref{fig:vis_joint_op}.
%and the first multiplication in Eq. \eqref{eq:composite_op} project the structure of $\myvec{Y}$ onto the subspace of the leading eigenvectors of $\myvec{X}$ with large component. 
%The second multiplication will do the projection in the opposite direction and make this operator $\myvec{P}_{\text{joint}}$ symmetric, positive semi-definite (eigenvalues of this operator will be real and non-negative). 
Thus, the symmetric product of the two Laplacians captures clusters that appear in both modalities while removing modality-specific clusters; see the right panel of Fig. \ref{fig:vis_joint_op}.
We note that related multimodal operators were previously proposed for computing low-dimensional representations \citep{lindenbaum2020multi,shnitzer2019recovering}. Here, we combine our operator with DUFS to develop a multi-modal feature selection pipeline. 
Next, we illustrate the usefulness of the shared operator in the product-of-manifolds setting.  %In Section \eqref{sec:mvdufs}.%, we incorporate Stochastic Gates into this operator to develop a framework for unsupervised multi-modal feature selection. %develop ase the operator 

%\subsection{Product of manifolds.} 

\paragraph{Product of manifolds.}
Let $\M_a,\M_b$ and $\M_s$ be three
low-dimensional manifolds embedded in high-dimensional spaces. We assume that each manifold is a smooth transformation of a set of latent variables, denoted respectively by $\myvec{\theta}_a,\myvec{\theta}_b$ and $\myvec{\theta}_s$.
Consider the case where modalities $\myX$ and $\myY$ contain observations from the products
$\M_x$ and $\M_y$, respectively, where
%We assume that both our modalities $X,Y$ are observations generated uniformally at random over the product of manifolds $\M_X$ and $\M_Y$, 
\[
\M_y = \M_s \times \M_a, \qquad \M_x = \M_s \times \M_b.
\]
Note that the dependence on $\M_s$ is shared between $\M_x, \M_y$, while the dependence on $\M_a,\M_b$ is modality-specific.  %Here, we provide the details relevant for  
In a product $\M_x = \M_s \times \M_b$, every point $\myvec{x} \in \M_x$ is associated with two points $\myvec{x}_s \in \M_s$ and $\myvec{x}_b \in \M_b$. We define the projection operators $\pi^x_b(\myvec{x}),\pi^x_s(\myvec{x})$ that map a point $\myvec{x}$ in $\M_x$ to points in $\M_b,\M_s$, respectively. With the projection operators, one can extend a function $f^b: \M_b \to \R$ to a function over the product $f^x: \M_x \to \R$ by $f^x(\myvec{x}) = f^b( \pi^x_b(\myvec{x}))$.

An important property of a product $\M_x$ is that the eigenfunctions $f_{l,m}^x$ of its Laplace--Beltrami operator are equal to the pointwise products of the eigenfunctions of $\M_s$ and $\M_b$, extended to $\M_x$: %(see Theorem 2 and further details in \citep{zhang2021product}). 
\begin{equation}\label{eq:eigenfunctions_product}
f^x_{l,m}(\myvec{x}) = f^s_l (\pi_s^x(\myvec{x})) \cdot f^b_m (\pi_b^x(\myvec{x})). %\qquad f^y_{m,n} = f^b_m \circ f^s_m.
\end{equation}
We refer to \citep{zhang2021product} for a detailed description of the properties of products of manifolds.
A simple example of a product of manifolds is a 2D rectangle $(\theta_s,\theta_b) \in [0,l_s] \times [0,l_b]$. 
Here, the projection $\pi_s^x$ yields the first coordinate, while $\pi_b^x$ yields the second. 
The eigenfunctions of the product with Neumann boundary conditions are given by
\begin{equation}\label{eq:rectangle}
f_{l,m}(\theta_s,\theta_b) = \cos(\pi l \theta_s/l_s) \cos(\pi m \theta_b/l_b). 
\end{equation}
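As a quick sanity check (a standard calculation, included here only as an illustration), Eq. \eqref{eq:rectangle} can be verified directly. Writing $\Delta = \partial^2/\partial\theta_s^2 + \partial^2/\partial\theta_b^2$ for the Laplacian on the rectangle, differentiating each cosine twice gives
\[
-\Delta f_{l,m} = \pi^2 \left( \frac{l^2}{l_s^2} + \frac{m^2}{l_b^2} \right) f_{l,m}, \qquad l,m = 0,1,2,\dots
\]
The Neumann condition holds because $\partial f_{l,m}/\partial\theta_s \propto \sin(\pi l \theta_s/l_s)$ vanishes at $\theta_s \in \{0,l_s\}$, and similarly $\partial f_{l,m}/\partial\theta_b$ vanishes at $\theta_b \in \{0,l_b\}$. In particular, the eigenfunctions with $m=0$ depend only on the first coordinate $\theta_s$.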

\paragraph{Observations generated uniformly at random over the product of manifolds.}
Here, we assume that the observations in the two modalities are independent samples drawn uniformly at random from $\M_x,\M_y$. Let $\myvec{\phi}_{l,m}^x(\myvec{x}_i),\myvec{\phi}_{l,k}^y(\myvec{y}_i)$ denote the eigenvectors of $\myvec{L}_x,\myvec{L}_y$ evaluated at $\myvec{x}_i,\myvec{y}_i$, respectively. 
In the asymptotic regime where the number of points $n \to \infty$, the eigenvectors converge to the eigenfunctions characterized in Eq. \eqref{eq:eigenfunctions_product}:
\begin{align}
\myvec{\phi}_{l,m}^x(\myvec{x}_i) &= \myvec{\phi}^s_l( \pi_s^x(\myvec{x}_i)) \myvec{\phi}^b_m(\pi_b^x(\myvec{x}_i)) \notag \\
\myvec{\phi}_{l,k}^y(\myvec{y}_i) &= \myvec{\phi}^s_l(\pi_s^y(\myvec{y}_i)) \myvec{\phi}^a_k(\pi_a^y(\myvec{y}_i)). 
\end{align}

Details about the definition and rate of convergence can be found, for example, in \citep{cheng2022eigen,garcia2020error} and references therein.
%and can thus be expressed as the following pointwise product
It is instructive to consider the ideal case where, due to their dependence on the independent projections $\pi^x_b$ and $\pi^y_a$, the eigenvectors $\myvec{\phi}_{l,m}^x,\myvec{\phi}_{l,k}^y$ satisfy the following 
orthogonality property:
\begin{equation}
(\myvec{\phi}^x_{l,m})^T \myvec{\phi}^y_{l',k}  
= 
\begin{cases}
1 & l = l',\ m = k = 0, \\
0 & \text{otherwise.}
\end{cases}
\end{equation}
It follows that the operator $\myvec{P}_{\text{shared}}$ is equal to,
\begin{equation}\label{eq:shared_product}
\myvec{P}_{\text{shared}} = \myvec{L}_x \myvec{L}_y +\myvec{L}_y \myvec{L}_x = \sum_l (\myvec{\phi}^s_l \otimes \myvec{\phi}^a_0) (\myvec{\phi}^s_l \otimes \myvec{\phi}^b_0)^T,
\end{equation}
where $\otimes$ denotes the Hadamard product. 
The vectors $\myvec{\phi}^a_0,\myvec{\phi}^b_0$ constitute the degree of the different observations and have little effect on the outcome. Thus, the leading eigenvectors of $\myvec{P}_{\text{shared}}$ are associated with the shared component and not the differential components in the product of manifolds. Below, we illustrate this phenomenon with two examples. 


%\paragraph{The shared graph operator of products of manifolds.}
%Consider the shared graph operator in Eq. \eqref{eq:composite_op}, where $\myvec{L}_x,\myvec{L}_y$ are Laplacian matrices of points generated by the product of manifolds as in Eq. \eqref{eq:eigenvectors_product}.  

%\begin{align}
    %\myvec{P}_{\text{shared}} &= \myvec{L}_x \myvec{L}_y +\myvec{L}_y \myvec{L}_x \notag \\ 
    %&= \sum_{nm}\sum_{n'm'} \lambda_{mn} \lambda^{'}_{m'n'} 
    %\eta_{nm}^{n'm'}
    %(v^S_m  \circ v^A_n) (v^S_{m'}  \circ v^A_{n'})^T,
%\end{align}
%where
%\[
%\eta_{nm}^{n'm'} = \langle v^S_m  \circ v^A_n, v^S_{m'}  \circ v^A_{n'} \rangle. 
%\]
%Since the latent parameters $\theta_A$ and $\theta_B$ that underlay the manifolds $\M_A$ and $\M_B$ are independent, then
%\[
%\eta_{nm}^{n'm'} 
%= 
%\begin{cases}
%1 & m = m', n = n' = 0, \\
%0 & o.w.
%\end{cases}
%\]
%It follows that the shared operator converges to the following sum,
%\[
%\myvec{P}_{\text{shared}} = \sum_m \lambda_{m,0} \lambda_{m,0} v_m^S v_m^S.
%\]
%Thus, the shared operator contains only the eigenvectors of the shared component, and not the differential components of the manifold products. 

\paragraph{Example 1: points in a 3D cube.}
Consider points generated uniformly at random over a 3D cube of dimensions $[0,l_s] \times [0,l_a] \times [0,l_b]$. Let $\myY \in \R^{n \times 2}$ constitute the first two coordinates of $n$ independent observations, and let $\myX$ constitute the first and third coordinates. 
This is a simple case of a product of manifolds, where the shared variable $\theta_s$ is the first coordinate, while the modality-specific variables $\theta_a,\theta_b$ are the second and third coordinates.
Following Eq. \eqref{eq:rectangle}, the eigenvectors of the graph Laplacian matrices $\myvec{L}_x, \myvec{L}_y$, evaluated at  $(\theta_s,\theta_b)$ and $(\theta_s,\theta_a)$ converge to,
\begin{align}
&\phi_{lm}^x(\theta_s,\theta_b) = \cos(\pi l \theta_s/l_s)\cos(\pi m \theta_b/l_b) \notag \\
&\phi_{lk}^y(\theta_s,\theta_a) = \cos(\pi l \theta_s/l_s)\cos(\pi k \theta_a/l_a).
\end{align}
The first row of Fig. 1 (Appendix A) shows a scatter plot of the points in $\myvec{X}$ (located according to the first two coordinates), colored by the values of the leading eigenvectors of $\myvec{L}_x$. %Similarly, the second row shows a scatter points of the observations in $Y$ colored by the eigenvectors of $L_y$. 
The second row shows the points in $\myvec{X}$, but colored by the eigenvectors of $\myvec{P}_{\text{shared}}$. As expected, all the eigenvectors of $\myvec{P}_{\text{shared}}$ are functions of the shared coordinate $\theta_s$. 
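To see why, a short calculation is instructive. It is a sketch in the large-sample limit, where empirical inner products are replaced by expectations over independent, uniformly distributed $\theta_s,\theta_a,\theta_b$, and normalization constants are ignored. Under these assumptions, the inner product between the two families of eigenvectors factorizes as
\[
\mathbb{E}\big[\phi^x_{lm}\,\phi^y_{l'k}\big]
= \mathbb{E}\big[\cos(\pi l \theta_s/l_s)\cos(\pi l' \theta_s/l_s)\big]\,
\mathbb{E}\big[\cos(\pi m \theta_b/l_b)\big]\,
\mathbb{E}\big[\cos(\pi k \theta_a/l_a)\big].
\]
Since $\frac{1}{l_b}\int_0^{l_b} \cos(\pi m \theta/l_b)\, d\theta = 0$ for every integer $m \geq 1$ (and likewise for $k \geq 1$), the product vanishes unless $m = k = 0$, and the first factor vanishes unless $l = l'$. The surviving pairs, $\phi^x_{l0} = \phi^y_{l0} = \cos(\pi l \theta_s/l_s)$, depend only on $\theta_s$, in agreement with the orthogonality property stated above.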

\paragraph{Example 2: videos taken from different angles.}
Our second example is based on an experiment from \citep{lederman2014common}, where the two modalities constitute two videos of three dolls rotating at different angular speeds. The first camera (modality $\myX$) captures the middle and left dolls, while the second camera (modality $\myY$) captures the middle and right dolls (see Fig. \ref{fig:video_samples}). Here, the shared variable $\myvec{\theta}_s$ is the angle of the middle doll captured by both modalities. The modality-specific variables $\myvec{\theta}_b,\myvec{\theta}_a$ are the angles of the left and right dolls, respectively, each captured by only one modality.

%Let $\theta^l_i,\theta^m_i,\theta^r_i \in [0,2\pi]$ denote the angles of the three dolls at time $t_i$. 
%The first video denoted $\myX$, captures the left and middle, while the right video, $\myY$, captures the middle and right dolls. Thus, video $\myX$ can be viewed as a transformation over $\theta^l,\theta^m$, while the right video is a transformation over $\theta^m,\theta^r$.
%

To illustrate Eq. \eqref{eq:shared_product} in this example, we first compute an approximation of the eigenvectors $\myvec{\phi}_l^s$. To that end, we crop each image in one of the videos such that only the middle doll (which appears in both modalities) is shown. One may think of this operation as a projection onto the shared manifold. Next, we compute the leading eigenvectors $\myvec{\phi}^s_l$ of the Laplacian matrix built from the cropped images. 
Fig. 2 (Appendix A) shows the
leading three eigenvectors of $\myvec{P}_{\text{shared}}$ as a function of 
$\myvec{\phi}^s_1,\myvec{\phi}^s_2,\myvec{\phi}^s_3$ computed from the cropped images. The figure shows a linear dependency between the vectors, which indicates that the shared operator retains only the shared component of the two modalities.

\subsection{The Differential Graph Operators}\label{sec:diff_op}


We design two operators, $\myvec{Q}_{x}$ and $\myvec{Q}_{y}$, to infer latent structures that are \textit{modality-specific} to $\myX$ and $\myY$, respectively:
\begin{equation}
    \myvec{Q}_{x} = \tilde{\myvec{L}}_y^{-1} \myvec{L}_x  \tilde{\myvec{L}}_y ^{-1},
\qquad
%\begin{equation}
    \myvec{Q}_{y} = \tilde{\myvec{L}}_x^{-1} \myvec{L}_y  \tilde{\myvec{L}}_x^{-1} 
    \label{eq:diff_op_y},
\end{equation}
where $\tilde{\myvec{L}}_x = \myvec{L}_x + c\myvec{I}$, $\tilde{\myvec{L}}_y = \myvec{L}_y + c\myvec{I}$, and $c > 0$ is a regularization constant. To motivate these operators, we revisit the cluster example used for the shared operator.

\paragraph{Differential clusters.}
%Here, we address the cluster example visualized in Fig. \ref{fig:vis_joint_op}. 
In the synthetic cluster example in Fig. \ref{fig:vis_joint_op}, modality $\myX$ has three smaller clusters not observed in modality $\myY$.
We show that one can detect the \textit{differential clusters} of modality $\myX$ via the leading eigenvectors of $\myvec{Q}_x$.
By Eq. \eqref{eq:Laplacian_rewrite}, we can approximate $\tilde{\myvec{L}}_y$ via,
\begin{align}
        \tilde{\myvec{L}}_y &= (1+c)\myvec{V}_s \myvec{V}_{s}^{T}  + c\myvec{V}_{\text{comp}}\myvec{V}_{\text{comp}}^{T},  \label{eq:Laplacian_tilde_rewrite}
\end{align}

where $\myvec{V}_{\text{comp}} \in \mathbb{R}^{n\times (n-3)}$ contains, as columns, vectors that span the complementary subspace to $\myvec{V}_{s}$. We write $\myvec{Q}_x$ as:
\begin{equation}
    \myvec{Q}_x = \tilde{\myvec{L}}_y^{-1} \myvec{L}_x  \tilde{\myvec{L}}_y ^{-1}  =  
    c^{-2} \myvec{V}_x  \myvec{V}_x^T +  (1+c)^{-2}\myvec{V}_s \myvec{V}_s^{T}.   %\notag\\
    %&= \myvec{V}_s \myvec{\Delta}_{\Tilde{sx}} \myvec{V}_s^{T} + \myvec{V}_x \myvec{\Delta}_{\Tilde{x}} \myvec{V}_x^T %\\ 
   %\myvec{Q}_y &= \tilde{\myvec{L}}_x^{-1} \myvec{L}_y  \tilde{\myvec{L}}_x ^{-1} \notag\\
   %&= \myvec{V}_s  (\myvec{\Delta}_{sx} +c\myvec{I})^{-1} \myvec{\Delta}_{sy} (\myvec{\Delta}_{sx}+c\myvec{I})^{-1} \myvec{V}_s^{T} +  \myvec{V}_y \frac{\myvec{\Delta}_y}{c^2} \myvec{V}_y^T \notag\\
   %&= \myvec{V}_s \myvec{\Delta}_{\Tilde{sy}} \myvec{V}_s^{T} + \myvec{V}_y \myvec{\Delta}_{\Tilde{y}} \myvec{V}_y^T                
   \label{eq:Qxy_rewrite}
\end{equation}

%where $\myvec{\Delta}_{\Tilde{sx}} = (\myvec{\Delta}_{sy} +c\myvec{I})^{-1} \myvec{\Delta}_{sx} (\myvec{\Delta}_{sy}+c\myvec{I})^{-1}$,  $\myvec{\Delta}_{\Tilde{sy}} = (\myvec{\Delta}_{sx} +c\myvec{I})^{-1} \myvec{\Delta}_{sy} (\myvec{\Delta}_{sx}+c\myvec{I})^{-1} $, $\myvec{\Delta}_{\Tilde{x}} = \frac{\myvec{\Delta}_x}{c^2} $, and $\myvec{\Delta}_{\Tilde{y}} = \frac{\myvec{\Delta}_y}{c^2}$. 

The differential operator in Eq. \eqref{eq:Qxy_rewrite} has two terms. The first spans the subspace corresponding to the differential structure $\myvec{V}_x$, while the second spans the subspace of the shared structure $\myvec{V}_s$. %A reasonable choice of the regularization constant $c$ would satisfy that $c$ is much smaller than the smallest eigenvalue if $\Delta_{sy},\Delta_x$, which implies that 
%\[
%\Delta_x/c^2 \succ (\myvec{\Delta}_{sy} +c\myvec{I})^{-2} \myvec{\Delta}_{sx},
%\]
%where the notation $A \succ B$ implies that the matrix $A-B$ is positive definite. 
Since $c^{-2} > (1+c)^{-2}$, it follows that the leading eigenvectors of $\myvec{Q}_x$ span the subspace of $\myvec{V}_x$.
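
For completeness, the steps behind Eq. \eqref{eq:Qxy_rewrite} can be sketched as follows. The sketch assumes the idealized forms $\myvec{L}_y \approx \myvec{V}_s\myvec{V}_s^T$ and $\myvec{L}_x \approx \myvec{V}_s\myvec{V}_s^T + \myvec{V}_x\myvec{V}_x^T$, with orthonormal columns and $\myvec{V}_s^T\myvec{V}_x = \myvec{0}$ (our reading of Eq. \eqref{eq:Laplacian_rewrite}). Inverting Eq. \eqref{eq:Laplacian_tilde_rewrite} on each of the two orthogonal subspaces gives
\[
\tilde{\myvec{L}}_y^{-1} = (1+c)^{-1}\myvec{V}_s \myvec{V}_{s}^{T} + c^{-1}\myvec{V}_{\text{comp}}\myvec{V}_{\text{comp}}^{T}.
\]
Because the columns of $\myvec{V}_x$ are orthogonal to $\myvec{V}_s$, they lie in the span of $\myvec{V}_{\text{comp}}$, so that
\[
\tilde{\myvec{L}}_y^{-1}\myvec{V}_s = (1+c)^{-1}\myvec{V}_s, \qquad \tilde{\myvec{L}}_y^{-1}\myvec{V}_x = c^{-1}\myvec{V}_x.
\]
Substituting these two relations into $\tilde{\myvec{L}}_y^{-1} \myvec{L}_x \tilde{\myvec{L}}_y^{-1}$ yields the two terms of Eq. \eqref{eq:Qxy_rewrite}. As a numerical illustration, for $c = 0.1$ the differential term is weighted by $c^{-2} = 100$, whereas the shared term is weighted by $(1+c)^{-2} \approx 0.83$.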
%, where the first term spans the subspace of the shared structures ($\myvec{V}_s$) and the second term spans the subspace of the modality-specific structures ($\myvec{V}_x$ and $\myvec{V}_y$). Intuitively, to let the operator emphasize the modality-specific structures, the eigenvalues of the second term need to be much larger than the eigenvalues of the first term.
%To see this, we take $\myvec{Q}_x$ as an example and define $\myvec{P}_x =  \myvec{\Delta}_{\Tilde{sx}}^{-1} \myvec{\Delta}_{\Tilde{x}} $ as the balance between the two sets of eigenvalues. For the simplest case where each subspace is rank 1, it can be reduced to:
%\begin{align}
%    \rho_x &= \frac{\sigma_{\Tilde{x}}}{\sigma_{\Tilde{sx}}} = \frac{\sigma_x}{\sigma_{sx}} \frac{(\sigma_{sy} + c)^2}{ c^2} = a(\frac{\sigma_{sy}}{c} + 1)^2 \label{eq:rho}
%\end{align}
%where $a =\frac{\sigma_x}{\sigma_{sx}}$. Assuming $a$ is dataset-dependent constant, we can see that when $c << \sigma_{sy}$, $\rho_x$ will be large, therefore emphasizing the modality-specific structures.

 
%The intuition behind these two operators is the following: Let us denote $\myvec{K}_x = \sum_{i=1}^{n}(\lambda_{x})_i ({\myvec{u}_x})_i ({\myvec{u}_x})_i^T$ and $\myvec{K}_y = \sum_{i=1}^{n}(\lambda_{y})_i ({\myvec{u}_y})_i ({\myvec{u}_y})_i^T$ as the eigendecompositions of each affinity matrix.  The inverses of each affinity matrix can then be expressed as  $\myvec{K}_x^{-1} = \sum_{i=1}^{n}(\frac{1}{\lambda_{x}})_i ({\myvec{u}_x})_i ({\myvec{u}_x})_i^T$ and $\myvec{K}_y^{-1} = \sum_{i=1}^{n}(\frac{1}{\lambda_{y}})_i ({\myvec{u}_y})_i ({\myvec{u}_y})_i^T$. By taking the inverse, the directions of eigenvectors that correspond to large eigenvalues (i.e., directions that are on the subspace of the major structures) now correspond to small eigenvalues, and the directions that correspond to small eigenvalues (i.e., directions that are on the subspace complementary to the subspace of the major structures) now correspond to large eigenvalues. What $\myvec{K}_y^{-1}\myvec{K}_x\myvec{K}_y^{-1}$ effectively does is that it projects the major structures in $\myvec{X}$ onto a subspace that is complementary to the subspace that has major structures of $\myvec{Y}$, therefore captures $\myvec{X}$-specific structures. The same applies to $\myvec{K}_x^{-1}\myvec{K}_y\myvec{K}_x^{-1}$ for the opposite operation.

%To avoid the original small eigenvalues in $\myvec{K}_x$ and $\myvec{K}_y$ exploding after taking the inverse, we add a constant term $c\myvec{I}$ to bound the eigenvalues such that $\text{min}(\Lambda) \geq c$ where $\Lambda$ are the eigenvalues of $\myvec{K}_x + c\myvec{I}$ or $\myvec{K}_y + c\myvec{I}$.





In theory, we can directly apply these operators to learn the structures. However, in many real-world applications, e.g., single-cell multi-omic technologies, both $\myvec{X}$ and $\myvec{Y}$ can be very noisy. In particular, abundant noisy features (e.g., genes) might dominate the data, so that the top eigenvectors of $\myvec{L}_x$ and $\myvec{L}_y$ no longer capture the underlying structure, which in turn degrades $\myvec{P}_{\text{shared}}$, $\myvec{Q}_x$, and $\myvec{Q}_y$. As shown in the affinity matrices on the right of Fig. \ref{fig:vis_joint_op}, the structures are less clear when many noisy features are present. It is therefore necessary to have a feature selection framework that can effectively remove these noisy features in our multi-modal setting. In the next section, we show how to incorporate the aforementioned DUFS framework into our proposed operators.

\subsection{mmDUFS}
\label{sec:mvdufs}

In this section, we describe our framework, termed multi-modal Differential Unsupervised Feature Selection (mmDUFS). We incorporate differentiable gates \citep{lindenbaum2021differentiable}
with loss functions based on the shared and differential operators, detailed in Secs. \ref{sec:joint_op} and \ref{sec:diff_op}. %This allows us to define an unsupervised feature selection approach for multi-modal data. 
Our goal is to compute an accurate shared graph operator ($\myvec{P_{\text{shared}}}$ in Eq. \eqref{eq:composite_op}) %that gives us the shared structure, 
and differential graph operators ($\myvec{Q}_x$ and $\myvec{Q}_y$ in Eq. \eqref{eq:diff_op_y}) %that gives us the modality specific structures, 
while simultaneously selecting the informative features.
%As our 
%To detect features that are smooth with respect to structures shared by the two modalities, we combine the definition of Laplacian score in Eq. \ref{eq:LS} with the shared operator, to design the shared loss function $
%\ell_{\text{shared}}(\myvec{f}) = \myvec{f} P_{\text{shared}} \myvec{f}. $
Let $\myvec{f}_x,\myvec{f}_y$ denote feature vectors in $\myX,\myY$, respectively. 
To quantify how noisy or informative the features are with respect to the shared structure, we replace the Laplacian $\myvec{L}$ in Eq. \eqref{eq:LS} with $\myvec{P_{\text{shared}}}$, which yields the shared scores
%\[
%For the shared structures this yields
%More specifically, we propose to use 
%\text{LS}_{P_{\text{shared},x}} = 
%\ell(P_{\text{shared}},\myvec{f}_x)=
$\myvec{f}_x^T \myvec{P_{\text{shared}}} \myvec{f}_x $ and
%\text{LS}_{P_{\text{shared},y}} =
%\ell(P_{\text{shared}},\myvec{f}_y) =
$\myvec{f}_y^T \myvec{P_{\text{shared}}} \myvec{f}_y$.
%\]
%to quantify the smoothness of feature $\myvec{f}_x$ of modality $\myX$ and $\myvec{f}_y$ of modality $\myY$ with respect to the shared graph operator $\myvec{P_{\text{shared}}}$. 
%Similarly, we %define the $X$ and $Y$ modality-specific loss via,
%$
%\ell^x(\myvec{f})  = \myvec{f}^T\myvec{Q}_x\myvec{f}$ and $  
%\ell^y(\myvec{f}) = \myvec{f}^T\myvec{Q}_y\myvec{f}.
%$
Similarly, 
$\myvec{f}_x^T\myvec{Q}_x\myvec{f}_x$ and $\myvec{f}_y^T\myvec{Q}_y\myvec{f}_y$ quantify the smoothness of these features with respect to the differential graph operators $\myvec{Q}_x$ and $\myvec{Q}_y$.
The rationale behind these generalized Laplacian Scores is similar to the original score. For instance, let $\myvec{P_{\text{shared}}} = \sum_{i=1}^{n}\lambda_{i} \myvec{u}_i \myvec{u}_i^T$ be the eigendecomposition of $\myvec{P_{\text{shared}}}$. A feature vector $\myvec{f}_x$ that varies slowly with respect to the underlying shared structure has a larger component within the subspace spanned by the leading eigenvectors 
 of $\myvec{P_{\text{shared}}}$, and thus a higher score.

To learn features with high generalized Laplacian Scores and accurate graph operators, mmDUFS learns two sets of Stochastic Gates $\myvec{z}_x$ and $\myvec{z}_y$ that filter irrelevant  features in each modality. Similar to DUFS \citep{lindenbaum2021differentiable}, these stochastic gates multiply the data matrices $\myvec{X}$ and $\myvec{Y}$ to remove nuisance features, i.e., %$\myvec{\tilde{X}} = \myvec{X} \odot \myvec{Z}_x$
$\myvec{\tilde{X}} = \myvec{X}\Delta(\myvec{z}_x)$
and 
%$\myvec{\tilde{Y}} = \myvec{Y} \odot \myvec{Z}_y$
$\myvec{\tilde{Y}} = \myvec{Y} \Delta(\myvec{z}_y)$.
%, where $\myvec{Z}_x \in \mathbb{R}^{n\times d}$ represent $n$ copies of $\myvec{z}_x$ and $\myvec{Z}_y \in \mathbb{R}^{n\times m}$, represents $n$ copies of  $\myvec{z}_y$.
At each iteration, the updated graph operators ($\myvec{\tilde{P}_{\text{shared}}}$, $\myvec{\tilde{Q}}_{x}$, $\myvec{\tilde{Q}}_{y}$)  are recomputed based on the gated inputs.% and $\myvec{z}_x$ and $\myvec{z}_y$ can then be further optimized. 




mmDUFS has two modes: (i) detecting shared structures  using the shared graph operator $\myvec{\tilde{P}_{\text{shared}}}$, and (ii) detecting  modality-specific structures using the differential graph operators $\myvec{\tilde{Q}}_{x}$, and $\myvec{\tilde{Q}}_{y}$. %Let $\myvec{f}^x_i ,\myvec{f}^y_i$ denote the $i$-th feature vector of modalities $X,Y$ respectively. 
To learn the shared structure and the corresponding features, we propose to optimize $\myvec{z}_x$ and $\myvec{z}_y$ by minimizing the following loss function:
\begin{align*}
    \mathcal{L}_{\text{shared}} &= -\frac{1}{n} \text{Tr}[\myvec{\tilde{X}}^T \myvec{\tilde{P}_{\text{shared}}} \myvec{\tilde{X}}] - \frac{1}{n}
    \text{Tr}[\myvec{\tilde{Y}}^T \myvec{\tilde{P}_{\text{shared}}} \myvec{\tilde{Y}}]
     \\
    &+ \lambda_x \|\myvec{z}_x \|_0 +\lambda_y \|\myvec{z}_y \|_0,
   % \mathcal{L}_{\text{shared}}(X,Y) 
   % &= -\frac{1}{d} \sum_{i = 1}^d \ell_{\text{shared}}(\myvec{f}^x_i) 
   % -\frac{1}{m} \sum_{i = 1}^m \ell_{\text{shared}}(\myvec{f}^y_i) \\    
   %  &+ \lambda_x \|\myvec{z}_x \|_0 +\lambda_y \|\myvec{z}_y \|_0
    %\label{eq:mvDUFS_loss1}
\end{align*}
where the first two terms are the Shared Laplacian Scores for each modality, and the regularizers 
%$\text{Tr}[\myvec{\tilde{X}}^T \myvec{P_{\tilde{\text{shared}}}} \myvec{\tilde{X}}]$ 
    %and $\text{Tr}[\myvec{\tilde{Y}}^T \myvec{P_{\tilde{\text{shared}}}} \myvec{\tilde{Y}}]$ are two sums of generalized Laplacian Scores for each modality with the shared graph operator $\myvec{P_{\tilde{\text{shared}}}}$, and
    $\lambda_x \|\myvec{z}_x \|_0$ and $\lambda_y \|\myvec{z}_y \|_0$ control the number of selected features for each modality. The parameters $\lambda_x,\lambda_y$ are tunable, and determine the sparsity level.  In Appendix B.1, we suggest a procedure to set these regularization parameters. 
To learn differential structures that appear in only one modality, we suggest the loss functions $\mathcal{L}_{x},\mathcal{L}_{y}$,
\begin{align}
    %\mathcal{L}_{x} &= -\sum_{i=1}^d \ell^x(\myvec{f}_i^x)
    %\mathcal{L}_{y} &= -\sum_{i=1}^m \ell^y(\myvec{f}_i^y)    
    \mathcal{L}_{x} &= -\frac{1}{n} \text{Tr}[\myvec{\tilde{X}}^T \myvec{\tilde{Q}}_{x} \myvec{\tilde{X}}]  
    +\lambda_x \|\myvec{z}_x \|_0 ,
\notag \\
    \mathcal{L}_{y} &= -\frac{1}{n}  \text{Tr}[\myvec{\tilde{Y}}^T \myvec{\tilde{Q}}_{y} \myvec{\tilde{Y}}]  
    + \lambda_y \|\myvec{z}_y \|_0.
    \label{eq:mvDUFS_loss2}
\end{align}
The first term in each loss is the Differential Laplacian Score. Minimizing the losses in Eq. \eqref{eq:mvDUFS_loss2} yields a set of features that are smooth with respect to the differential graph operators $\myvec{Q}_x$ 
 and $\myvec{Q}_y$, with a sparsity level controlled by $\lambda_x$ and $\lambda_y$. In Section \ref{sec:simulation}, we show the usefulness of these score functions for detecting relevant features. 
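
The link between the trace terms in these losses and the per-feature scores can be made explicit with the following identity. It is a sketch that assumes $\Delta(\myvec{z}_x)$ denotes the diagonal matrix with the gate values $z_{x,1},\dots,z_{x,d}$ on its diagonal, and writes $\myvec{f}_{x,j}$ for the $j$-th column of $\myvec{X}$ (this indexing is ours, introduced for illustration). For any $n \times n$ operator $\myvec{A}$, such as $\myvec{\tilde{P}_{\text{shared}}}$ or $\myvec{\tilde{Q}}_{x}$,
\[
\text{Tr}[\myvec{\tilde{X}}^T \myvec{A} \myvec{\tilde{X}}]
= \sum_{j=1}^{d} z_{x,j}^2 \, \myvec{f}_{x,j}^T \myvec{A} \myvec{f}_{x,j},
\]
since the $j$-th column of $\myvec{\tilde{X}} = \myvec{X}\Delta(\myvec{z}_x)$ is $z_{x,j}\myvec{f}_{x,j}$. Each trace term is therefore a gate-weighted sum of generalized Laplacian Scores: closing a gate ($z_{x,j} = 0$) removes the contribution of feature $j$, while the $\ell_0$ term penalizes the number of open gates. Note that the operators are themselves recomputed from the gated data, so the gates enter the loss both through $\myvec{\tilde{X}}$ and through the operator.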
%where the left term in the two loss functions computes a sum of the generalized Laplacian Scores with respect to the differential operator for one modality. The regularization terms $\lambda_x \|\myvec{Z}_x \|_0$ and $\lambda_y \|\myvec{Z}_y \|_0$ control the number of selected features for each modality, with $\lambda_x$ and $\lambda_y$ being the tunable parameters that control the level of sparsity.

\section{Related Work}
Learning the latent structures in multi-modal data has been studied extensively in the context of data fusion, where most existing methods aim to extract shared information from the modalities \citep{DCCA,lederman2014common,zhou2007spectral,lindenbaum2015learning}. Only a few methods study the differences between modalities \citep{shnitzer2019recovering}. However, these multi-modal learning methods become unsuitable when many nuisance or noisy features are present in the data. In \citep{cohen2022manifest,sristi2022disc}, the authors use the manifold assumption to tackle feature selection and clustering in the supervised setting. In the unsupervised setting, several authors propose different Unsupervised Feature Selection (UFS) schemes to alleviate the influence of nuisance features. These methods aim to identify a subset of smooth features with respect to the underlying structure \citep{zhao2012spectral,lindenbaum2021differentiable,shaham2021deep}. However, they focus on a single modality and are not applicable to multi-modal data.

%Our work proposes two novel graph operators, the shared and the differential operators, to learn the shared and modality-specific structures in the multi-modal setup. We extend the DUFS framework \citep{lindenbaum2021differentiable} to perform multi-modal feature selection of the smooth features with respect to the proposed graph operators.

\section{Results}
\label{sec:simulation}

We benchmark mmDUFS\footnote{Code is available at https://github.com/jcyang34/mmDUFS} using synthetic and real multi-modal datasets. %with underlying structures that are shared or specific to each modality. 
For discovering the shared structures and associated features, we compare mmDUFS with the shared operator to the following variants of kernel fusion-based methods previously proposed for dimensionality reduction: (1) Matrix Concatenation (MC), where the Laplacian is computed based on a concatenated matrix of the two modalities. (2) 
Multi-modal Kernel Sum (mmKS) \citep{zhou2007spectral}, where the Laplacian is equal to $\myvec{L}_x + \myvec{L}_y$. (3) Multi-modal Kernel Product (mmKP) \citep{lindenbaum2015learning,lindenbaum2016multi,lindenbaum2020multi},
where the Laplacian is equal to $\myvec{L}_x\myvec{L}_y$.

To benchmark mmDUFS on detecting differential features, we extended MC, mmKS, and mmKP by the following steps: (i) Compute sets $S_x,S_y$ of features that are smooth with respect to $\myvec{L}_x$ and $\myvec{L}_y$, separately, via standard Laplacian Scores. These sets contain both shared and modality-specific features. (ii) Apply either MC, mmKS, or mmKP to compute a set $S_{xy}$ of shared features. (iii) Remove the shared features $S_{xy}$ from the sets $S_x,S_y$ to obtain the features that are modality-specific to $\myX$ and $\myY$, respectively. 

%Assuming we want to obtain features specific to modality $\myX$,  we first compute a set $S_x$ of features that are smooth on $\myX$ using Laplacian Scores, without taking $\myY$ into consideration. Thus, the selected set of features contains both shared and modality-specific features. Next, we apply MC, mmKS or mmKP to compute a set $S_{xy}$ of shared features. The final step is to recover the modality specific features by removing from $S_x$ any feature that is also in $S_{xy}$. The remaining features are the modality-specific features. 


%Specifically, MC first builds a concatenated data matrix $\myvec{Z} = [X,Y]\in \mathbb{R}^{n\times (d+m)}$, then constructs the corresponding Laplacian matrix $\myvec{L}_z$. mmKS first constructs Laplacian matrices $\myvec{L}_x$ and $\myvec{L}_y$ for each modality, then computes the sum of the two matrices: $\myvec{L}_z =\myvec{L}_x + \myvec{L}_y$. mmKP first constructs the Laplacian matrices $\myvec{L}_x$ and $\myvec{L}_y$ for each modality, then computes the product of the two matrices: $\myvec{L}_z =\myvec{L}_x\myvec{L}_y$. 

For each baseline, the $k$ features with the highest Laplacian Scores are selected. For the synthetic datasets, we set $k$ to be the correct number of informative features. We evaluate the performance of the different methods by the F1-score, $\text{F1} = \text{TP}/(\text{TP}+ \frac{1}{2}(\text{FP}+\text{FN}))$, where TP is the number of informative features selected by each method, FP is the number of uninformative selected features, and FN is the number of missed informative features. For the rescaled MNIST and rotating doll examples, the informative features are set to the $25\%$ of pixels with the highest standard deviation.
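As a brief illustration of this metric (not an additional result), recall the standard form $\text{F1} = 2\text{TP}/(2\text{TP}+\text{FP}+\text{FN})$. When exactly $k$ features are selected and exactly $k$ features are informative, we have $\text{TP}+\text{FP} = \text{TP}+\text{FN} = k$, hence $\text{FP} = \text{FN}$ and
\[
\text{F1} = \frac{2\,\text{TP}}{2\,\text{TP} + 2\,\text{FP}} = \frac{\text{TP}}{k}.
\]
In that case, the F1-score coincides with both precision and recall, and can be read as the fraction of informative features that were recovered.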

%\begin{itemize}
%    \item Artificial - comparing to the known features (F1 score)
%    \begin{itemize}
%        \item Concatenation
%        \item Laplacian sum
%        \item Kernel product
%    \end{itemize}
%    \item Artificical scRNA- same as Gaussian data
%    \item MNIST - Compare top (Say) 100 selected features to the pixels with highest average (F1 score)
%    \begin{itemize}
%        \item Same baselines        
%    \end{itemize}
%    \item Videos (somehow estimate) 
%\end{itemize}

\subsection{Synthetic Examples}

\paragraph{Rescaled MNIST.} %To illustrate our model's ability to select features corresponding to the shared and differential structures between modalities, 
We designed a rescaled MNIST example with shared and modality-specific digits.
We first randomly sample one image ($28 \times 28$ pixels) of each of the digits $0$, $3$, and $8$. Then, we rescale each digit randomly and independently $500$ times, resulting in $500$ images of each of $0$, $3$, and $8$. We concatenate pairs of $0$ and $3$ to create modality $\myX$, and pairs of the same $3$ and a random $8$ to create $\myY$; see the example in Fig. \ref{fig:mnist_samples}. Thus, this dataset consists of $500$ samples and $28 \times 56$ pixels in each modality, with digit $3$ shared between the modalities and digits $0$ and $8$ modality-specific.



\begin{figure*}[htb!]%
    \centering
   
    \parbox{\figrasterwd}{
    \centering
    \parbox{.2\figrasterwd}{%
     \centering
      \subcaptionbox{\label{fig:mnist_samples}}{\includegraphics[height=0.14\textwidth,width=0.15\textwidth]{Figs/MNIST_resized_example_digits}}
      \centering
      \subcaptionbox{\label{fig:mnist_gates}}{\includegraphics[height=0.14\textwidth,width=0.15\textwidth]{Figs/MNIST_resized_result_gates}}  
    }
    \parbox{.2\figrasterwd}{%
     \centering
      \subcaptionbox{\label{fig:tree_umap}}{\includegraphics[height=0.3\textwidth,width=0.15\textwidth]{Figs/synthetic_tree_umap}}
      }
     \parbox{.5\figrasterwd}{%
     \centering
      \subcaptionbox{\label{fig:tree_shared}}{\includegraphics[height=0.14\textwidth,width=0.5\textwidth]{Figs/synthetic_tree_losses_shared}}
      \subcaptionbox{\label{fig:tree_diff}}{\includegraphics[height=0.14\textwidth,width=0.5\textwidth]{Figs/synthetic_tree_losses_diff}}  
    }
    }
    \caption{Left (a-b): Evaluation of the proposed approach on the rescaled MNIST dataset. (a): Random images from modality $\myX$ (upper row) and modality $\myY$ (bottom row) in gray-scale. (b): Selected pixels (dark blue) for the shared operator (left column) and the differential operator (right column). Right (c-e): Synthetic developmental tree example. (c): UMAP embeddings of the tree using data from modality $\myX$ (top) and modality $\myY$ (bottom). (d-e): Change of the Shared/Differential Laplacian Scores, regularization loss, and the F1-score of the selected features as a function of the number of epochs (x-axis) for mmDUFS with the shared operator (panel (d)) and the differential operator (panel (e)).}%
    \label{fig:MNIST_result}%
 
\end{figure*}



We apply mmDUFS with the shared operator to this example to select pixels corresponding to $3$. The left column of Fig. \ref{fig:mnist_gates} shows the pixel gate values from mmDUFS for modality $\myX$ (top) and $\myY$ (bottom). We can see that the selected pixels outline the shape of the digit $3$ well. Table \ref{tab:f1_all} compares the F1-score achieved by mmDUFS to three baselines. %To further quantify the feature selection performance of mmDUFS, we compare it to different baseline methods in terms of F1-score in Table \ref{tab:f1_all}. 
We can see that mmDUFS achieves a higher F1-score than all the baselines on both modalities, demonstrating its ability to identify informative features accurately.
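
For concreteness, the F1-scores reported here can be read through the standard set-based definition. The symbols $\mathcal{S}$ and $\mathcal{S}^{\ast}$ are introduced only for this remark and are assumed to denote, for one modality, the set of selected features and the set of ground-truth informative features, respectively:
\begin{align*}
    \mathrm{Precision} &= \frac{\lvert \mathcal{S} \cap \mathcal{S}^{\ast} \rvert}{\lvert \mathcal{S} \rvert}, \qquad
    \mathrm{Recall} = \frac{\lvert \mathcal{S} \cap \mathcal{S}^{\ast} \rvert}{\lvert \mathcal{S}^{\ast} \rvert}, \\
    \mathrm{F1} &= \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
    = \frac{2 \lvert \mathcal{S} \cap \mathcal{S}^{\ast} \rvert}{\lvert \mathcal{S} \rvert + \lvert \mathcal{S}^{\ast} \rvert}.
\end{align*}
Under this reading, an F1-score of $1$ means that the selected set coincides exactly with the informative set, and the score drops both when informative features are missed and when uninformative ones are selected.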

Next, we apply mmDUFS with the differential operator to select modality-specific pixels. The right column of Fig. \ref{fig:mnist_gates} shows the pixel gate values for both modality $\myX$ (top) and $\myY$ (bottom). We can see that mmDUFS selects pixels that outline digits $0,8$ for modalities $\myX,\myY$, respectively. Additionally, mmDUFS achieves F1-scores of $0.8059$ and $0.8832$ for $\myX$ and $\myY$, respectively, showcasing its effectiveness in identifying features contributing to the differential structures.

Lastly, we demonstrate that our model can be extended and applied to scenarios with more than $2$ modalities. We extend the rescaled MNIST dataset by adding another modality ($\myvec{Z}$), which contains $500$ concatenated images of rescaled digits 3 and 4. Therefore, digit 3 is shared across all three modalities. We apply mmDUFS with the extended shared operator to this dataset to select pixels corresponding to digit 3. In Supplementary Table 1, we can see that mmDUFS outperforms all the baselines in terms of the F1-score, demonstrating its ability to accurately identify informative features in multimodal scenarios.

\begin{table}[htb!]%
       \centering
       \begin{adjustbox}{max width=0.48 \textwidth,min height = 0.5 in, valign=c}
             \begin{tabular}{|c|c||c|c|c|c|}
                \hline 
              Dataset & Modality & MC & mmKS & mmKP & mmDUFS \\
              \hline

             \multirow{2}{*}{Rescaled MNIST} &  X  &  0.3547 & 0.5291 & 0.5291 & \textbf{0.7093}\\%\textbf{0.6919} \\
            
                & Y & 0.4826 & 0.6219 & 0.6219 & 
                 \textbf{0.8159}\\

            \hline
           \multirow{2}{*}{Synthetic Developmental Tree} & X & 0.6000 & 0.7800 &  0.8400 & \textbf{0.8800}\\
             &  Y &  0.7800 & 0.8000 & 0.8200 & \textbf{0.9000} \\
            \hline
              \multirow{2}{*}{Original Gaussian} & X & 0.5000 & 0.7333 & \textbf{1} & \textbf{1}\\
            & Y &  0.5500 & 0.6500 & 0.9500 & \textbf{1} \\
             
             \multirow{2}{*}{Gaussian + $10$ Noisy Feats} & X & 0.5000 & 0.7333 & \textbf{1} & \textbf{1} \\
                                               & Y &  0.5000 & 0.6500 & 0.9000 & \textbf{1} \\
            
            \multirow{2}{*}{Gaussian + $30$ Noisy Feats} & X & 0.4667 & 0.7000 & 0.9667 & \textbf{1} \\
                                              & Y & 0.4500 & 0.5500 & 0.8500 & \textbf{1} \\
           
            \multirow{2}{*}{Gaussian + $50$ Noisy Feats} & X & 0.4000 & 0.6333 & 0.9333 & \textbf{0.9667} \\
                                              & Y & 0.4000 & 0.5500 & 0.8000 & \textbf{0.8500} \\
            \hline
        
              
         \end{tabular}%
         \end{adjustbox}
   
    \caption{Comparing F1-score of the features associated with the shared structures between different methods on the rescaled MNIST example, the synthetic tree example, and the Gaussian mixture example with different numbers of additive noisy features.}%
  
    \label{tab:f1_all}%
   
 
\end{table}

\paragraph{Synthetic Developmental Tree.}

Tree structures are ubiquitous throughout different biological processes and data modalities in single-cell biology \citep{plass2018cell,zhang2021single}. To understand the interplay of different mechanisms underlying the complex developmental process, it is vital to discover the genetic features that contribute to the tree structure shared across modalities and those that contribute to modality-specific structures. 

We evaluate mmDUFS using a simulated developmental tree example generated via a tree simulator\footnote{https://github.com/dynverse/dyntoy}. The original data has $1000$ samples and $100$ features. We divide the data in half, such that each modality has $50$ informative features that contribute to the shared tree structure, as shown in the UMAP embeddings in Fig. \ref{fig:tree_umap}, where the samples in the tree are grouped into different branch groups (labeled $G_1$ to $G_6$). We then add $50$ features drawn from negative binomial distributions to each modality 
to create differential branches that are observed in only one modality. Specifically, branches $G_1$ and $G_2$ are bifurcated in modality $\myX$ (top UMAP embeddings) but are mixed in modality $\myY$ (bottom UMAP embeddings), and $G_3$ and $G_4$ are bifurcated in modality $\myY$ but are mixed in modality $\myX$ (see Supplementary section B.3 for further details). After log-transforming and z-scoring the data, we concatenate $200$ features drawn from $N(0,1)$ to each modality as noisy features.
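
As a quick tally of this construction, taking the counts stated above at face value (the symbol $d$ is introduced only for this remark), each modality ends up with
\begin{equation*}
    d = \underbrace{50}_{\text{shared tree}} + \underbrace{50}_{\text{differential branches}} + \underbrace{200}_{\text{noise}} = 300
\end{equation*}
features, so only $100/300 = 1/3$ of the features in each modality are informative for either structure.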


We apply our model with the shared and differential operators to recover the features that contribute to the overall tree structure and the set of features that contribute to the split branches, respectively. Fig. \ref{fig:tree_shared} shows the change, during training with the shared loss, in the Shared/Differential Laplacian Scores, the regularization loss, and the F1-score. 
Fig. \ref{fig:tree_diff} shows the same properties for the differential loss.
%In both cases, mmDUFS selects the informative features that high scores and removes the noisy features with the training progress. 
Table \ref{tab:f1_all} compares the F1-score of the selected features between different methods. Here as well,  mmDUFS clearly outperforms the other methods.


\begin{figure*}[htb!]%
\centering
   \parbox{\figrasterwd}{
   \centering
    \parbox{.35\figrasterwd}{%
     \centering
      \subcaptionbox{\label{fig:video_samples}}{\includegraphics[height=0.26\textwidth,width=0.3\linewidth]{Figs/video_result_new_samples}} 
      \centering
      \subcaptionbox{\label{fig:video_shared}}{\includegraphics[height=0.26\textwidth,width=0.3\linewidth]{Figs/video_result_new_shared}}  
     \centering
      \subcaptionbox{\label{fig:video_diff}}{\includegraphics[height=0.26\textwidth,width=0.3\linewidth]{Figs/video_result_new_diff}}  
      
    }
    \parbox{.2\figrasterwd}{%
     \centering
      \subcaptionbox{\label{fig:cbmc_umap}}{\includegraphics[height=0.26\textwidth,width=0.7\linewidth]{Figs/cbmc_subset_umap}}
    }
    \parbox{.4\figrasterwd}{%
      \centering
      \subcaptionbox{\label{fig:cbmc_genes}}{\includegraphics[height=0.26\textwidth,width=\linewidth]{Figs/cbmc_subset_genes}}
    }
    }

    \caption{Left (a-c): Rotating dolls example. (a): Random images of the dolls from each video. (b-c): Selected pixels are marked in blue for mmDUFS with the shared operator (b) and the differential operator (c). Right (d-e): CITE-seq data example. (d): UMAP embeddings using the RNA (top) and protein data (bottom), colored by cell type labels. (e): Similar UMAP embeddings colored by the expression level of several genes selected by mmDUFS with the differential operator.}%
    
    \label{fig:video_result}%
 
\end{figure*}


\paragraph{Synthetic Gaussian Mixtures.}

We generated a multi-modal Gaussian mixture dataset, where $\myX$ and $\myY$ each have three clusters. Two clusters are shared between modalities, and one cluster is specific to each modality. The observations in each modality include features informative of the clusters, along with noisy features (see Appendix B.2). 

We apply mmDUFS to uncover the informative features of the shared clusters and the modality-specific clusters. In Fig. 3 of Supplementary section B.2, we plot the change of the average shared/differential Laplacian Scores across features, the regularization loss, and the F1-score of the selected features from mmDUFS with respect to the number of epochs. MmDUFS gradually selects the correct features while dropping the non-informative ones. To evaluate mmDUFS's feature selection capability in challenging regimes, we inject $10$, $30$, and $50$ noisy features into each modality and compare the F1-score of features selected by different methods in each regime. Table \ref{tab:f1_all} shows that mmDUFS matches or outperforms the baseline methods in every regime and remains the most accurate method even with $50$ added noisy features. 






\subsection{Real Data}
\paragraph{Rotating Dolls.}

%Next, we evaluate mmDUFS's performance on a dual-modal rotating toy dataset \citep{lederman2018learning}. In this data, $3$ stuffed toys, a Yoda, a dog, and a rabbit, are placed on top of a rotation stand, and $2$ cameras are set up from $2$ different angles to film the toys simultaneously while they are rotating (shown in Fig. \ref{fig:video_samples}, the top is video $1$, and the bottom is video $2$). 

%By design, video $1$ only captures the movement of the Yoda and the dog, whereas video $2$ only captures the movement of the dog and the rabbit. In this setup, the dog is the shared structure between the $2$ videos, and the Yoda and the rabbit are specific to each video. By treating each frame of the shots as one sample ($4050$ in total) and the gray-scaled pixels as features, we aim to uncover the toys corresponding to each structure using mmDUFS.


We evaluate mmDUFS's performance on the rotating doll video dataset described in Sec. \ref{sec:joint_op}, in which $2$ cameras each capture $2$ dolls from different angles (Fig. \ref{fig:video_samples}). By treating each video frame as one sample ($4050$ in total) and the gray-scaled pixels as features, we aim to uncover pixels that correspond to the shared doll (the dog) and the modality-specific dolls (Yoda and rabbit).

For mmDUFS with the shared operator, Fig. \ref{fig:video_shared} shows the selected pixels in both videos, as indicated by the blue dots. The shape of the dog is clearly delineated in both modalities. We further compute the F1-score of the selected pixels with respect to the underlying pixels that correspond to the dog. mmDUFS achieves F1-scores of $0.7158$ and $0.8033$ for the two modalities, whereas MC achieves $0.2390$ and $0.3822$, and mmKS and mmKP both achieve $0.5452$ and $0.6868$. 
Fig. \ref{fig:video_diff} shows the selected pixels of mmDUFS with the differential operator in the two videos. In video 1, mmDUFS selects mostly pixels corresponding to the Yoda (F1-score: $0.8861$). In video 2, mmDUFS selects mostly pixels corresponding to the rabbit (F1-score: $0.7446$). 

To demonstrate that our model can extract useful information from high-dimensional measurements, we use the selected features to estimate the rotation angles of the shared doll (the dog). To compute the \textit{ground truth} rotation angles, we first identify the top 25\% of pixels with the highest standard deviation. Then, we keep only the pixels that belong to the dog and compute the angle via the leading two Laplacian eigenvectors computed from these pixels. 
%We first identify the informative features as the top 25\% pixels with the highest standard deviation, then we crop out the pixels corresponding to the dog and compute the diffusion map coordinates using these pixels. For every two consecutive images (samples), we calculate the relative rotation angles using the diffusion map coordinates and define these angles as the ground truth rotation angles. 
Next, we compute the estimated angles that are based on features 
detected by mmDUFS and the other baseline methods. 
%apply different baselines to this data to select pixels corresponding to shared information. Based on these pixels, we again estimate the rotation angles. To compare the performance of different methods, 
For comparison, we compute the mean squared error between the estimated angles and the ground truth, as shown below in Table \ref{tab:mse_doll}. We can see that mmDUFS outperforms other methods, which shows that it can improve the capability of extracting latent information in multimodal data in unsupervised settings.
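
The error reported in Table \ref{tab:mse_doll} can be read as the usual mean squared error. In the sketch below, $\theta_i$ and $\hat{\theta}_i$ denote the $i$-th ground-truth and estimated rotation angle, and $m$ denotes the number of angle estimates; these symbols are introduced only for illustration, and the exact number of angles entering the average is an assumption not specified above:
\begin{equation*}
    \mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \bigl( \hat{\theta}_i - \theta_i \bigr)^2 .
\end{equation*}
Lower values indicate that the selected pixels preserve the rotation of the shared doll more faithfully.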

\begin{table}[!htb]
    \centering
    \begin{adjustbox}{max width=0.4 \textwidth, valign=c}
    \begin{tabular}{|c|c|c|c|c|}
         \hline
         Modality &  MC & mmKS & mmKP & mmDUFS\\
         \hline
         $X$& 0.4559  & 0.1146 & 0.1146 & \textbf{0.0150}\\
         \hline
         $Y$ & 0.5353 & 0.0509 & 0.0509 & \textbf{0.0426} \\
         \hline
    \end{tabular}
    \end{adjustbox}
    \caption{MSE of the estimated doll rotation angles.}
    \label{tab:mse_doll}
\end{table}


\paragraph{CITE-seq Dataset.}
In single-cell biology, cell states are characterized by different features at different molecular levels. Identifying the contributing features is an open question crucial to understanding the underlying cell systems. We apply mmDUFS to a human cord blood mononuclear cell (CBMC) CITE-seq dataset from \citet{stoeckius2017simultaneous}, in which cells are profiled at both the transcriptomic and proteomic levels by measuring the expression of genes and protein markers, to identify the genes and proteins that characterize the cell states in the multi-modal setting. 

In this data, a group of murine cells is spiked in as a control. Fig. \ref{fig:cbmc_umap} shows UMAP embeddings of the cells based on their RNA expression (top) and protein expression (bottom). From the full dataset, we analyzed $3$ cell populations: murine cells (blue) and $2$ CBMC populations (Erythroids (orange) and CD34+ cells (green)). This dataset has $832$ cells, with $500$ top variable genes from modality 1 and $10$ protein markers from modality 2. We can see that the murine cells are separable from the Erythroids in the RNA space but not in the proteomic space.
We apply mmDUFS with the differential operator to this data to identify which gene markers contribute to the separation between cell groups. 

To evaluate the quality of each set of selected features, we used each set to train an SVM model to classify Erythroids and murine cells (i.e., the differential structures). With a $5\%$ / $95\%$ training/test split, MC/mmKS/mmKP achieve $96.97\%$ / $93.80\%$ / $93.80\%$ average balanced test accuracy, respectively, whereas mmDUFS achieves $97.52\%$ average balanced test accuracy (over $10$ repetitions). Examining the genes selected by each model, we found that mmDUFS mostly selects murine genes. These murine genes are exclusively expressed in murine cells, as shown in Fig. \ref{fig:cbmc_genes}, so we expect these genes to better separate the two cell types. In summary, this result shows that mmDUFS can better preserve the modality-specific structure (two separable cell types) and the informative features that are relevant to this structure in single-cell multi-omic data.
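
For reference, balanced accuracy is conventionally the mean of the per-class recalls. For the two-class problem considered here it can be written as below, where $\mathrm{TP}$, $\mathrm{FN}$, $\mathrm{TN}$, and $\mathrm{FP}$ are the test-set confusion-matrix counts; treating one of the two cell types (say, the murine cells) as the positive class is an arbitrary choice made only for this illustration:
\begin{equation*}
    \mathrm{BalancedAccuracy} = \frac{1}{2} \left( \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} + \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} \right).
\end{equation*}
Because each class contributes equally regardless of its size, this measure is not inflated when one cell population is much larger than the other.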


\section{Discussion}

We present mmDUFS, a feature selection method that learns two novel graph operators that capture the \textit{shared} and the \textit{modality-specific} structures in multi-modal data, while simultaneously selecting the features that are informative for these structures. MmDUFS can operate on small batches, which makes it scalable to large datasets. On the other hand, finding the optimal regularization parameters for mmDUFS on real data may be challenging; to address this, we suggest an automatic procedure in Appendix B.1. 
A second potential limitation is the $\mathcal O(n^3)$ computational complexity required to compute $ \tilde{\myvec{L}}$ (Eq. \eqref{eq:diff_op_y}). A possible solution is to reduce the complexity by computing a sparse Laplacian matrix. 

\subsubsection*{Acknowledgement}

The authors thank Amit Moscovich for helpful discussions and feedback. 
Y.K. acknowledges support by grants R01GM131642, UM1PA051410, R33DA047037, U54AG076043, U54AG079759, U01DA053628, P50CA121974, and R01GM135928.
\bibliography{yang_240}

% References

\end{document}
