% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} 


\title{Greedy Modality Selection via Approximate Submodular Maximization}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    % \bibliographystyle{plainnat}
    \bibliographystyle{abbrvnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example


\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
% \usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
% \usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
% \usepackage{nicefrac}       % compact symbols for 1/2, etc.
% \usepackage{microtype}      % microtypography
\usepackage{xspace}
\usepackage{xcolor}         % colors

% new packages
\usepackage{mathtools}
\usepackage{amsthm}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{bbm}
% \usepackage{algorithm}
% \usepackage{algpseudocode}
\usepackage[ruled,vlined]{algorithm2e}
\usepackage{thmtools}
\usepackage{thm-restate}
\usepackage[capitalise]{cleveref}
\usepackage{bm}

% function class, spaces, operator, etc.
\newcommand{\fX}{\mathcal{X}}
\newcommand{\fY}{\mathcal{Y}}
\newcommand{\fD}{\mathcal{D}}
\newcommand{\fH}{\mathcal{H}}


% big sets
\newcommand{\sR}{\mathbb{R}}
\newcommand{\sN}{\mathbb{N}}
\newcommand{\sZ}{\mathbb{Z}}

% other
\newcommand{\iid}{i.i.d.}
\newcommand{\1}[1]{\mathbbm{1}(#1)}
\newcommand{\Exp}{\mathbb{E}}
\newcommand{\Var}{\mathrm{Var}}
\newcommand{\Cov}{\mathrm{Cov}}
\newcommand{\Res}{\mathrm{Res}}
% util function 
\newcommand{\util}{f_u}
\newcommand{\wrt}{w.r.t.\xspace}
\newcommand{\nphard}{$\mathcal{NP}$-hard\xspace}
\newcommand{\npcomplete}{$\mathcal{NP}$-complete\xspace}
\newcommand{\sharpphard}{$\mathcal{\sharp P}$-hard\xspace}
% name replace
\newcommand{\mm}{multimodal\xspace}
\newcommand{\mmcap}{Multimodal\xspace}
\newcommand{\polytime}{polynomial time\xspace}
\newcommand{\greedyset}{greedily-obtained set\xspace}
\newcommand{\greedyvalue}{greedily-obtained value\xspace}
\newcommand{\optimalset}{optimal set\xspace}
\newcommand{\optimalvalue}{optimal value\xspace}
% experiment
\newcommand{\pmnist}{Patch-MNIST\xspace}
\newcommand{\mnist}{MNIST\xspace}
\newcommand{\pems}{PEMS-SF\xspace}
\newcommand{\mosi}{CMU-MOSI\xspace}

% algorithm2e
\SetKwInput{KwInput}{Input}                % Set the Input
\SetKwInput{KwOutput}{Output}              % set the Output

\renewcommand{\qedsymbol}{$\blacksquare$}
\newcommand{\defeq}{\vcentcolon=}
\newcommand{\eqdef}{=\vcentcolon}
\newcommand{\eps}{\varepsilon}
\newcommand{\R}{\mathbb{R}}

\newcommand{\suchthat}{\xspace\ s.t.\ \xspace}
\newcommand{\KLD}[2]{D_{\mathrm{KL}}(#1~\|~#2)}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}

% for restatement
\declaretheorem[name=Theorem,numberwithin=section]{thm}
\declaretheorem[name=Corollary,numberwithin=section]{cor}
\declaretheorem[name=Proposition,numberwithin=section]{prop}
\declaretheorem[name=Lemma,numberwithin=section]{lem}
\declaretheorem[name=Definition,numberwithin=section]{defi}
\declaretheorem[name=Assumption,numberwithin=section]{as}

\declaretheorem[name=Algorithm,numberwithin=section]{algo}

\theoremstyle{definition}
\newtheorem*{remark}{Remark}

% editing
\newcommand{\comment}[1]{}
\newcommand{\fix}[1]{\textcolor{red}{#1}}
\newcommand{\han}[1]{\textcolor{blue}{\textbf{[Han]: #1}}}
\newcommand{\sam}[1]{\textcolor{red}{[Sam]: #1}}
\newcommand{\gargi}[1]{\textcolor{magenta}{\textbf{[Gargi]: #1}}}
\newcommand{\yifei}[1]{\textcolor{cyan}{[Yifei]: #1}}

\begin{document}

% Add authors
% \author[1]{anonymous}
\author[1]{\href{mailto:<rcheng12@illinois.edu>}{Runxiang Cheng$^{*}$}{}}
\author[1]{\href{mailto:<gargib2@illinois.edu>}{Gargi Balasubramaniam$^{*}$}{}}
\author[1]{\href{mailto:<yifeihe3@illinois.edu>}{Yifei He\thanks{Equal contribution.}}{}}
\author[2]{\href{mailto:<yaohungt@cs.cmu.edu>}{Yao-Hung Hubert Tsai}{}}
\author[1]{\href{mailto:<hanzhao@illinois.edu>}{Han Zhao}{}}
% Add affiliations after the authors
\affil[1]{%
    University of Illinois Urbana-Champaign, Illinois, USA
}
\affil[2]{%
    Carnegie Mellon University, Pennsylvania, USA
}

\maketitle

% \input{abstract}
\begin{abstract}

Multimodal learning considers learning from multi-modality data, aiming to fuse heterogeneous sources of information. However, it is not always feasible to leverage all available modalities due to memory constraints. Further, training on all the modalities may be inefficient when redundant information exists within data, such as different subsets of modalities providing similar performance. In light of these challenges, we study \emph{modality selection}, intending to efficiently select the most informative and complementary modalities under certain computational constraints. We formulate a theoretical framework for optimizing modality selection in multimodal learning and introduce a utility measure to quantify the benefit of selecting a modality. For this optimization problem, we present efficient algorithms when the utility measure exhibits monotonicity and approximate submodularity. We also connect the utility measure with existing Shapley-value-based feature importance scores. Last, we demonstrate the efficacy of our algorithm on synthetic (Patch-MNIST) and real-world (PEMS-SF, CMU-MOSI) datasets.


% \comment{Inspired by diminishing returns, w}
% However, there are emerging scenarios like multimodal activity recognition where there may be a significant number of modalities (e.g. different sensors, camera views) with overheads required to collect and maintain them. 
\end{abstract}

% \input{introduction}
\section{Introduction}


% generic introduction of multimodal learning, how multimodal is better than unimodal. Motivate the problem of multimodal selection.
% \han{I was also thinking to write a note explicitly as a footnote to mention that in this paper we use the terms modality/view interchangably.}

\mmcap learning considers learning with data from multiple modalities (e.g., images, text, speech, etc.) to improve generalization of the learned models by using complementary information from different modalities.\footnote{We use the terms modality/view interchangeably.} In many real-world applications, \mm learning has shown superior performance~\citep{bapna2022mslam, wu2021n}, and has demonstrated a stronger capability over learning from a single modality. The advantages of \mm learning have also been studied from a theoretical standpoint. Prior work showed that learning with more modalities achieves a smaller population risk~\citep{huang2021makes}, and that utilizing cross-modal information can provably improve prediction in multiview learning\comment{ with missing views }~\citep{zhang2019cpm} or semi-supervised learning~\citep{sun2020tcgm}.
% \han{The transition here seems a bit abrupt, I would suggest putting the following sentence to make the transition more smooth. For example, you could even cite the very recent blog to further justify your argument: %\url{https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/}{https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/}.
% }
With the recent advances in training large-scale neural network models from multiple modalities \citep{devlin2018bert,brown2020language}, one emerging challenge lies in the \emph{modality selection} problem.

% \gargi{Suggestion to rephrase BEGIN}:\\
% In an attempt to solve the modality selection problem, we aim to provide a  theoretical grounding to the following intuitions:
% \begin{itemize}
%     \item Can we quantify the benefit i.e. the utility contribution of a modality to the training process in \mm learning? 
%     \item Does the marginal utility of adding a modality exhibit properties which can be leveraged to select a near optimal subset with guarantees on the training performance?
% \end{itemize}
% Note that it may be inefficient to learn from the entire set modalities as the total number of input modalities increases. One reason is that model complexity can scale linearly or even exponentially with the number of input modalities (\cite{zadeh2017tensor,liu2018efficient}), raising large consumption in computational and energy resources. Furthermore, being able to select the most beneficial modalities can help proactively reduce the cost of collecting modalities which do not contribute as much. For example, in sensor placement problems, finding the optimal subset of sensors for a learning objective (e.g., temperature or traffic prediction) eliminates the cost of maintaining extra sensors (\cite{krause2011robust}). 
% In this context, we formulate a theoretical framework of the optimization problem of modality selection in \mm learning. To this end, our contributions are as follows:
% \begin{enumerate}
%     \item We \textit{quantify} the benefit of a modality to the training process by introducing a utility measure.
%     \item We present a setting under which this utility exhibits properties of monotonocity and approximate submodularity, allowing us to come up with efficient approximate algorithms for this optimization problem.
%     \item We establish a novel correspondence between our utility measure and existing marginal contribution scores based on the Shapely Value under our proposed settings, which further justifies the choice of this utility. 
% \end{enumerate}
% \gargi{Suggestion to rephrase END}
% Our work is based on the intuition of conditional independence between modalities given the target variable. \gargi{We need to show that this is different from the one used in multiview learning}
% \gargi{State that benefit of our assumption on modality level >> that on feature level}



From the modeling perspective, it might be tempting to use all the modalities available. However, it becomes inefficient or even infeasible to learn from all modalities as the total number of input modalities increases. A modality often consists of high-dimensional data, and model complexity can scale linearly or exponentially with the number of input modalities~\citep{zadeh2017tensor,liu2018efficient}, resulting in large consumption of computational and energy resources. The marginal benefit from new modalities may also decrease as more modalities are included. In some cases, learning from fewer modalities is sufficient to achieve the desired outcome, due to the potential overlap in the information provided by these modalities. Furthermore, proactively selecting the modalities most informative towards prediction reduces the cost of collecting the inferior ones. For example, in sensor placement problems where each sensor can be treated as a modality, finding the optimal subset of sensors for a learning objective (e.g., temperature or traffic prediction) eliminates the cost of maintaining extra sensors~\citep{krause2011robust}.



In light of the aforementioned challenges, in this paper, we study the optimization problem of modality selection in \mm learning: given a set of input modalities and a fixed budget on the number of selected modalities, how should one select a subset that optimizes prediction performance? Note that in general this problem is combinatorial in nature, since one may have to enumerate all the potential subsets of modalities in order to find the best one. Hence, without further assumptions on the structure of the underlying prediction problem, it is intractable to solve this modality selection problem exactly and efficiently. 


% To approach the above challenges, \han{write the following part at the end. The current summarization on the contributions does not sound very exciting to me. Some of the points to be mentioned in the contribution section include: 1). identifying a proper assumption that is suitable for multimodal/multiview learning that helps us to develop efficient approximate algorithms; 2). provable guarantee on the performance of the selected subset for prediction; 3). connection to Shapley values, MCI, etc.}

% definition of utility.

To approach these challenges, we propose a utility function that conveniently quantifies the benefit of any set of modalities towards prediction in most typical learning settings. We then identify a proper assumption that is suitable for \mm/multiview learning, which allows us to develop efficient approximate algorithms for modality selection. We assume that the input modalities are approximately conditionally independent given the target. Since the strength of conditional independence is now parameterized, our results are generalizable to problems on \mm data with different levels of conditional independence.

We show that our definition of utility for a modality naturally manifests as the Shannon mutual information between the modality and the prediction target, in the setting of binary classification with cross-entropy loss.
% We focus on this problem setting for the simplicity of discussion.
Under approximate conditional independence, mutual information is monotone and approximately submodular. These properties intrinsically describe the empirical advantages of learning with more modalities, and allow us to formulate modality selection as a submodular optimization problem. In this context, we obtain efficient selection algorithms with provable performance guarantees on the selected subset. For example, we show a performance guarantee of the greedy maximization algorithm from \citet{nemhauser1978analysis} under approximate submodularity.
Further, we connect modality selection to marginal-contribution-based feature importance scores in feature selection. We examine the Shapley value and Marginal Contribution Feature Importance (MCI)~\citep{catav2021marginal} for ranking modality importance. We show that these scores, although intractable to compute in general, can be computed efficiently under certain assumptions in the context of modality selection. Lastly, we evaluate our theoretical results on three classification datasets. The experimental results confirm both the utility and the diversity of the selected modalities. To summarize, we make the following contributions in this paper:
\begin{itemize}
    \item Propose a general measure of modality utility, and identify a proper assumption that is suitable for \mm learning and helpful for developing efficient approximate algorithms for modality selection.
    \item Demonstrate, both theoretically and empirically, an algorithm with a performance guarantee on the selected modalities for prediction in classification problems with cross-entropy loss.
    \item Establish theoretical connections between modality selection and feature importance scores, i.e., Shapley value and Marginal Contribution Feature Importance. 
\end{itemize}

% \han{Could consider using a list of bullet points to succinctly describe the contributios so that it is easier for readers to follow and appreciate. The current one is a bit too rough (not concrete enough) and vague. } Our main contribution is providing an initial theoretical formulation to the modality selection problem, and a study on potential approaches and theoretical implications to this problem.

% \input{preliminary}
\section{Preliminaries}
\label{sec:preliminary}


In this section, we first describe our notation and problem setup, and we then provide a brief introduction to submodular function maximization and feature importance scores.

\subsection{Notation and Setup}
We use $X$ and $Y$ to denote the random variables that take values in the input space $\fX$ and the output space $\fY$, respectively. Their instantiations are denoted by $x$ and $y$. We use $\fH$ to denote the hypothesis class of predictors from input to output space, and $\hat Y$ to denote the predicted variable. Let $\fX$ be multimodal, i.e., $\fX = \fX_1 \times ... \times \fX_k$. Each $\fX_i$ is the input space of the $i$-th modality. We use $X_i$ to denote the random variable that takes values in $\fX_i$, and $V$ to denote the full set of all input modalities, i.e., $V = \{X_1, ..., X_k\}$. Throughout the paper, we often use $S$ and $S'$ to denote arbitrary subsets of $V$. Lastly, we use $I(\cdot; \cdot)$ to denote the Shannon mutual information, $H(\cdot)$ for entropy, $\ell_{ce}(Y, \hat Y)$ for the cross-entropy loss $-\1{Y=1}\log\hat{Y} - \1{Y=0}\log(1-\hat{Y})$, and $\ell_{01}(Y, \hat Y)$ for the zero-one loss $\1{Y\neq \hat Y}$.  

For simplicity of discussion, we primarily focus on the setting of binary classification with cross-entropy loss\footnote{We choose the binary-class setting for ease of exposition; our general proofs and results directly extend to the multi-class setting. 
We use the binary case only to derive the conditional entropy (supplementary material) and to showcase \cref{cor:greedyLossBound}.
}.
% \han{What does it mean to be ``given output $Y$'' here?}
In this setting, a subset of input modalities $S \subseteq V$ and output $Y \in \{0, 1\}$ are observed. The predictor aims to make a prediction $\hat Y \in [0, 1]$ that minimizes the cross-entropy loss between $Y$ and $\hat Y$. The goal of modality selection is to select the 
% \han{avoid using adjective like ``beneficial'' here since it is ambiguous without a commonly agreed definition. Saying the subset of modalities to minimize the loss is clear enough}
subset of input modalities that best serves this loss minimization goal under certain constraints. Our results rely on the following assumption. 
% We assume the same set of modalities will be available during both training and testing, but not all $k$ modalities are always available. In other words, input data is from only a subset of modalities of the whole input space.

\begin{restatable}[$\epsilon$-Approximate Conditional Independence]{as}{eCondIndep}
\label{as:eCondIndep}
There exists a non-negative constant $\epsilon \geq 0$ such that, $\forall S, S'\subseteq V, S\cap S' = \emptyset$, we have $I(S; S' \mid Y) \leq \epsilon$.
\end{restatable}
Note that when $\epsilon = 0$, \cref{as:eCondIndep} reduces to strict conditional independence between disjoint modalities given the target variable. In fact, this is a common assumption used in prior work in multimodal learning~\citep{white2012convex, wu2018multimodal,sun2020tcgm}. In practice, however, strict conditional independence is often difficult to satisfy. Thus, we use the more general assumption above, in which input modalities are approximately conditionally independent. In this assumption, the strength of the conditional independence relationship is controlled by the non-negative constant $\epsilon$, which upper-bounds the conditional mutual information between disjoint sets of modalities given the target. 
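
\textbf{Illustration.}~~To give some intuition for how \cref{as:eCondIndep} is used, the following short derivation may help. It is only a sketch for intuition and not one of the formal results. For disjoint $S, S' \subseteq V$, expanding $I(S'; S, Y)$ with the chain rule of mutual information in two different orders gives
\begin{align*}
    I(S'; S, Y) &= I(S'; S) + I(S'; Y \mid S) \\
                &= I(S'; Y) + I(S'; S \mid Y),
\end{align*}
and therefore
\begin{equation*}
    I(S'; Y \mid S) = I(S'; Y) - I(S; S') + I(S; S' \mid Y) \leq I(S'; Y) + \epsilon,
\end{equation*}
where the inequality uses $I(S; S') \geq 0$ and $I(S; S' \mid Y) \leq \epsilon$. In words, conditioning on modalities that are already selected can increase the information that $S'$ carries about $Y$ by at most $\epsilon$.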

\textbf{Connection to feature selection.}~~It is worth mentioning that modality selection shares a natural correspondence to the problem of feature selection. Without loss of generality, a modality could be considered as a group of features; theoretically, the group could even contain a single feature in some settings. However, a distinction between these two problems lies in the feasibility of conditional independence. In \mm learning, where input data is often heterogeneous, the (approximate) conditional independence assumption is more likely to hold among input modalities. At the feature level, in contrast, such an assumption is quite restrictive~\citep{zhang2012kernel}, as it boils down to asking the data to approximately satisfy the Naive Bayes assumption.



\subsection{Submodular Optimization}
Submodularity is a property of set functions that has many theoretical implications and applications in computer science. A definition of submodularity is as follows, where $2^V$ denotes the power set of $V$, and the set function $f$ assigns to each subset $S\subseteq V$ a value $f(S)\in\R$.

\begin{restatable}[\cite{nemhauser1978analysis}]{defi}{submodularity}
\label{defi:submodularity}
Given a finite set $V$, a function $f: 2^V \to \sR$ is submodular if for any $ A \subseteq B \subseteq V$, and $e \in V \setminus B$, we have $f(A\cup \{e\}) - f(A) \geq f(B\cup \{e\}) - f(B)$.
\end{restatable}

In other words, adding a new element to a larger set does not yield a larger marginal benefit compared to adding it to a subset of that set. One common optimization problem on submodular functions is maximization under a cardinality constraint: find a subset $S \subseteq V$ that maximizes $f(S)$ subject to $|S| \leq q$. Finding the optimal solution to this problem is \nphard. However, \citet{nemhauser1978analysis} show that a greedy maximization algorithm provides, in \polytime, a solution with an approximation guarantee relative to the optimal solution. We provide the pseudocode of this greedy algorithm below.  
% \han{In the pseudocode of Algorithm 1, why do you need the $p$ here?}

\begin{algorithm}
\DontPrintSemicolon
\caption{Greedy Maximization}\label{algo:greedyOrginal}
    \KwData{Full set $V = \{X_1, ..., X_k\}$, constraint $q \in \sZ^{+}$.}
    \KwInput{$f: 2^V\to \sR$, and $p \in \sZ^{+}$, where $p \leq q \leq |V|$}
    \KwOutput{Subset $S_p$}
    $S_0 = \emptyset$\;
    \For{$i = 0, 1, ..., p-1$}{
        $X^i = \argmax_{X_j \in V \setminus S_{i}} (f(S_{i} \cup \{X_j\}) - f(S_{i}))$\;
        $S_{i+1} = S_{i} \cup \{X^i\}$\;
    }
\end{algorithm}
In this algorithm, $V$ is the full set to select elements from, $f$ is the submodular function to be maximized, $p$ is the number of iterations for the algorithm to run, and $q$ is the cardinality constraint.
% \han{It seems to me that $q$ is the iteration number instead?}
It starts with an empty set $S_0$, and at each iteration $i$ adds to the current set $S_i$ the element $X^i$ that maximizes the marginal gain $f(S_{i} \cup \{X_j\}) - f(S_{i})$. Iteration $i$ evaluates the marginal gain of at most $|V| - i$ candidates, so \cref{algo:greedyOrginal} makes $\mathcal{O}(p|V|)$ evaluations of $f$ in total and runs in \polytime whenever $f$ can be evaluated in \polytime. It has an approximation guarantee as follows.

\begin{restatable}[\cite{nemhauser1978analysis}]{thm}{greedyBound}
\label{thm:greedyBound}
Let $f$ be a non-negative, monotone submodular function, let $q \in \sZ^+$, let $S_p$ be the solution of \cref{algo:greedyOrginal} after $p$ iterations, and let $e$ be Euler's number. Then:
\begin{equation}\label{eq:optgap}
    f(S_p) \geq (1-e^{-\frac{p}{q}})\max_{S: |S|\leq q}f(S)
\end{equation}
\end{restatable}

$\max_{S: |S|\leq q}f(S)$ is the \optimalvalue attained by the optimal subset whose cardinality is at most $q$. If $f$ is monotone, there exists an optimal subset with cardinality exactly $q$. By running \cref{algo:greedyOrginal} for exactly $q$ iterations, we obtain a \greedyvalue that is at least a $1-\frac{1}{e}$ fraction of the \optimalvalue.
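
\textbf{Illustration.}~~To give a sense of scale for \cref{eq:optgap}, the guarantee can be evaluated at a few ratios $p/q$. These ratios are chosen purely for illustration and are not taken from the experiments:
\begin{align*}
    p = q:   &\quad 1 - e^{-1}   \approx 0.632, \\
    p = q/2: &\quad 1 - e^{-1/2} \approx 0.393, \\
    p = q/4: &\quad 1 - e^{-1/4} \approx 0.221.
\end{align*}
For instance, with $q = 10$, stopping the greedy algorithm after $p = 5$ iterations guarantees roughly $39\%$ of the best value attainable with ten elements, and running all ten iterations raises the guarantee to roughly $63\%$. These are worst-case lower bounds; the value actually attained by the greedy solution may be considerably higher.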


\comment{
Other than maximization, one might be interested in finding the minimum-cardinality subset that achieves the submodular value of the full set, i.e., $\min_{S\subseteq V} |S|$ subject to $f(S) = f(V)$. This problem is tightly related to problems (e.g., set covering) studied in \cite{johnson1974approximation, dobson1982worst}, and it is \npcomplete. \cite{wolsey1982analysis} proves \cref{algo:greedyOrginal} can provide a solution with approximation guarantee to the optimal solution.
\begin{restatable}{thm}{minCoverage}
\label{thm:minCoverage}
Let $V$ be the full set, and $f: 2^V \to \sR$ be a monotone submodular function. Let $l$ be the smallest index such that $f(S_l) = f(V)$ where $S_l$ is the greedy solution, and $l^* = \min_{S\subseteq V} |S|$ such that $f(S) = f(V)$. We have:
\begin{equation}
    l \leq \left(1 + \ln{\frac{f(V) - f(\emptyset)}{f(V) - f(S_{l-1})}}\right)l^*
\end{equation}
\end{restatable}
In other words, constraining on $f(S_l) = f(V)$, the greedily achieved minimum cardinality $l$ is at most a multiplicative factor of the optimal minimum cardinality $l^*$.
}

\subsection{Feature Importance Scores}
\label{sec:preliminary:feature}

% Feature importance scores have been proposed to measure the contribution of individual features in machine learning (\cite{shapley1953value, deeplift2017, lundberg2017unified, frye2020asymmetric, covert2020understanding, catav2021marginal}).

The feature importance domain in machine learning studies scoring methods that measure the contribution of individual features. A common setting of these feature importance scoring methods is to treat each feature as a participant in a coalitional game, in which all of them contribute to an overall gain. A scoring method then assigns each feature an importance score by evaluating its individual contribution. Many notable feature importance scores are adapted from the Shapley value, defined as follows:

\begin{restatable}[\cite{shapley1953value}]{defi}{shapley}
\label{defi:shapley}
Given a set of all players $F$ in a coalitional game $v: 2^F \to \sR$, the Shapley value of player $i$ defined by $v$ is:
\begin{equation}
    \phi_{v, i} = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!(|F|-|S|-1)!}{|F|!} (v(S\cup \{i\}) - v(S))
\end{equation}
\end{restatable}

The Shapley value of a player $i$ is a weighted average of its marginal contributions in game $v$ over all subsets of players excluding $i$. The game $v$ is a set function that computes the gain of a set of players. Computing the exact Shapley value of a player is \sharpphard, and its complexity is exponential in the number of players in the coalitional game: there are $\mathcal{O}(2^{|F|})$ unique subsets, and each subset $S$ could have a unique marginal contribution $v(S\cup \{i\}) - v(S)$ \citep{roth1988shapley,winter2002shapley}. Nonetheless, in certain game settings, there are approximation methods for the Shapley value, such as Monte Carlo simulation \citep{faigle1992shapley}.
% ,fatima2008linear,michalak2013efficient
When the Shapley value is adapted to the feature importance domain, each input feature is a player, and $v$ is also called the \textit{evaluation function}. However, $v$ is not unique: it can be a prediction model or a utility measure, and different choices of $v$ may induce different properties of the Shapley value.
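
\textbf{Illustration.}~~As a small worked instance of \cref{defi:shapley}, consider a game with two players, $F = \{1, 2\}$. This two-player game is hypothetical and used only for illustration. The subsets of $F \setminus \{1\}$ are $\emptyset$ and $\{2\}$, with weights $\frac{0!\,1!}{2!} = \frac{1!\,0!}{2!} = \frac{1}{2}$, so
\begin{equation*}
    \phi_{v, 1} = \tfrac{1}{2}\bigl(v(\{1\}) - v(\emptyset)\bigr) + \tfrac{1}{2}\bigl(v(\{1, 2\}) - v(\{2\})\bigr),
\end{equation*}
and symmetrically for player $2$. Summing the two scores gives $\phi_{v, 1} + \phi_{v, 2} = v(\{1, 2\}) - v(\emptyset)$, i.e., the scores split the total gain of the full coalition between the players.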

% We also examine another feature importance score from \citet{catav2021marginal}, known as the Marginal Contribution Feature Importance (MCI). \citet{kumar2020problems} has shown that Shapley-value-based methods \citep{shapley1953value,lundberg2017unified,covert2020understanding} could underestimate the importance of correlated features by assigning them lower scores if these features present together in the full set. In light of this, MCI is proposed to overcome this issue. It is defined as the maximum marginal contribution of all possible feature combinations.
We also examine another feature importance score from \citet{catav2021marginal}, known as the Marginal Contribution Feature Importance (MCI). \citet{kumar2020problems} show that Shapley-value-based feature importance scores \citep{shapley1953value,lundberg2017unified,covert2020understanding} could underestimate the importance of correlated features by assigning them lower scores if these features are present together in the full set. MCI is proposed to overcome this issue. In \cref{defi:mci}, the MCI of a feature $i$ is the maximum marginal contribution in $v$ over all possible feature subsets. The complexity of computing the exact MCI of a feature is also exponential in the number of features. 

\begin{restatable}[\cite{catav2021marginal}]{defi}{mci}
\label{defi:mci}
Given a set of all features $F$, and a non-decreasing set function $v: 2^F \to \sR$, the MCI of feature $i$ evaluated on $v$ is:
\begin{equation}
    \phi_{v, i}^{mci} = \max_{S\subseteq F} (v(S\cup \{i\}) - v(S))
\end{equation}
\end{restatable}
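
\textbf{Illustration.}~~The difference between \cref{defi:shapley} and \cref{defi:mci} can be seen in a toy game with two fully redundant features. The game is hypothetical and chosen only for illustration. Let $F = \{1, 2\}$ with $v(\emptyset) = 0$ and $v(\{1\}) = v(\{2\}) = v(\{1, 2\}) = 1$, which is non-decreasing. Then
\begin{align*}
    \phi_{v, 1} &= \tfrac{1}{2}\bigl(v(\{1\}) - v(\emptyset)\bigr) + \tfrac{1}{2}\bigl(v(\{1, 2\}) - v(\{2\})\bigr) = \tfrac{1}{2}(1) + \tfrac{1}{2}(0) = \tfrac{1}{2}, \\
    \phi_{v, 1}^{mci} &= \max_{S \subseteq F}\bigl(v(S \cup \{1\}) - v(S)\bigr) = v(\{1\}) - v(\emptyset) = 1.
\end{align*}
The Shapley value halves the score of feature $1$ because its copy is also present, and adding a third identical feature would lower it further to $\frac{1}{3}$, whereas the MCI remains $1$ in both cases.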

% The computation of the exact Shapley value or exact MCI is exponential to the number of features. This is a common drawback of these feature importance scores, and prior work rely on approximate techniques for feature selection \citep{lundberg2017unified}. In our case of modality selection, we shall show that the exact computation of these values can be done in \polytime under certain assumptions.

% \input{result}
\section{Modality Selection}


This section presents our theoretical results. In \cref{sec:function}, we introduce the utility function to measure the prediction benefit of modalities, and present its properties. In \cref{sec:algorithm}, we present theoretical guarantees of greedy modality selection via maximizing an approximately submodular function. In \cref{sec:feature}, we show computational advantages of feature importance scores (i.e., Shapley value and MCI) in the context of modality selection. Due to space limits, proofs are deferred to the supplementary material.

\subsection{Utility Function}
\label{sec:function}

In order to compare the benefit of different sets of input modalities in \mm learning, we motivate a general definition of utility function that can quantify the impact of a set of input modalities towards prediction.

% \han{IIRC, in the original definition $c$ is a constant in the output space directly, with $\ell(Y, c)$.} \sam{Yes, i changed it, but actually $c$ is fine}
\begin{restatable}{defi}{utility}
\label{defi:utility}
Let $c$ be some constant in the output space, and $\ell(\cdot, \cdot)$ be a loss function. For a set of input modalities $S \subseteq V$, the utility of $S$ given by the utility function $\util: 2^V \to \sR$ is defined to be:
\begin{equation}
    \util(S) \coloneqq \inf_{c\in \fY} \Exp[\ell(Y, c)]-\inf_{h\in\fH} \Exp[\ell(Y, h(S))]
\end{equation}
\end{restatable}

In other words, the utility of a set of modalities $\util(S)$ is the reduction in the minimum expected loss in predicting $Y$ achieved by observing $S$, compared to predicting with some constant value $c$. The intuition is based on the phenomenon that \mm input tends to reduce prediction loss in practice. 
% \sam{Potential citations: \cite{degroot, pmlr-v139-catav21a}.}
% Note that the above formulation is widely used in both information theory and machine learning context. \cite{degroot} defined the reduction of the expected uncertainty after obtaining new results from an experiment in an equivalent way. Also, \cite{pmlr-v139-catav21a} used universal predictive function to quantify the amount of information that can be extracted from the features on the target variable. Their function also has the same form as above.
\cref{defi:utility} is easily interpretable under different loss functions and learning settings. Note that it is also used in feature selection to measure the universal predictive power of a given feature \citep{covert2020understanding}. Under the binary classification setting with cross-entropy loss, $\util$ is the Shannon mutual information between the output and the multimodal input. 
% \han{I would call the following a proposition rather than a lemma, since lemma is often used for technical tools towards proving a main theorem. The following result is more like a claim.}
\begin{restatable}{prop}{utilityCE}
\label{prop:utilityCE}
Given $Y \in \{0, 1\}$ and $\ell(Y, \hat Y) \coloneqq -\1{Y=1}\log\hat{Y} - \1{Y=0}\log(1-\hat{Y})$, $\util(S) = I(S; Y)$.
\end{restatable}
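
A brief sketch of why this identity holds, assuming that $\ell$ is the negative log-likelihood and that $\fH$ is rich enough to contain the conditional probability $s \mapsto \Pr(Y = 1 \mid S = s)$ (both are assumptions made for this illustration): the best constant prediction is $c = \Pr(Y = 1)$, which attains the entropy $H(Y)$, and the best predictor on $S$ attains the conditional entropy $H(Y \mid S)$. Hence
\begin{align*}
    \util(S) &= \inf_{c\in \fY} \Exp[\ell(Y, c)] - \inf_{h\in\fH} \Exp[\ell(Y, h(S))] \\
             &= H(Y) - H(Y \mid S) = I(S; Y).
\end{align*}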

The result above is well-known, and has also been proven in~\citep{grunwald2004game,farnia2016minimax}. We can further show that $I(S; Y)$ is monotonically non-decreasing in the set of input modalities $S$.
\begin{restatable}{prop}{montonicCE}
\label{prop:montonicCE}
$\forall M \subseteq N \subseteq V$, $I(N; Y) - I(M; Y) = I(N \setminus M; Y \mid M) \geq 0$.
\end{restatable}
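
The identity is an instance of the chain rule for mutual information. Writing $N = M \cup (N \setminus M)$, we have
\begin{equation*}
    I(N; Y) = I(M, N \setminus M; Y) = I(M; Y) + I(N \setminus M; Y \mid M),
\end{equation*}
and the inequality follows because conditional mutual information is non-negative.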

Together, \cref{prop:utilityCE} and \cref{prop:montonicCE} imply that using more modalities as input leads to equivalent or better prediction. They also show that \cref{defi:utility} quantitatively captures the extra prediction benefit from the additional modalities in closed form (i.e., $I(N \setminus M; Y \mid M)$). This extra benefit is most apparent when the test loss reaches convergence (i.e., the $\inf$ in \cref{defi:utility}). The monotonicity property is also a key indication that \cref{defi:utility} can intrinsically characterize the advantage of \mm learning over unimodal learning.

\textbf{Comparison to previous results.}~~Previous works~\citep{amini2009learning,huang2021makes} reached similar conclusions, namely that more views/modalities will not lead to a worse optimal population error, in the context of multiview and multimodal learning, respectively. They obtained this observation by analyzing the excess risks of learning from multiple and single modalities, and showed that the excess risk of learning from multiple modalities cannot be larger than that of learning from a single modality. In contrast, our work adopts an information-theoretic characterization, which leads to an easy-to-interpret measure of the benefits of additional modalities. Furthermore, using well-developed entropy estimators, it is relatively straightforward to estimate these measures in practice. By comparison, excess risks are hard to estimate in practice, since they depend on the Bayes optimal errors, which limits their use in many applications.

Next, we show that $\util(S) = I(S; Y)$ is approximately submodular under \cref{as:eCondIndep}.
Previously, \citet{krause2012near} showed mutual information to be submodular under strict conditional independence. Here we provide a more flexible notion of submodularity for mutual information. There are also other generalizations of submodularity, such as weak submodularity~\citep{khanna2017scalable} or adaptive submodularity~\citep{golovin2011adaptive}. Our definition of approximate submodularity is more specific to the case of mutual information and \cref{as:eCondIndep}.

\begin{restatable}{prop}{eSubmodularityMI}
\label{prop:eSubmodularityMI} 
Under \cref{as:eCondIndep}, $ I(S; Y)$ is $\epsilon$-approximately submodular, i.e., $\forall A \subseteq B \subseteq V$, $e \in V \setminus B$, $I(A \cup \{e\}; Y) - I(A; Y) + \epsilon \geq I(B \cup \{e\}; Y) - I(B; Y)$.
\end{restatable}

The above proposition states that if the conditional mutual information between input modalities given the output is at most a certain threshold $\epsilon > 0$, then the utility function $\util(\cdot) = I(\cdot; Y)$ admits a diminishing gain pattern controlled by $\epsilon$. This diminishing gain pattern is the definition of submodularity (\cref{defi:submodularity}). When the conditional mutual information is zero, the input modalities are strictly conditionally independent, and $I(\cdot; Y)$ is strictly submodular.
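
One way to see where the $\epsilon$ slack comes from is sketched below. It reads \cref{as:eCondIndep} as $I(S; S' \mid Y) \leq \epsilon$ for disjoint $S, S' \subseteq V$; the assumption is stated outside this section, so this form is taken as given here. It also writes $I(e; \cdot)$ as shorthand for $I(\{e\}; \cdot)$. Expanding $I(e; Y, A)$ by the chain rule in both orders gives, for any $A \subseteq V$ and $e \notin A$,
\begin{equation*}
    I(A \cup \{e\}; Y) - I(A; Y) = I(e; Y \mid A) = I(e; Y) + I(e; A \mid Y) - I(e; A).
\end{equation*}
Applying this to $A \subseteq B$ and subtracting,
\begin{align*}
    I(e; Y \mid B) - I(e; Y \mid A) &= \bigl[I(e; B \mid Y) - I(e; A \mid Y)\bigr] - \bigl[I(e; B) - I(e; A)\bigr] \\
    &\leq I(e; B \mid Y) \leq \epsilon,
\end{align*}
since $I(e; A \mid Y) \geq 0$ and $I(e; B) \geq I(e; A)$ whenever $A \subseteq B$.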


\subsection{Modality Selection via Approximate Submodularity}
\label{sec:algorithm}

With \cref{prop:eSubmodularityMI}, we can formulate the problem of modality selection as a submodular function maximization problem with a cardinality constraint, i.e., $\max_{S\subseteq V} I(S; Y)$ subject to $|S| \leq q$. Usually, $q$ is considerably smaller than $|V|$. However, \cref{thm:greedyBound} from \citet{nemhauser1978analysis} is applicable to $I(\cdot; Y)$ only if it is strictly submodular. The approximation guarantee therefore differs in our case, because the strength of submodularity of $I(\cdot; Y)$ is controlled by the upper bound on the conditional mutual information under \cref{as:eCondIndep}. Under the approximate conditional independence assumption, we show the following result.

% % In the following, we propose two lines of approaches to the modality selection problem. First, we formulate modality selection as a submodular function maximization problem, and show the approximation guarantee of a greedy solution to such problem. Then we leverage feature importance scores to propose a modality importance ranking solution to modality selection. Specifically, we describe the new computational advantages of these scores when they are adapted to the context of modality selection.

% \subsubsection{Submodular Function Maximization}

% In modality selection, given a total of $k$ modalities, the goal is to find $q$ modalities which yields maximum utility towards prediction, where $q$ is usually much smaller than $k$. When this problem is formulated into submodular function maximization, the set function to be maximized would be the utility function $\util$. And as shown in \cref{prop:utilityCE} and \cref{prop:eSubmodularityMI}, in the setting of binary classification with cross-entropy loss, $\util(\cdot)$ equals to the Shannon mutual information $I(\cdot; Y)$, and is approximately submodular \wrt some $\epsilon > 0$ under \cref{as:eCondIndep}. In the following, we show an approximation guarantee on maximizing $I(\cdot; Y)$ under $\epsilon$-approximate submodularity based on prior results on absolute submodularity from \cite{nemhauser1978analysis}.

\begin{restatable}{thm}{greedyBoundMI}
\label{thm:greedyBoundMI}
Under \cref{as:eCondIndep}, let $q \in \sZ^+$ and let $S_p$ be the solution from \cref{algo:greedyOrginal} at iteration $p$. Then:
\begin{equation}
    I(S_p; Y) \geq (1-e^{-\frac{p}{q}})\max_{S: |S|\leq q}I(S; Y) - q\epsilon
\end{equation}
\end{restatable}

To summarize, \cref{thm:greedyBoundMI} states that any subset of selected modalities produced by \cref{algo:greedyOrginal} has an approximation guarantee in the setting of classification with cross-entropy loss. Since $I(\cdot; Y)$ is monotonically non-decreasing, we can run \cref{algo:greedyOrginal} for $p=q$ iterations to get the best possible \greedyvalue that is at least $1-\frac{1}{e}$ of the \optimalvalue minus $q\epsilon$. The $q\epsilon$ term characterizes the fact that, if the to-be-optimized function is not strictly submodular, the upper bound $\epsilon$ on the conditional mutual information could cause a larger approximation error as the algorithm runs longer. Nonetheless, when $\epsilon = 0$, our result in \cref{thm:greedyBoundMI} reduces to \cref{thm:greedyBound}.
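
The bound can be sketched with the standard greedy argument, adapted to carry the $\epsilon$ slack. This is an illustrative outline rather than the full proof. It assumes that \cref{algo:greedyOrginal} starts from $S_0 = \emptyset$ and adds the modality with the largest marginal gain at each iteration, and the shorthand $\delta_p$ is introduced only for this sketch. Let $S^* \in \argmax_{S: |S|\leq q} I(S; Y)$ and $\delta_p \coloneqq I(S^*; Y) - I(S_p; Y)$. Adding the at most $q$ elements of $S^* \setminus S_p$ to $S_p$ one at a time, monotonicity and \cref{prop:eSubmodularityMI} give
\begin{equation*}
    I(S^*; Y) \leq I(S_p \cup S^*; Y) \leq I(S_p; Y) + q\bigl(I(S_{p+1}; Y) - I(S_p; Y)\bigr) + q\epsilon,
\end{equation*}
because each marginal gain is at most the greedy gain at $S_p$ plus $\epsilon$. Rearranging yields $\delta_{p+1} \leq (1 - \frac{1}{q})\delta_p + \epsilon$, and unrolling from $\delta_0 = I(S^*; Y)$ gives
\begin{equation*}
    \delta_p \leq \Bigl(1 - \frac{1}{q}\Bigr)^{p} I(S^*; Y) + \epsilon \sum_{i=0}^{p-1} \Bigl(1 - \frac{1}{q}\Bigr)^{i} \leq e^{-\frac{p}{q}} I(S^*; Y) + q\epsilon,
\end{equation*}
which is the statement of \cref{thm:greedyBoundMI} after substituting the definition of $\delta_p$. For $p = q$, the multiplicative factor is $1 - e^{-1} \approx 0.632$.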

% Following \cref{thm:greedyBoundMI}, we can obtain an upper bound for the optimal expected cross-entropy loss as well as the optimal expected zero-one loss for the greedy solution as \cref{cor:01GreedyBound}.

Using \cref{thm:greedyBoundMI}, we can further obtain a bound on the minimum expected cross-entropy loss and the minimum expected zero-one loss achieved by the \greedyset. Let $S^*$ denote the \optimalset $\argmax_{S: |S|\leq q}I(S; Y)$. Then:

\begin{restatable}{cor}{greedyLossBound}
\label{cor:greedyLossBound}
Assume the conditions in \cref{thm:greedyBoundMI} hold. Then there exists an optimal predictor $h^*(S_p) = \Pr(Y\mid S_p)$ such that
\begin{align}
    \Exp[\ell_{01}(Y, h^*(S_p))] \leq{}& \Exp[\ell_{ce}(Y, h^*(S_p))] \nonumber \\
    \leq{}& H(Y) - (1-e^{-\frac{p}{q}})I(S^*; Y) + q\epsilon
\end{align}
\end{restatable}
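
A short derivation of the cross-entropy part of this bound, using the fact that the expected cross-entropy loss of $h^*(S_p) = \Pr(Y \mid S_p)$ equals the conditional entropy $H(Y \mid S_p)$:
\begin{align*}
    \Exp[\ell_{ce}(Y, h^*(S_p))] &= H(Y \mid S_p) = H(Y) - I(S_p; Y) \\
    &\leq H(Y) - (1-e^{-\frac{p}{q}})I(S^*; Y) + q\epsilon,
\end{align*}
where the last step applies \cref{thm:greedyBoundMI} to $I(S_p; Y)$.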

\cref{cor:greedyLossBound} shows that the minima of both losses achieved by $\Pr(Y\mid S_p)$ are no more than the uncertainty of the target output minus the lower bound on our \greedyvalue from \cref{thm:greedyBoundMI}. We can also upper bound the difference in minimum cross-entropy losses achieved by the \greedyset and the \optimalset.

\begin{restatable}{cor}{greedyLossDiff}
\label{cor:greedyLossDiff}
Assume the conditions in \cref{thm:greedyBoundMI} hold. There exist optimal predictors $h_1^* = \Pr(Y\mid S_p)$, $h_2^* = \Pr(Y\mid S^*)$ such that
\begin{align}
    \Exp[\ell_{ce}(Y, h_1^*(S_p))] - \Exp[\ell_{ce}(Y, h_2^*(S^*))] \nonumber \\ \leq e^{-\frac{p}{q}}I(S^*; Y) +q\epsilon
\end{align}
\end{restatable}

This result expresses a guarantee on the maximum loss difference between the \greedyset and the \optimalset using optimal predictors. Both bounds from \cref{cor:greedyLossBound} and \cref{cor:greedyLossDiff} are parameterized by the duration and constraint ($p$, $q$) of \cref{algo:greedyOrginal}, as well as the approximation error induced by $\epsilon$. As the algorithm attempts to select a larger set of modalities (larger $q$), the $q\epsilon$ term grows and both bounds become looser.
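
The bound in \cref{cor:greedyLossDiff} can be traced in one line, again using the fact that the optimal cross-entropy loss on a set $S$ equals $H(Y \mid S) = H(Y) - I(S; Y)$:
\begin{align*}
    H(Y \mid S_p) - H(Y \mid S^*) &= I(S^*; Y) - I(S_p; Y) \\
    &\leq I(S^*; Y) - (1-e^{-\frac{p}{q}})I(S^*; Y) + q\epsilon = e^{-\frac{p}{q}}I(S^*; Y) + q\epsilon.
\end{align*}
For example, at $p = q$ the first term is $e^{-1} I(S^*; Y) \approx 0.368\, I(S^*; Y)$.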


Overall, under the setting described in \cref{sec:preliminary}, the (approximate) submodularity of the utility function allows us to obtain a solution in \polytime with an approximation guarantee for modality selection under a cardinality constraint. Under this theoretical formulation, we can directly extend results from other submodular optimization problems to solve modality selection problems with different constraints and objectives \citep{wolsey1982analysis, krause2014submodular}.

\subsection{Modality Importance}
\label{sec:feature}

We also examine the possibility of adapting feature importance scores to the context of modality selection, by using them to rank individual modalities. Specifically, we consider Shapley value and MCI. We will show that computing both the Shapley value and the MCI of a modality is efficient if our utility function is used as the underlying evaluation function.
% In this section, we examine the possibility of adapting features importance scores to the context of modality selection, by using them to rank individual modalities. When the utility function (\cref{defi:utility}) serves as the evaluation function for Shapley value and MCI, computing these two scores of a modality can be efficient. Again, we focus on the setting of binary
As previously shown, the utility of a modality is $\util(\{X_i\}) = I(X_i; Y)$ in the setting of classification with cross-entropy loss. To proceed, we first show the following property of $I(\cdot; Y)$.


% An alternative to dynamically selecting modalities is to statically rank all $k$ modalities by their individual contributions towards prediction, and select the top-$q$ ranked modalities that have the highest contribution scores. Existing feature importance scores described in \cref{sec:preliminary:feature} provide a fertile starting point for developing this modality ranking paradigm to solve modality selection. For example, our utility function $\util$, which attempts to measure each modality's contribution towards prediction, can be seamlessly used as the evaluation function in existing feature importance scores.

% adapting feature importance scores to modality selection can bring forth new computational advantages that are harder to be true in feature selection. Specifically, under a few simple assumptions, the exact Shapley value and MCI of a modality can be computed efficiently if $\util$ from \cref{defi:utility} is the evaluation function. For simplicity, we focus on the setting of binary classification with cross-entropy loss, in which $\util(\cdot) = I(\cdot; Y)$. First, we present our results on using $I(\cdot; Y)$ as the evaluation for Shapley value in modality selection.

\begin{restatable}{prop}{subadditiveMI}
\label{prop:subadditiveMI}
Under \cref{as:eCondIndep}, $I(S; Y)$ is $\epsilon$-approximately sub-additive, i.e., for any $S, S'\subseteq V$, $I(S\cup S'; Y) \leq I(S; Y) + I(S'; Y) + \epsilon$.
\end{restatable}

\textbf{Shapley value.}~~In the classic definition (\cref{defi:shapley}), the complexity of computing the exact Shapley value of a player is exponential. However, because \cref{defi:shapley} involves a summation of the marginal contributions $I(S\cup\{X_i\}; Y) - I(S; Y)$, we can leverage sub-additivity to upper bound the Shapley value $\phi_{I, X_i}$ by a weighted sum of $I(X_i; Y)$ over all possible subsets. Analogously, super-additivity provides a lower bound on $\phi_{I, X_i}$, again expressed in terms of $I(X_i; Y)$. Putting the two bounds together gives us an efficient approximation of $\phi_{I, X_i}$. However, for $I(S; Y)$ to be super-additive, the variables in $S$ must be marginally independent. Thus, we further introduce \cref{as:eMarginalIndep} for this setting. Although \cref{as:eMarginalIndep} is seemingly stronger than \cref{as:eCondIndep}, it allows us to approximate the Shapley value of a modality efficiently, with a better guarantee parameterized by $\epsilon$, as the following shows.
% Solving the exact Shapley value requires exponential time. However, because the original definition of Shapley value involves a summation of the marginal gain $I(S\cup\{X_i\}; Y) - I(S; Y)$, we can leverage the sup-additivity above to provide an upper bound of the Shapley value $\phi_{I, X_i}$ via a summation over all possible combinations expressed by $I(X_i; Y)$. Similarly, a super-additivity property of $I(\cdot;\cdot)$ should provide a lower bound of the Shapley value expressed by $I(X_i; Y)$. But super-additivity on $I(\cdot; \cdot)$ would require marginal independence between input modalities. Although marginal independence condition is stronger than \cref{as:eCondIndep}, in cases where such condition hold, it will provide great convenience in approximating Shapley value efficiently with a tighter guarantee.

\begin{restatable}[$\epsilon$-Approximate Marginal Independence]{as}{eMarginalIndep}
\label{as:eMarginalIndep}
There exists a positive constant $\epsilon > 0$ such that, $\forall S, S'\subseteq V, S\cap S' = \emptyset$, we have $I(S; S') \leq \epsilon$.
\end{restatable}

\begin{restatable}{prop}{superadditiveMI}
\label{prop:superadditiveMI}
Under \cref{as:eMarginalIndep}, $I(S; Y)$ is $\epsilon$-approximately super-additive, i.e., for any $S, S'\subseteq V$, $I(S\cup S'; Y) \geq I(S; Y) + I(S'; Y) - \epsilon$.
\end{restatable}

\begin{restatable}{prop}{shapleyMI}
\label{prop:shapleyMI}
If conditions in \cref{prop:subadditiveMI} and \cref{prop:superadditiveMI} hold, we have $ I(X_i; Y) - \epsilon \leq \phi_{I, X_i} \leq I(X_i; Y) + \epsilon$ for any $X_i \in V$.
\end{restatable}

If \cref{prop:subadditiveMI} holds, the Shapley value of any modality $X_i \in V$ is upper bounded by its own prediction utility plus $\epsilon$, i.e., $\phi_{I, X_i} \leq I(X_i; Y) + \epsilon$. On the other hand, if \cref{prop:superadditiveMI} also holds, we can further lower bound the Shapley value, i.e., $I(X_i; Y) - \epsilon \leq \phi_{I, X_i}$. In both bounds, $I(S\cup\{X_i\}; Y) - I(S; Y)$ is replaced by $I(X_i; Y)$ up to $\epsilon$, and the fraction factors sum to 1. If both \cref{prop:subadditiveMI} and \cref{prop:superadditiveMI} hold with $\epsilon = 0$, $I(\cdot; Y)$ is additive, in which case the Shapley value of a modality is exactly its prediction utility, i.e., $\phi_{I, X_i} = I(X_i; Y)$. Furthermore, by the efficiency property of the Shapley value, we must have $I(V; Y) = \sum_{X_i \in V} \phi_{I, X_i}$.
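
The averaging step can be made explicit as follows, assuming \cref{defi:shapley} takes the usual form (the definition lies outside this section, and the weight notation $w(S)$ is introduced only for this sketch):
\begin{equation*}
    \phi_{I, X_i} = \sum_{S \subseteq V \setminus \{X_i\}} w(S)\bigl[I(S\cup\{X_i\}; Y) - I(S; Y)\bigr], \quad w(S) = \frac{|S|!\,(|V| - |S| - 1)!}{|V|!}, \quad \sum_{S \subseteq V \setminus \{X_i\}} w(S) = 1.
\end{equation*}
Applying \cref{prop:subadditiveMI} and \cref{prop:superadditiveMI} with $S' = \{X_i\}$ bounds every marginal contribution:
\begin{equation*}
    I(X_i; Y) - \epsilon \leq I(S\cup\{X_i\}; Y) - I(S; Y) \leq I(X_i; Y) + \epsilon.
\end{equation*}
Since the weights are non-negative and sum to one, the weighted average $\phi_{I, X_i}$ satisfies the same two-sided bound, which requires evaluating only $I(X_i; Y)$.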

% On the other hand, if conditional independence and marginal independence between modalities are not satisfied, computing the exact Shapley value of each modality based on \cref{defi:shapley} requires exponential time. Furthermore, as mentioned in \cref{sec:preliminary:feature}, Shapley value can underestimate the importance of correlated features by assigning them lower scores. This might be an issue in modality selection because distinct but correlated modalities should be still selected if learning them can benefit the prediction more than the others.

\textbf{MCI.}~~As claimed by \citet{catav2021marginal}, MCI has an extra benefit over Shapley value (\cref{sec:preliminary:feature}). By definition, computing the MCI of a feature requires $\mathcal{O}(2^{|F|})$ evaluations, where $|F|$ is the total number of features. However, if the evaluation function of MCI is submodular, we can efficiently compute the exact MCI. Using \cref{prop:eSubmodularityMI}, we have the following result.
\begin{restatable}{prop}{efficientMCI}
\label{prop:efficientMCI}
Under \cref{as:eCondIndep}, $\forall X_i \in V$, we have $I(X_i; Y) \leq \phi_{I, X_i}^{mci} \leq I(X_i; Y) + \epsilon$.
\end{restatable}
If $\epsilon = 0$, $I(S; Y)$ is strictly submodular for any $S\subseteq V$, and the MCI of a modality is exactly its prediction utility, i.e., $\phi_{I, X_i}^{mci} = I(X_i; Y)$. If \cref{prop:subadditiveMI} holds with $\epsilon=0$, $I(\cdot; Y)$ is sub-additive, and then $I(V; Y) \leq \sum_{X_i \in V} I(X_i; Y) = \sum_{X_i \in V} \phi_{I, X_i}^{mci}$. If \cref{prop:superadditiveMI} further holds with $\epsilon=0$, then $I(\cdot; Y)$ is additive, and we obtain an efficiency property of the MCI in this problem setting, i.e., $I(V; Y) = \sum_{X_i \in V} \phi_{I, X_i}^{mci}$.
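
The two sides of \cref{prop:efficientMCI} can be sketched as follows, assuming the MCI of \citet{catav2021marginal} is the maximal marginal contribution over all subsets (the definition lies outside this section):
\begin{equation*}
    \phi_{I, X_i}^{mci} = \max_{S \subseteq V} \bigl[I(S\cup\{X_i\}; Y) - I(S; Y)\bigr].
\end{equation*}
Choosing $S = \emptyset$ gives the lower bound $\phi_{I, X_i}^{mci} \geq I(X_i; Y)$, since $I(\emptyset; Y) = 0$. For the upper bound, \cref{prop:eSubmodularityMI} with $A = \emptyset$ and $B = S$ shows that for every $S$ not containing $X_i$,
\begin{equation*}
    I(S\cup\{X_i\}; Y) - I(S; Y) \leq I(X_i; Y) + \epsilon,
\end{equation*}
while the marginal contribution is zero when $X_i \in S$. Hence a single evaluation of $I(X_i; Y)$ determines the MCI up to $\epsilon$.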

\textbf{Modality selection via MCI ranking.}~~In light of these properties, we can consider ranking individual modalities by Shapley value or MCI as an alternative to greedy maximization for modality selection. The ranking algorithm computes the Shapley value or MCI for all modalities, and returns the $q$ modalities with the highest scores, where $q$ is the subset size limit. One advantage of this approach is its complexity of $\mathcal{O}(|V|)$, while greedy maximization requires $\mathcal{O}(q|V|)$. As shown above, computing the Shapley value efficiently requires an additional assumption (\cref{as:eMarginalIndep}), so MCI ranking is preferable.
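
As a rough count of utility evaluations, assuming greedy maximization evaluates each remaining candidate once per iteration (an assumption about how \cref{algo:greedyOrginal} is implemented) and ignoring the cost of each evaluation, greedy selection of $q$ modalities needs
\begin{equation*}
    \sum_{i=0}^{q-1} (|V| - i) = q|V| - \frac{q(q-1)}{2}
\end{equation*}
evaluations, whereas ranking needs $|V|$. For illustration, with $|V| = 49$ and a hypothetical limit $q = 10$, this is $490 - 45 = 445$ evaluations for greedy maximization versus $49$ for ranking.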

% Based on the above properties, we can consider ranking individual modalities by Shapley value or MCI as an alternative to select modalities besides greedy maximization. One advantage of modality ranking is it only requires a complexity of $\mathcal{O}(|V|)$, while greedy maximization requires $\mathcal{O}(q|V|)$. However, computing exact Shapley value efficiently requires marginal independence to hold, which is often a stronger condition than conditional independence.

% \subsection{Furture Work}

% We have introduced a general function to measure the prediction utility of a set of modalities in modality selection, and presented the intriguing properties of this function in the setting of binary classification with cross-entropy loss. We also proposed two lines of approaches to solve modality selection based on (1) submodular function maximization and (2) modality importance score. 

% An immediate extension of this work is to study the utility function to other machine learning settings. For example, \cite{farnia2016minimax} has shown that $\util(S)$ from \cref{defi:utility} equals to $\Var(\Exp[Y\mid S])$ under quadratic loss. This quantity could potentially reveal other interesting properties of $\util$ in the regression setting. Another extension could consider different modality selection variants. For example, in practical applications, one may be more interested in finding a minimum subset of modalities, whose utility is at least a $\delta$-approximate to the utility of all modalities, where $\delta \in (0, 1)$ controls the approximate quality. 

% \input{experiment}
\section{Experiments}

We present an empirical evaluation of greedy maximization (\cref{algo:greedyOrginal}) and MCI ranking on three classification datasets.

\textbf{\pmnist.}~~\pmnist is a semi-synthetic static dataset built upon \mnist~\citep{lecun-mnisthandwrittendigit-2010}. Specifically, we divide each image in the original \mnist into non-overlapping square patches. Each patch location represents a single modality. We construct and experiment on two \pmnist variants: one has 49 patches of size $4\times 4$ pixels each, and the other has 9 patches with a side length of 9 or 10 pixels each. \pmnist has ten output classes, 50,000 training images, and 10,000 testing images.

\textbf{\pems.}~~\pems is a real-world time-series dataset from UCI~\citep{Dua:2019}. This dataset represents the traffic occupancy
rate of different freeways of the San Francisco bay area. The classification task is to predict the day of the week. Data is obtained from 963 sensors placed across the bay area, where each sensor represents a single modality. Each sensor has a time series with 144 time steps, which we down-sample to 36 by taking the means of size-4 windows. Running \cref{algo:greedyOrginal} requires $\mathcal{O}(q|V|)$ utility evaluations with $|V| = 963$, and each step requires training a new model. To mitigate extensive run-time, we experiment on 45 out of 963 sensors by filtering sensors in line for the same freeway.
% \footnote{Under the same model but different input dimensions, test accuracy before and after down-sampling the number of sensors and time steps are 82\% v/s 77\%  respectively.}\sam{thinking again, might not report}
There are a total of 440 instances (days), split into 200 training, 67 validation, and 173 test samples.

%  Each sample now has 45 input modalities, each modality has 144 features. The training, validation and test set has 200, 67 and 173 samples respectively.

% Mention downsampling from 144 to 36

\textbf{\mosi.}~~\mosi is a popular real-world benchmark dataset in affective computing and multimodal learning~\citep{zadeh2016mosi}. The task is 3-class sentiment classification (positive, neutral, negative) from 20 visual and 5 acoustic modalities with temporal features. Specifically, \mosi collects time-series facial action units and phonetic units from short video clips (10-second clips sampled at 5\,Hz). Each unit is a modality, and consists of a 50-dimensional feature vector. The training and testing sample sizes are 1284 and 686, respectively.


\textbf{Independence assumption validation.}~~We validate the independence conditions (e.g., \cref{as:eCondIndep}) on all datasets by comparing the mean conditional mutual information (MI) and the mean marginal MI of disjoint modalities \citep{gao2017estimating}. As shown in \cref{tab:condIndep}, the conditional MI is smaller than the marginal MI for \pmnist and \pems. Both conditional and marginal MI are small for \mosi. This suggests that modalities are approximately conditionally independent in these datasets.

\begin{table}[!htp]
\caption{Mean Marginal/Conditional Mutual Information}
\label{tab:condIndep}
\begin{tabular}{crr}
\toprule
\textbf{Dataset} &
\multicolumn{1}{c}{\textbf{Mean Marg. MI}} & \multicolumn{1}{c}{\textbf{Mean Cond. MI}} \\
\midrule
\textbf{Patch-MNIST} & 2.187 & 0.078 \\
\textbf{PEMS-SF} & 0.626 & 0.223 \\
\textbf{CMU-MOSI} & 0.064 & 0.069 \\
\bottomrule
\end{tabular}
\end{table}

\begin{figure*}
    \centering
    \includegraphics[width=\linewidth]{figs/mnist.pdf}
    \caption{Experiment results for \pmnist with 49 modalities (first row) and with 9 modalities (second row).}
    \label{fig:mnistplot}
\end{figure*}


\subsection{Implementation}
\label{sec:implementation}

We implement greedy maximization based on the pseudo-code in \cref{algo:greedyOrginal}. We implement MCI ranking by computing the MCI for each modality in the full set, and then selecting the top-ranked modalities with the largest MCIs.

\textbf{Utility estimation.}~~From \cref{prop:utilityCE}, the utility $\util(S)$ equals $I(S; Y)$, and $I(S; Y) = H(Y) - H(Y\mid S)$. Based on the variational formulation of the conditional entropy as the minimum cross-entropy, we approximate $H(Y\mid S)$ by the converged training loss of a model that predicts $Y$ from $S$~\citep{farnia2016minimax}. Accordingly, to estimate the marginal gain $I(X_j; Y\mid S_i)$ from \cref{algo:greedyOrginal} over high-dimensional data, we compute the difference $H(Y\mid S_i) - H(Y\mid S_i\cup \{X_j\})$~\citep{mcallester2020formal}. To compute the MCI of each modality $X_j$, we only need to compute $I(X_j; Y)$, according to \cref{prop:efficientMCI}.
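
In symbols, writing $\hat{L}(S)$ for the converged training cross-entropy loss of a model with input $S$ (notation introduced here only to summarize the estimators), the quantities above are approximated as
\begin{align*}
    H(Y \mid S) &\approx \hat{L}(S), \\
    I(S; Y) &\approx H(Y) - \hat{L}(S), \\
    I(X_j; Y \mid S_i) &\approx \hat{L}(S_i) - \hat{L}(S_i \cup \{X_j\}).
\end{align*}
Note that the marginal gain does not require an estimate of $H(Y)$, and that ranking modalities by $I(X_j; Y)$ is equivalent to ranking them by increasing $\hat{L}(\{X_j\})$, since $H(Y)$ is the same for every modality.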

% To estimate the marginal gain $I(X_j; Y\mid S_i)$ with high dimensional data in \cref{algo:greedyOrginal}, we use conditional entropy difference $H(Y\mid S) - H(Y\mid S\cup \{X_j\})$~\citep{mcallester2020formal}, where the conditional entropy $H(Y\mid S)$ is the converged training loss on input $S$~\citep{farnia2016minimax}. To implement MCI ranking, we find the conditional entropy $H(Y\mid X_i)$ for each modality $X_i$, and prioritize the modalities with smaller $H(Y\mid X_i)$. This is because $\phi_{I, X_i}^{mci}$ is approximately $I(X_i; Y)$ by \cref{prop:efficientMCI}, and $\argmax_{X_i\in V}I(X_i; Y) = \argmin_{X_i\in V}H(Y\mid X_i)$. Similarly, in order to estimate the utility (i.e., mutual information) $I(S; Y)$ of a set, we compute $H(Y) - H(Y\mid S)$ with the same approach as above. 

% network structure for both dataset (and parameters)

\textbf{Modeling.}~~We now describe the models for prediction and utility estimation. For \pmnist, we use a convolutional neural network with one convolutional layer, one max pooling layer and two fully-connected layers with ReLU for both estimation and prediction. The network is trained with the Adam optimizer at a learning rate of $10^{-3}$. For \pems, we use a 3-layer neural network with ReLU activation and batch normalization for estimation. This is trained with the Adam optimizer at a learning rate of $5\times 10^{-4}$. For prediction, we use a recent time-series classification pipeline~\citep{rocket} for time-series data processing\comment{which transforms the time series into features using convolutions}, followed by a linear Ridge Classifier~\citep{loning2019sktime}. For \mosi, we experiment with two prediction model types: a linear classifier with Rocket Transformation for time-series (same as the one for \pems); and a plain 3-layer fully-connected neural network with ReLU activation. On each dataset, the number of training epochs is the same for all evaluated approaches across different modality subset sizes.

%same model architecture for both utility estimation and prediction: a convolutional neural network with one convolutional layer, one max pooling layer and two fully-connected layers with ReLU. We also use the Adam optimizer with learning rate 0.001. \han{Briefly explain why on the PEMS dataset we need to use different models.} We use different models for utility estimation and prediction for \pems. For estimation, we use a 3-layer neural network with ReLU activation, batch normalization, and Adam optimizer with a learning rate of 0.0005. To mitigate over-parameterization, we use a simpler but time-series-preferred model for prediction: Rocket transformation followed by a linear Ridge Classifier~\citep{rocket, loning2019sktime}. 


\subsection{Experimental Procedures}
\label{sec:experimentprocedure}
In each iteration $i$ of \cref{algo:greedyOrginal} we execute the following: (1) for each candidate modality $X_j$: (a) train two models on $S_i$ and $S_i\cup \{X_j\}$ respectively until the training losses converge, (b) take the loss difference to be $I(X_j; Y\mid S_i)$; (2) record the test loss and accuracy from the model trained on $S_i \cup \{X^i\}$ before the model over-fits; (3) add the selected modality $X^i$ to $S_i$ and go to the next iteration. We use model parameters before over-fitting for prediction, and parameters after over-fitting for utility estimation.
% To detect over-fitting, we use early-stopping for \pmnist and validation set for \pems. 

Step (2) for \pems and \mosi is slightly different: we record and show the training loss before over-fitting instead of the test loss. This is because \pems and \mosi have a much smaller sample size than \pmnist, with potentially noisier features, so the model likely will not generalize stably. Thus, we first examine \cref{thm:greedyBoundMI} and MCI ranking on a larger sample set, which better represents the population and is not influenced by the generalization gap. Then we analyze the test accuracy to account for generalization.


For \pmnist with 49 modalities, \pems and \mosi, we evaluate \cref{algo:greedyOrginal} and MCI ranking against a randomized baseline at each set size. The randomized baseline selects a random modality at each iteration. For \pmnist with 9 modalities, we further include optimal and average baselines. At each set size $q$, the optimal baseline is the best value over all possible subsets of size $q$, and the average baseline is the average value over those subsets. We only implement the optimal baseline for the 9-modality case because evaluating all possible subsets of a larger set is expensive.

\textbf{Training cost.}~~At each iteration of \cref{algo:greedyOrginal}, the marginal utility gain for each candidate modality is evaluated. Since we estimate the conditional mutual information by training a neural network, and we need to evaluate each modality subset at different set sizes, each iteration involves model training. These experiments can be costly for large datasets and models. The training cost at each iteration of \cref{algo:greedyOrginal} depends on the utility variant used or, in this setting, on the mutual information estimation method.

\begin{figure}[!htp]
    \centering
    \includegraphics[width=\linewidth]{figs/mnist_select.pdf}
    \caption{Modality selection paths of \cref{algo:greedyOrginal} (first row) and ranking via MCI (second row) in \pmnist.}
    \label{fig:selection}
% \vspace*{-1.5em}
\end{figure}


\subsection{Results and Empirical Analysis}


\begin{figure*}
    \centering
    \includegraphics[width=\linewidth]{figs/pems.pdf}
    \caption{Experiment results for \pems.}
    \label{fig:pems}
\end{figure*}

\begin{figure*}
    \centering
    \includegraphics[width=\linewidth]{figs/mosi.pdf}
    \caption{Experiment results for \mosi.}
    \label{fig:mosi}
\end{figure*}

\subsubsection{\pmnist}

\cref{fig:mnistplot} shows the \pmnist experiment results. In this figure, ``Modality subset size'' refers to the size of the selected modality set. ``Utility'' refers to the utility of the selected set. ``Test CE Loss'' and ``Test Accuracy'' refer to the cross-entropy loss and prediction accuracy on test data from the model that is trained on the selected set.


\textbf{Utility.}~~An immediate observation is the high correlation among the utility, test cross-entropy loss and accuracy in both rows. The trend of the test accuracy seems identical to that of the utility, although they mildly differ when the set size exceeds 30. In addition, the utility and test loss are negatively correlated, consistent with \cref{defi:utility}. The utility spans a larger range than the test loss, potentially because the utility is estimated from the converged training loss, which is often reduced by a greater magnitude than the test loss. The utility shows a non-decreasing trend with diminishing gains, which matches the monotonicity and (approximate) submodularity shown in this setting. Adding more modalities is unnecessary if the subset is already large: in the 49-modality case, accuracy barely improves after 20 modalities are selected; in the 9-modality case, this pattern is less obvious.

% Overall, these results implies that our defined utility effectively reflects how a modality set performs in terms of loss.


\textbf{Greedy maximization.}~~
\comment{We now examine the performance of the greedy maximization (\cref{algo:greedyOrginal}) against its theoretical guarantee.}
\cref{algo:greedyOrginal} outperforms random selection in both cases. In \cref{fig:mnistplot} (second row), it outperforms the average by selecting the modality with maximum utility from the start, and its trajectory overlaps with the optimal one. In \cref{fig:mnistplot} (first row), \cref{algo:greedyOrginal} achieves near-maximum utility with only 7 modalities. These results validate the approximation guarantee from \cref{thm:greedyBoundMI}. In fact, the utility achieved empirically is much higher than the theoretical lower bound.
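
As a point of comparison only, recall the classical guarantee of \citet{nemhauser1978analysis} for the exactly submodular case. The symbols $f$, $k$, and $S_k$ are illustrative and assumed here: $f$ is a monotone submodular set function with $f(\emptyset)=0$, and $S_k$ is the set returned by greedy selection after $k$ steps. Then
\[
    f(S_k) \;\ge\; \left(1-\left(1-\tfrac{1}{k}\right)^{k}\right)\max_{|S|\le k} f(S) \;\ge\; \left(1-\tfrac{1}{e}\right)\max_{|S|\le k} f(S),
\]
where $1-1/e\approx 0.632$. \cref{thm:greedyBoundMI} concerns the approximately submodular setting, so its precise form differs from this display.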

\textbf{MCI ranking.}~~When the full set has few modalities (the 9-modality case), MCI ranking is as good as greedy maximization and the optimal baseline. When more modalities are available for selection (e.g., 49 modalities), \cref{algo:greedyOrginal} selects subsets that reduce the loss slightly further than the highest-ranked modalities do when the set size is below 15. 


\textbf{Modality selection path.}~~We plot the modality selection paths of \cref{algo:greedyOrginal} and MCI ranking in \cref{fig:selection}. MCI ranking selects the modalities that individually contain the most information about the output, namely the center regions. In contrast, the modalities selected by \cref{algo:greedyOrginal} are more diverse and cover different spatial locations of the original image, which gives them an advantage in gaining more information collectively.
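
One way to read this difference, sketched under the assumption that the utility is mutual information and with $X_1,\dots,X_k$ denoting (illustratively) the selected modalities in order of selection, is through the chain rule of mutual information:
\[
    I(X_1,\dots,X_k;Y) \;=\; \sum_{i=1}^{k} I\left(X_i;Y \mid X_1,\dots,X_{i-1}\right).
\]
A greedy step chooses the modality with the largest next summand, conditioned on what is already selected, whereas a fixed ranking scores each modality once, without regard to which other modalities end up in the set. A center patch that is informative on its own can therefore contribute little once neighboring center patches are present.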


% mnist
%  is the utility a good reflection of test loss reduction? what's the accuracy?
%  when there are few modalities, how does rank and greedy perform over optimal and average
%  when there are more modalities
% pems


% is divided into 49 non-overlapping chunks of size $4\times4$. Each chunk serves as a modality. We start with the empty set, then in each iteration, a new modality will be added to the current set and the performance will be measured until all modalities are selected. For instance, if the current subset contains $n$ modalities, each sample will have shape $n\times4\times4$. The training set contains 50K images and the test set contains 10K images.


% We evaluate the proposed algorithms on classification tasks by performing experiments on two datasets: the chunked MNIST dataset, a semi-synthetic dataset built upon MNIST  (\cite{lecun-mnisthandwrittendigit-2010}) and the PEMS-SF dataset (\cite{Dua:2019}), which is a real life dataset for the problem of sensor placement in the San Francisco Bay Area. The purpose of this section is as follows.
% \begin{itemize}
%     \item Compare the performance of the greedy and the importance ranking algorithm.
%     \item Empirically verify the submodularity of the utility function.
%     \item Verify the performance guarantee of \cref{algo:greedyOrginal} described in \cref{thm:greedyBoundMI}.
% \end{itemize}


% \subsection{Mutual Information Estimation}

% In both \cref{algo:greedyOrginal} and the modality importance ranking algorithm, we use mutual information as utility function to guide the modality selection. By \cref{prop:montonicCE}, when adding a new modality $x$ to a set of modalities $S$, the additional utility is quantified by $I(x;Y|S)$. By \cref{defi:utility}, 
% \[
%     I(x;Y|S)=\inf_{h\in\fH} \Exp[\ell_{ce}(Y, h(S))]-\inf_{h\in\fH} \Exp[\ell_{ce}(Y, h(S\cup \{x\}))].
% \]
% To estimate this quantity, in each greedy selection step, we train two neural networks on $S$ and $S\cup \{x\}$ respectively for all remaining modalities $x$ until convergence to obtain the minimum cross entropy loss and take the difference.

% For the modality importance ranking algorithm, the utility for each modality $x$ is 
% \[
%     I(x;Y)=\inf_{h\in\fH} \Exp[\ell_{ce}(Y, h(c))]-\inf_{h\in\fH} \Exp[\ell_{ce}(Y, h(x))].
% \]
% Similarly, we train two neural networks on a zero vector $c$ and each modality $x$ respectively until convergence to obtain the minimum cross entropy loss and use the difference as an estimation of $I(x;Y)$.

% \subsection{Chunked MNIST}
% \paragraph{Setup}

% Each image of the original MNIST dataset (\cite{lecun-mnisthandwrittendigit-2010}) is divided into 49 non-overlapping chunks of size $4\times4$. Each chunk serves as a modality. We start with the empty set, then in each iteration, a new modality will be added to the current set and the performance will be measured until all modalities are selected. For instance, if the current subset contains $n$ modalities, each sample will have shape $n\times4\times4$. The training set contains 50K images and the test set contains 10K images.

% \paragraph{Results} 

% We run \cref{algo:greedyOrginal} and the modality importance ranking algorithm on the dataset. We also run a random algorithm such that in each step, one random modality is added to serve as baseline. In each selection step, we use a neural network with 1 convolutional layer, 1 max pooling layer and 2 fully-connected layers with ReLU activation. 

% From Figure~\ref{fig:mnistplot}, we can see that the greedy algorithm consistently outperforms the ranking algorithm, especially at the beginning of the modality selection process. Both the greedy and the ranking algorithm significantly outperforms the random algorithm until most modalities are selected. The trend of the utility plot and the accuracy plot is very similar, showing that the utility we define is a good indicator for the test accuracy. In addition, the utility curve for the greedy algorithm demonstrates a clear concave pattern, which corresponds to the diminishing return property of submodular functions. Moreover, starting from 7 modalities, the utility of subsets selected by \cref{algo:greedyOrginal} reaches 90\% of the utility of all modalities, showing the near-optimal performance described in \cref{thm:greedyBoundMI}.

% In Figure~\ref{fig:sel_pattern}, the plots in a row shows the selected modalities (highlighted in yellow) every 2 selection steps. It demonstrates the selection process of the first 10 iterations of both algorithms, where the selected modalities differ the most. We can see that the greedy algorithm first select chunks that resembles the shape of a digit, while the ranking algorithm selects from center chunks to corners. This can explain why the greedy algorithm performs better at the earlier stage of modality selection.

% \begin{figure}[!htp]
%     \centering
%     \includegraphics[scale=0.4]{neurips-draft/plot/9_mi.png}
%     \caption{Utility plot for 9 modalities.}
%     \label{fig:res_9}
% \end{figure}

% From \cref{thm:greedyBoundMI}, the utility of each set of modalities chosen by \cref{algo:greedyOrginal} is lower bounded by 0.63 times the utility of  the optimal set with the same size. To find such optimal set, we need to form the powerset of all modalities, which is computationally intractable in the case of 49 modalities. Thus, we use Figure~\ref{fig:res_9} to illustrate the utility on 9 modalities with the same setting. We can see that for this simple task, the rank, greedy and optimal curves basically overlap with each other. This shows that empirically, \cref{algo:greedyOrginal} can achieve near-optimal performance that is much higher than the lower bound.



%Next, we present our results on a real life classification dataset taken from the UCI (\cite{Dua:2019}) repository. This data comes from sensors placed across different highways of the San Francisco bay area, which record the traffic occupancy rate. The task is to classify the day of the week. We chose 45 out of the 963 sensors by removing multiple sensors in line for the same freeway. 
%Each sensor records a time series which is sampled at every 10 minutes of the day. We treat each sensor as a modality, and this serves as a classic application for sensor placement wherein the task would be to optimize the number of sensors for downstream prediction.

\subsubsection{\pems}

\cref{fig:pems} shows our experiment results on \pems. In \cref{fig:pems}, the two leftmost plots show the utility and cross-entropy loss on the training data. The rightmost plot of \cref{fig:pems} instead shows the moving average of the test accuracy, because the model does not generalize stably \comment{and the accuracy is volatile}under the small sample size. 
\comment{We take the average across 3 trials owing to the randomness of the Rocket transform.(\cref{sec:experimentprocedure}).} 
%, and record the moving average of the test accuracy owing to the instability of learning on many modalities under small sample size \citep{huang2021makes}.

%\han{We only have one row in this figure. May need to change the description here. I may be able to check this section tomorrow noon.}

\textbf{Utility.}~~The differences in utility and loss among \cref{algo:greedyOrginal}, MCI ranking, and the random baseline are small, and all three quickly converge to their best possible values after selecting only a few modalities. This is potentially because almost every modality alone is sufficient to make the training loss small. However, greedily selected subsets still have slightly more utility than subsets from MCI ranking and the random baseline at every set size. Overall, we still observe that the utility is monotone and (approximately) submodular, and the utility achieved by \cref{algo:greedyOrginal} is consistent with \cref{thm:greedyBoundMI}. 

% \paragraph{Evaluation}The rightmost plot \cref{fig:pems} in shows the moving average test accuracy (specifically because the accuracy is volatile under small test sample size). We take the average across 3 trials owing to the randomness of the Rocket transform.

\textbf{Generalization.}~~From the test accuracy plot, we can see a clear advantage of the \greedyset over the others when the subset size is small. Meanwhile, MCI ranking is worse than the random baseline, which could imply that MCI ranking does not have as robust a performance guarantee as \cref{algo:greedyOrginal}.
% This implies that while training performance with MCI might be at par with the greedy maximization, the generalization capability is not always guaranteed. 
In addition, the test accuracy of \cref{algo:greedyOrginal} gradually decreases as more modalities are added. This is in line with the over-fitting artifact of greedy feature selection discussed by \citet{blanchet2008forward}. However, in the regime of good generalization, greedy maximization should preserve its performance guarantee at test time. 

%However, test accuracy of the top-ranked subset and randomly selected subset gradually outperforms \greedyset. This may be due to several reasons: (1) volatility from the small test sample size, (2) test loss of the linear Ridge Classifier for time series data is not cross-entropy loss, (3) some previous included features are influenced by features from the newly selected modalities such that their joint presence harms the model's generalizability.

\subsubsection{\mosi}
The results are similar for both prediction model types mentioned in \cref{sec:implementation} for \mosi. Thus, \cref{fig:mosi} only shows the \mosi evaluation results for the 3-layer fully-connected neural network. In \cref{fig:mosi}, the two leftmost plots show the utility and cross-entropy loss on the training data. The rightmost plot of \cref{fig:mosi} shows the moving average of the test accuracy, since the model lacks the capacity to generalize well on this dataset \comment{and the accuracy is volatile}under the small sample size. 

Overall, many observations from the other datasets still hold for \mosi. For example, the utility curve is approximately submodular and monotone as the number of selected modalities increases. Modalities selected by \cref{algo:greedyOrginal} and MCI ranking outperform randomly selected modalities, with more utility, lower training loss, and higher test accuracy, especially when the number of modalities is still small. On the other hand, potentially due to the simplicity of the model and the noisy features, we do not observe an increase in test accuracy as more modalities are included by \cref{algo:greedyOrginal} and MCI ranking.  


 %As the subset size increases, all the methods converge to a similar accuracy.
 %, suggesting possible over-fitting.
%\begin{figure}[!htp]
 %   \centering
   % \includegraphics[scale=0.4]{neurips-draft/plot/ce_loss_pems.png}\\
  %  \includegraphics[scale=0.4]{neurips-draft/plot/utility_pems.png}
  %  \caption{Training results for the performance of 3 algorithms for PEMS - SF.}
  %  \label{fig:fig_pems}
%\end{figure}

% It can be observed that amongst all 3 algorithms, the greedy algorithm out performs both MCI ranking and random selection methods on the training set - it achieves lower loss as well as higher utility at every step of the greedy iteration. 

% Note that we take the average across 3 trials due to the randomness of the Rocket transform. We record the moving average of the test accuracy owing to the small test sample size.

%\begin{figure}[!htp]
 %   \centering
  %  \includegraphics[scale=0.4]{neurips-draft/plot/t%estpems.png}
 %   \caption{Test Evaluation Result for PEMS-SF.}
  %  \label{fig:fig_pemstest}
%\end{figure}

% It is interesting to note the following observations:
% \begin{itemize}
%     \item There is an overall \textit{decrease} in accuracy as we increase the subset size for all 3 methods - which is most pronounced in the greedy method. This can be explained by  possible overfitting on the train dataset, as is also common in greedy forward feature selection(\cite{blanchet2008forward}). \item Till the point of good generalization i.e. around a subset size of 20 modalities, our algorithm beats the other two methods, with an accuracy difference of ~2\%. 
% \end{itemize}

% We can infer that in the regime of good generalization, our greedy algorithm can chose a better set of modalities as compared to the other counterparts. Further, we see that MCI does not perform competitively unlike in the Patch-MNIST example - which shows that while training performance with MCI might be at par with the greedy algorithm, the generalization capability is not guaranteed.


% \input{related-work}
% \vspace*{-1em}
\section{Related Work}
\label{sec:related}
% \han{I suggest putting the related work section after the experiment section since a lot of the discussions in this section already requires some preliminary knowledge that has not been introduced.}

\textbf{\mmcap Learning}~~
% Multimodal learning is a sub-domain of machine learning which aims to learning from data across multiple modalities (e.g., images, text).
\mmcap learning is a vital research area with many applications~\citep{liu2017facial, pittermann2010emotion, frantzidis2010classification}.
%, and has been covered by multiple surveys (\cite{guo2019deep, baltruvsaitis2018multimodal, atrey2010multimodal}). 
%The main problems of focus in  the research of \mm learning include the representation (\cite{ngiam2011multimodal, fukui2016multimodal}), translation (\cite{zhu2017toward,huang2018multimodal}), and fusion (\cite{zadeh2017tensor, liu2018efficient}) of \mm data.Recent work studied cross-modal learning applications (\cite{alikhani2020clue,thapliyal2020cross}), large-scale \mm pretraining (\cite{wu2021n, bapna2022mslam}), and multimodality in different learning scenarios such as meta-learning (\cite{vuorio2019multimodal}) or few-shot learning (\cite{tsimpoukelli2021multimodal}). 
Theoretically, \citet{huang2021makes} showed that learning with more modalities achieves a smaller population risk, and that this marginal benefit towards prediction can be upper bounded. However, the existing measure of marginal benefit~\citep{huang2021makes} is hard to interpret and cannot be easily estimated, and it does not provide further insight into the emerging modality selection problem.

\textbf{Submodular Optimization}~~
% Roughly speaking, a set function is submodular if it satisfies a diminishing marginal utility property. 
Thanks to the benign property of submodularity, many subset selection problems that are otherwise intractable now admit efficient approximate solutions~\citep{fujishige2005submodular,iwata2008submodular,krause2014submodular}. The first study of the greedy algorithm for submodular set functions dates back to~\citet{nemhauser1978analysis}. Since then, submodular optimization has been widely applied to diverse domains such as machine learning~\citep{wei2015submodularity},
distributed computing,
% (\cite{mirzasoleiman2013distributed}), 
and social network analysis \citep{zhuang2013influence}.  % In machine learning, submodular optimization is used in applications such as example-based clustering (\cite{gomes2010budgeted}), document and data summarization (\cite{lin2011class,mirzasoleiman2016fast}), and active learning (\cite{guillory2012active}). The types of submodular optimization problem vary. 
A typical problem is submodular maximization, which can be subject to a variety of constraints such as cardinality, matroid, or knapsack constraints~\citep{lee2010submodular,iyer2013submodular}. %buchbinder2014submodular There are also work in submodular minimization (\cite{iyer2013submodular, iwata2008submodular}), and optimization on various generalized notions of submodularity (\cite{{golovin2011adaptive,elenberg2018restricted}}). 
In our case, we extend the results of \citet{nemhauser1978analysis} to the case of approximate submodularity of mutual information in a \mm learning setting.
% \han{Mention how our work is related to the existing ones. It's OK to say that we apply the techniques in the literature to develop a greedy algo. that works when only approx. submodularity is met.}

\textbf{Feature Selection}~~Feature selection seeks a feature subset that can speed up learning, improve prediction, and provide better interpretability of the data/model~\citep{li2017feature,chandrashekar2014survey}. 
%Literature in this domain is vastly diverse with rich history~\citep{dy2004feature, liu2005toward, stewart1993early,wold1987principal}. 
Here we briefly touch on the related work on feature selection that is most relevant to our context. Information-theoretic measures such as mutual information have been used as metrics for feature selection~\citep{brown2012conditional,fleuret2004fast,chen2018learning}. For example, \citet{brown2012conditional} present a unified information-theoretic feature selection framework via conditional likelihood maximization. There is also work on feature selection in regression problems through submodular optimization~\citep{das2011submodular}. %A line of more recent work focused on interpreting predictions through feature importance attribution~\citep{lundberg2020local,lundberg2017unified}. These line of work are mostly building upon the concept of Shapley value from \citet{shapley1953value} -- a feature importance score from cooperative game theory (\cref{sec:preliminary:feature}). Since the original Shapley value is intractable, some prior work attempt to understand under what circumstances its computation complexity can be improved~\citep{lundberg2017unified,van2021tractability,covert2020understanding}. 
In our context, the distinction between the problems of modality selection and feature selection lies in the assumptions on the underlying data (\cref{sec:preliminary}).


% \input{conclusion}
% \vspace*{-1em}
\section{Conclusion}
In this paper, we formulate a theoretical framework for optimizing modality selection in \mm learning. In this framework, we propose a general utility function that quantifies the impact of a modality on prediction, and identify assumptions suitable for \mm learning. In the case of binary classification under cross-entropy loss, we show that the utility function conveniently manifests as Shannon mutual information and preserves approximate submodularity, which allows simple yet efficient modality selection algorithms with approximation guarantees. We also connect modality selection to feature importance scores by showing the computational advantages of using the Shapley value and MCI to rank modality importance. Lastly, we evaluate our results on a semi-synthetic dataset, \pmnist, and two real-world datasets, \pems and \mosi.


% For modality selection in multimodal learning, we propose a utility measure of modalities and analyze two selection algorithms under the approximate conditional independence assumption. In particular, we show that the utility is the Shannon mutual information under the cross entropy loss, and we prove it to be approximately submodular given the assumption. Based on these observations, we demonstrate the performance guarantee of a greedy maximization algorithm and the efficiency of modality importance ranking algorithm using Shapley value and MCI. The theoretical results are verified with the experiments on the \pmnist dataset and the PEMS-SF dataset.


% \begin{acknowledgements} % will be removed in pdf for initial submission,
%                          % so you can already fill it to test with the
%                          % ‘accepted’ class option
% \end{acknowledgements}

\bibliography{citation}
\end{document}
