\documentclass{midl} % Include author names

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

\usepackage{mwe} % to get dummy images
\usepackage{array}
\usepackage{multirow}
\usepackage{booktabs}
\usepackage{colortbl}
\usepackage[table]{xcolor}
\usepackage{wrapfig}

\newcolumntype{P}[1]{>{\centering\arraybackslash}p{#1}}


% \definecolor{bestsingle}{HTML}{EDE1F6}
% \definecolor{bestdual}{HTML}{FFF4CE}
% \definecolor{besttriple}{HTML}{DCEFFB}

\definecolor{bestsingle}{HTML}{FAD8D4}
\definecolor{bestdual}{HTML}{D6E9D5}
\definecolor{besttriple}{HTML}{AFE3E6}
\definecolor{myblue}{RGB}{65,105,225}

\jmlrvolume{-- 223}
\jmlryear{2026}
\jmlrworkshop{Full Paper -- MIDL 2026}
\editors{Accepted for publication at MIDL 2026}

\title[AdaFuse]{AdaFuse: Adaptive Multimodal Fusion for Lung Cancer Risk Prediction via Reinforcement Learning}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicated cases, e.g. with dual affiliations and joint authorship
\midlauthor{\Name{Chongyu Qu\nametag{$^{1}$}} \Email{chongyu.qu@vanderbilt.edu}\\
\Name{Zhengyi Lu\nametag{$^{1}$}} \Email{zhengyi.lu@vanderbilt.edu}\\
\Name{Yuxiang Lai\nametag{$^{3}$}} \Email{yuxianglai117@gmail.com}\\
\Name{Thomas Z. Li\nametag{$^{1}$}} \Email{thomas.z.li@vanderbilt.edu}\\
\Name{Junchao Zhu\nametag{$^{1}$}} \Email{junchao.zhu@vanderbilt.edu}\\
\Name{Junlin Guo\nametag{$^{1}$}} \Email{junlin.guo@vanderbilt.edu}\\
\Name{Juming Xiong\nametag{$^{1}$}} \Email{juming.xiong@vanderbilt.edu}\\
\Name{Yanfan Zhu\nametag{$^{1}$}} \Email{yanfan.zhu@vanderbilt.edu}\\
\Name{Yuechen Yang\nametag{$^{1}$}} \Email{yuechen.yang@vanderbilt.edu}\\
\Name{Allen J. Luna\nametag{$^{1,2}$}} \Email{allen.j.luna@vanderbilt.edu}\\
\Name{Kim L. Sandler\nametag{$^{2}$}} \Email{kim.sandler@vumc.org}\\
\Name{Bennett A. Landman\nametag{$^{1,2}$}} \Email{bennett.landman@vanderbilt.edu}\\
\Name{Yuankai Huo\nametag{$^{1}$}} \Email{yuankai.huo@vanderbilt.edu}\\
\addr $^{1}$ Vanderbilt University, Nashville, TN, USA, 37215 \\
\addr $^{2}$ Vanderbilt University Medical Center, Nashville, TN, USA, 37232 \\
\addr $^{3}$ Emory University, Atlanta, GA, USA, 30322 \\
}

\begin{document}

\maketitle

\begin{abstract}
Multimodal fusion has emerged as a promising paradigm for disease diagnosis and prognosis, integrating complementary information from heterogeneous data sources such as medical images, clinical records, and radiology reports. However, existing fusion methods process all available modalities through the network, either treating them equally or learning to assign different contribution weights, leaving a fundamental question unaddressed: \textbf{for a given patient, should certain modalities be used at all?} We present AdaFuse, an adaptive multimodal fusion framework that leverages reinforcement learning (RL) to learn patient-specific modality selection and fusion strategies for lung cancer risk prediction. AdaFuse formulates multimodal fusion as a sequential decision process, in which a policy network iteratively decides whether to incorporate an additional modality or proceed to prediction based on the information already acquired. This sequential formulation enables the model to condition each selection on previously observed modalities and terminate early when sufficient information is available, rather than committing to a fixed subset upfront. We evaluate AdaFuse on the National Lung Screening Trial (NLST) dataset. Experimental results demonstrate that AdaFuse achieves the highest AUC (0.762), surpassing the best single-modality baseline (0.732), the best fixed fusion strategy (0.759), and adaptive baselines including DynMM (0.754) and MoE (0.742), while using fewer FLOPs than all triple-modality methods. Our work demonstrates the potential of reinforcement learning for personalized multimodal fusion in medical imaging, representing a shift from uniform fusion strategies toward adaptive diagnostic pipelines that learn when to consult additional modalities and when existing information suffices for accurate prediction. Code is publicly available at \href{https://github.com/hrlblab/adafuse}{https://github.com/hrlblab/adafuse}.
\end{abstract}

\begin{keywords}
Reinforcement Learning, Multimodal Fusion, Risk Prediction
\end{keywords}

\section{Introduction}
\label{sec:intro}
Lung cancer remains the leading cause of cancer-related mortality worldwide~\cite{rivera2013establishing}, with early detection being critical for improving patient outcomes~\cite{blandin2017progress}. Low-dose computed tomography (LDCT) screening has demonstrated significant potential in reducing lung cancer mortality~\cite{bonney2022impact,ostrowski2018low}, as evidenced by the National Lung Screening Trial (NLST)~\cite{national2011reduced}. Beyond imaging, clinical variables such as age, smoking history, and family history provide complementary risk factors~\cite{tammemagi2013selection}, while radiology reports~\cite{hans2020patient} capture expert observations and contextual information. The integration of these heterogeneous data sources through multimodal fusion~\cite{cui2022survival,liu2023m} has emerged as a promising direction for more accurate and comprehensive risk prediction.

\begin{figure}[t]
    \centering
    \includegraphics[width=1\textwidth]{img/fig_intro.png}
    \caption{\textbf{Comparison of multimodal fusion paradigms.} Existing multimodal fusion methods (left two) process all modalities uniformly: operation-based fusion applies fixed combination rules, while Mixture-of-Experts learns soft weights but still requires all inputs. AdaFuse (right) makes sequential discrete decisions to select patient-specific modality subsets, with the flexibility to entirely exclude uninformative modalities from computation.}
    \label{fig:intro}
\end{figure}



Multimodal fusion methods have evolved considerably over the past years. Early approaches relied on simple operations such as concatenation~\cite{mobadersany2018predicting,yap2018multimodal} or mean pooling~\cite{cheerla2019deep,ghosal2021g} to combine features from different modalities. Tensor-based methods employing Kronecker products~\cite{chen2020pathomic,wang2021gpdbn} were later introduced to capture higher-order feature interactions, though at the cost of increased computational complexity. To address this, low-rank factorization techniques~\cite{liu2018efficient,sahay2020low} have been proposed to reduce the parameter overhead while preserving expressive power. Attention mechanisms~\cite{zhu2020multimodal,lu2023multi} have gained popularity for their ability to learn cross-modal interactions and dynamically weight modality contributions. More recently, Mixture-of-Experts (MoE) architectures~\cite{cao2023multi,han2024fusemoe} route inputs through specialized sub-networks, providing input-dependent feature processing. Despite their architectural differences, all these methods process every available modality through the network, either treating them equally or learning to assign different contribution weights (i.e., soft selection), as illustrated in Figure~\ref{fig:intro} (left). This leaves a fundamental question unaddressed: \textbf{for a given patient, should certain modalities be used at all?}





In clinical practice, the diagnostic value of each modality varies across individuals~\cite{acosta2022multimodal,huang2020fusion}. Some patients may benefit from multimodal integration, while for others, a single modality suffices or additional modalities introduce noise rather than complementary information. Existing methods cannot entirely exclude uninformative modalities from computation; instead, they uniformly process all inputs regardless of their utility for individual patients. This not only incurs unnecessary computational costs but also limits the model's ability to truly personalize the diagnostic pipeline.

To address this limitation, we propose \textbf{AdaFuse}, an adaptive multimodal fusion framework that formulates modality selection as a sequential decision process. As shown in Figure~\ref{fig:intro} (right), at each step, a policy network decides whether to incorporate an additional modality or to proceed to prediction with the current selection. Unlike existing approaches that process all modalities and learn to weight their contributions, AdaFuse makes discrete decisions to select or skip each modality, providing the flexibility to use different modality combinations for different patients. This formulation naturally mirrors clinical practice, where physicians selectively order diagnostic tests based on patient-specific factors rather than exhaustively acquiring all available data~\cite{winslow1988appropriateness,ball2015improving}.

Our contributions are summarized as follows:

\begin{enumerate}
\item We propose AdaFuse, a reinforcement learning framework that formulates multimodal fusion as a sequential decision process, enabling patient-specific modality selection for lung cancer risk prediction.

\item We conduct comprehensive experiments on the NLST dataset with three modalities, comparing against single-modality baselines and various fusion strategies including concatenation, mean pooling, tensor fusion, and MoE.
\item We provide detailed analysis of learned fusion policies, offering insights into how the model adapts its modality selection across different patient subgroups and demonstrating the potential of adaptive fusion for personalized medical diagnosis.
\end{enumerate}


\section{Method}
\label{sec:method}
We present AdaFuse, an adaptive multimodal fusion framework that learns patient-specific modality selection strategies through reinforcement learning. The key insight is that different patients may benefit from different modality combinations: some patients may require comprehensive multimodal integration, while others may achieve accurate predictions with fewer modalities. Rather than applying a fixed fusion strategy uniformly, AdaFuse formulates modality selection as a sequential decision process, where a policy network learns to identify the optimal modality subset for each patient.

\begin{figure}[t]
    \centering
    \includegraphics[width=1\textwidth]{img/fig_method.png}
    \caption{\textbf{Overview of the AdaFuse framework.} \textit{Data preprocessing} extracts features from three modalities (CT images, clinical variables, and text reports) using pretrained encoders; details are provided in Section~\ref{sec:dataset}. The \textbf{AdaFuse framework} consists of three components: (1) \textit{Modality encoders} project each input feature to a shared 32-dimensional representation, with a binary mask $\mathbf{m} \in \{0, 1\}^3$ tracking selected modalities. (2) \textit{State encoder} concatenates the masked features with the selection mask and maps them to a 64-dimensional state vector that captures the current selection status. (3) \textit{Policy heads} make sequential decisions: Step 1 selects the primary modality from $\{A, B, C\}$; Step 2 decides whether to add a second modality or proceed to prediction; Step 3 optionally incorporates the third modality and selects a fusion strategy from $\{\text{Concat}, \text{Mean}, \text{Tensor}\}$. After each selection, the corresponding mask entry is updated from 0 to 1, and the state encoder recomputes the state representation for the next decision. The selected modality combination is passed to the corresponding pretrained classifier among 15 fusion classifiers covering all valid modality-fusion combinations.}
    \label{fig:method}
\end{figure}

Figure~\ref{fig:method} illustrates the overall architecture. Given a patient with three available modalities, AdaFuse first encodes each modality into a compact representation. A policy network then makes sequential decisions: (1) selecting the primary modality, (2) deciding whether to add a second modality, and (3) optionally incorporating the third modality along with a fusion strategy. The selected modalities are fused and passed to a classifier for prediction. The entire framework is trained end-to-end using a combination of supervised learning and policy gradient optimization.

\subsection{Preliminaries and Notations}
\label{sec:prelim}
We consider a multimodal learning setting with three modalities: CT images, clinical variables, and radiology text reports. For each patient, we denote the raw input features as $f^A \in \mathbb{R}^{512}$ for CT image features, $f^B \in \mathbb{R}^{17}$ for clinical variables, and $f^C \in \mathbb{R}^{768}$ for text embeddings. Each modality is processed by a dedicated encoder $E^A$, $E^B$, $E^C$ to produce encoded representations $h^A = E^A(f^A)$, $h^B = E^B(f^B)$, $h^C = E^C(f^C)$, where all encoded features share a common dimension $d=32$. The modality selection process is tracked by a binary mask $\mathbf{m} = [m^A, m^B, m^C]^\top \in \{0, 1\}^3$, where $m^i = 1$ indicates that modality $i$ has been selected.

\subsection{Sequential Modality Selection}
\label{sec:sequent}

We formulate adaptive modality selection as a Markov Decision Process (MDP)~\cite{bellman1957markovian,puterman2014markov}, where the policy learns to sequentially construct the optimal modality subset for each patient.

\smallskip\noindent\textbf{State Representation.} At each decision step $t$, the state $s_t$ captures the information from currently selected modalities. We compute the state by concatenating the encoded features weighted by their selection status, along with the mask itself:
\begin{equation}
s_t = g_\theta\left( [h^A \odot m^A \,;\, h^B \odot m^B \,;\, h^C \odot m^C \,;\, \mathbf{m}] \right)
\end{equation}
Here $\odot$ denotes multiplication of each encoded feature vector by its scalar mask entry (broadcast element-wise), which zeros out unselected modalities; $[\cdot \,;\, \cdot]$ denotes concatenation; and $g_\theta$ is a two-layer MLP that maps the $(3d + 3)$-dimensional input to a 64-dimensional state vector. This formulation allows the state encoder to distinguish between ``modality not yet selected'' (zeroed features, $m^i=0$) and ``modality selected but potentially uninformative'' (non-zero features, $m^i=1$).
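As an illustrative instance (assuming, hypothetically, that CT was chosen at Step 1), the mask becomes $\mathbf{m} = [1, 0, 0]^\top$ and the input to $g_\theta$ for the Step 2 decision reduces to
\[
[\, h^A \,;\, \mathbf{0}_d \,;\, \mathbf{0}_d \,;\, 1, 0, 0 \,] \in \mathbb{R}^{3d+3} = \mathbb{R}^{99} \qquad (d = 32),
\]
where $\mathbf{0}_d$ is the $d$-dimensional zero vector (notation introduced here for illustration), so the resulting state depends only on $h^A$ and the mask.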

\smallskip\noindent\textbf{Action Space.} The decision process unfolds over at most three steps, with the action space adapting based on previous selections:
\begin{itemize}
    \item \textit{Step 1}: The policy selects one modality from $\{A, B, C\}$ as the primary modality.
    \item \textit{Step 2}: The policy decides whether to stop with the current selection, or to add one of the two remaining modalities.
    \item \textit{Step 3}: If two modalities have been selected, the policy decides whether to stop or incorporate the third modality, and selects a fusion strategy from \{concatenation, mean, tensor\}.
\end{itemize}
This sequential formulation naturally captures the hierarchical nature of modality selection: the primary modality anchors the decision, and subsequent choices refine the combination based on the accumulated information.

\subsection{Policy Network Architecture}
\label{sec:policy_network}

The policy network builds upon the modality encoders, augmented with a state encoder and step-specific decision heads.

\smallskip
\noindent\textbf{State Encoder.} The state encoder $g_\theta$ is a two-layer MLP with ReLU activations that takes the concatenation of masked modality features and the selection mask as input, producing a 64-dimensional state vector for downstream policy decisions.

\smallskip
\noindent\textbf{Policy Heads.} Each decision step has dedicated heads that output action logits $\mathbf{l}_t$, from which we sample actions via $\pi(a_t | s_t) = \text{softmax}(\mathbf{l}_t / \tau)$. The temperature $\tau$ is annealed from $\tau_{\text{init}}$ to $\tau_{\text{final}}$ over training to transition from exploration to exploitation; during inference, we use greedy decoding.
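Written out per action (a standard expansion of the tempered softmax, where $\mathcal{A}_t$ denotes the set of valid actions at step $t$ and $l_{t,a}$ the logit of action $a$, both symbols introduced here for illustration), the sampling distribution and the greedy inference rule are
\[
\pi(a \mid s_t) = \frac{\exp(l_{t,a}/\tau)}{\sum_{a' \in \mathcal{A}_t} \exp(l_{t,a'}/\tau)}, \qquad a_t^{\text{greedy}} = \arg\max_{a \in \mathcal{A}_t} l_{t,a}.
\]
A large $\tau$ flattens $\pi$ toward the uniform distribution, whereas $\tau \to 0$ concentrates the probability mass on the largest logit, so greedy decoding corresponds to the zero-temperature limit.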

\smallskip
\noindent\textbf{Fusion Classifiers.} We maintain 15 fusion classifiers corresponding to all valid modality-fusion combinations (3 single-modality, 9 dual-modality for 3 pairs $\times$ 3 fusion types, and 3 triple-modality for 3 fusion types), and invoke the appropriate classifier based on the policy's selection.
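This count follows from enumerating the non-empty modality subsets, with single-modality models requiring no fusion operation:
\[
\underbrace{\binom{3}{1}}_{\text{single}} + \underbrace{\binom{3}{2} \times 3}_{\text{dual}} + \underbrace{\binom{3}{3} \times 3}_{\text{triple}} = 3 + 9 + 3 = 15.
\]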

\subsection{Learning Objective}
\label{sec:learning_objective}

\smallskip
\noindent
\textbf{Reward Design.} After the policy completes its sequential decisions and produces a prediction $\hat{p}$ for a patient with label $y$, we compute a reward signal based on two components:
\begin{equation}
r = r_{\text{BCE}} + \lambda_{\text{auc}} \cdot r_{\text{auc}}
\label{eq:reward}
\end{equation}

The first term $r_{\text{BCE}} = y \log \hat{p} + (1-y) \log(1-\hat{p})$ is the negative binary cross-entropy, which provides a continuous signal based on prediction confidence. The second term $r_{\text{auc}}$ is a mini-batch AUC reward: for each sample, we compute how well its prediction ranks relative to samples of the opposite class, normalized to $[-1, 1]$. Unlike a fixed-threshold indicator (e.g., $\hat{p} > 0.5$), this formulation is well-suited for imbalanced settings like lung cancer prediction where positive probabilities are typically low.
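One possible formalization of the per-sample AUC reward is sketched below; the set $\mathcal{O}_i$ and the treatment of ties are assumptions made for illustration rather than a specification of the exact implementation. Let $\mathcal{O}_i$ denote the (assumed non-empty) set of mini-batch samples whose label differs from $y_i$. Then
\[
r_{\text{auc}}^{(i)} = \frac{2}{|\mathcal{O}_i|} \sum_{j \in \mathcal{O}_i} \mathbf{1}\!\left[ (2y_i - 1)(\hat{p}_i - \hat{p}_j) > 0 \right] - 1 .
\]
Under this sketch, a positive sample is rewarded for scoring above each negative and a negative sample for scoring below each positive; a sample ranked correctly against every opposite-class sample receives $+1$, and one ranked incorrectly against all of them receives $-1$ (ties counted as incorrect).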

\smallskip
\noindent
\textbf{Policy Gradient Optimization.} We adopt REINFORCE~\cite{williams1992simple,hu2023reinforcement} over more complex algorithms (e.g., PPO~\cite{schulman2017proximal}, GRPO~\cite{shao2024deepseekmath}) due to the simplicity of our decision process: with at most three steps and a small discrete action space, the variance reduction from the batch-mean baseline and the auxiliary supervised loss provide sufficient stability. For a trajectory $\boldsymbol{\tau} = (a_1, a_2, \ldots)$ of sequential actions with log-probability $\log \pi_\theta(\boldsymbol{\tau}) = \sum_t \log \pi_\theta(a_t | s_t)$, the policy gradient loss is $\mathcal{L}_{\text{PG}} = -\log \pi_\theta(\boldsymbol{\tau}) \cdot (r - \bar{r})$, where $\bar{r}$ is the mean reward over the mini-batch. The total training objective combines this with an entropy regularization term to encourage exploration:
\begin{equation}
\mathcal{L} = \mathcal{L}_{\text{PG}} - \lambda_{\text{ent}} \sum_t H(\pi_\theta(\cdot | s_t)) + \lambda_{\text{sup}} \mathcal{L}_{\text{BCE}}
\label{eq:loss}
\end{equation}
where the final term is a supervised cross-entropy loss that directly optimizes the classifier, providing stable gradients independent of the stochastic policy. To accelerate convergence, we initialize the encoders and classifiers from pre-trained baseline models and use separate learning rates for different components.
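For reference, applying the standard REINFORCE identity to the definitions above, and treating the baseline-subtracted reward $(r - \bar{r})$ as a constant with respect to $\theta$ (an assumption of this sketch), gives the gradient of the policy term; the entropy is the usual discrete entropy over the valid action set $\mathcal{A}_t$ at step $t$ (a symbol introduced here for illustration):
\begin{align*}
\nabla_\theta \mathcal{L}_{\text{PG}} &= -(r - \bar{r}) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t), \\
H(\pi_\theta(\cdot \mid s_t)) &= -\sum_{a \in \mathcal{A}_t} \pi_\theta(a \mid s_t) \log \pi_\theta(a \mid s_t).
\end{align*}
Trajectories with above-average reward ($r > \bar{r}$) therefore have the log-probabilities of their actions increased, and those with below-average reward have them decreased.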

\section{Experiments}

We conduct comprehensive experiments to evaluate AdaFuse on the lung cancer risk prediction task. Section~\ref{sec:dataset} describes the dataset and feature extraction pipeline. Section~\ref{sec:baseline} introduces the baseline fusion strategies. Section~\ref{sec:comparison} compares AdaFuse against baseline methods. Section~\ref{sec:ablation} presents ablation studies on training configurations and learning objectives. Section~\ref{sec:external} presents external validation on an independent cohort.

\subsection{Dataset}
\label{sec:dataset}
We evaluate on the National Lung Screening Trial (NLST) dataset~\cite{national2011reduced}, a large-scale multi-center study of low-dose CT screening for lung cancer. To ensure fair evaluation, we use the held-out test set from Ardila et al.~\cite{ardila2019end} that was not seen during Sybil model~\cite{mikhael2023sybil} training, since our CT image features are extracted using the pre-trained Sybil encoder. The dataset contains 1,847 patients for training and 462 patients for testing, with lung cancer prevalence of 6.44\% and 6.06\% respectively. We use a binary classification task where the label indicates whether a participant was diagnosed with lung cancer at any point during the NLST follow-up period (up to 6 years), which aligns with the clinical objective of identifying high-risk individuals for continued surveillance.
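Taken at face value, and assuming prevalence is computed per patient, these percentages imply approximately
\[
0.0644 \times 1847 \approx 119 \quad \text{and} \quad 0.0606 \times 462 \approx 28
\]
positive cases in the training and test sets, respectively; these counts are back-computed from the reported prevalences rather than taken from a separate tabulation.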

\smallskip
\noindent\textbf{Feature Extraction.} For each patient, we extract three modalities: (1) \textit{CT image features} ($f^A \in \mathbb{R}^{512}$): extracted from the Sybil model, a state-of-the-art lung cancer risk prediction network trained on NLST; (2) \textit{Clinical variables} ($f^B \in \mathbb{R}^{17}$): risk factors from the PLCO$_{\text{m2012}}$ model~\cite{tammemagi2013selection} including age, smoking history, BMI, and family history; (3) \textit{Text embeddings} ($f^C \in \mathbb{R}^{768}$): we generate synthetic radiology reports from structured clinical variables covering occupational exposures (e.g., asbestos, chemical work, coal mining), medical history (e.g., diabetes, heart disease, hypertension), and secondhand smoke exposure, then extract embeddings using CORe~\cite{yang2020core}, a BERT-based model~\cite{lee2020biobert} pre-trained on chest radiograph reports.

\subsection{Baseline Fusion Strategies}
\label{sec:baseline}

\begin{wrapfigure}{r}{0.4\textwidth}
  \centering  \includegraphics[width=\linewidth]{img/fig_correlation.png}
  \caption{\textbf{Prediction correlation across baseline models.} Models containing CT features exhibit high mutual correlation, while text-only predictions show near-zero correlation with others.}
  \label{fig:correlation}
\end{wrapfigure}

We compare AdaFuse against 15 fixed fusion baselines covering all valid modality combinations, plus two adaptive baselines.

\smallskip
\noindent\textbf{Single-Modality Baselines.} We train three single-modality models ($A$, $B$, $C$) using only CT, clinical, or text features respectively.

\smallskip
\noindent\textbf{Operation-Based Fusion.} For multi-modality combinations, we evaluate three fusion operations: (1) \textit{Concatenation}: features are concatenated along the channel dimension; (2) \textit{Mean}: features are averaged element-wise after projection to a common dimension; (3) \textit{Tensor}: features are fused via Kronecker product following the Tensor Fusion Network formulation~\cite{zadeh2017tensor}. This yields 9 dual-modality models ($AB$, $AC$, $BC$ $\times$ 3 fusion types) and 3 triple-modality models ($ABC$ $\times$ 3 fusion types).
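For a pair of encoded features $h^A, h^B \in \mathbb{R}^{d}$, the three operations can be sketched as follows. This sketch assumes fusion acts on the encoded features and omits the constant-1 augmentation used in the original Tensor Fusion Network; the symbol $z$ is introduced here for illustration.
\begin{align*}
z_{\text{concat}} &= [h^A \,;\, h^B] \in \mathbb{R}^{2d}, \\
z_{\text{mean}} &= \tfrac{1}{2}\left(h^A + h^B\right) \in \mathbb{R}^{d}, \\
z_{\text{tensor}} &= h^A \otimes h^B \in \mathbb{R}^{d^2}.
\end{align*}
With $d = 32$, the fused representations would have 64, 32, and 1024 dimensions, respectively.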

\smallskip
\noindent\textbf{Mixture-of-Experts (MoE).} We implement an MoE baseline with 15 expert classifiers covering all valid modality-fusion combinations (3 single-modality + 9 dual-modality + 3 triple-modality). The gating network is an MLP with two hidden layers ($96 \rightarrow 64 \rightarrow 64 \rightarrow 15$) that takes the concatenated encoded features from all modalities as input and outputs soft weights over experts. Expert models are pre-trained on their respective modality combinations, then frozen during gating training. Unlike AdaFuse, which makes discrete sequential selection decisions, MoE processes all experts simultaneously and learns continuous soft weights.
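In symbols, a sketch of this baseline under assumed notation ($G$ for the gating MLP, $\hat{p}_k$ for the prediction of frozen expert $k$, and a softmax normalization of the gate outputs, all introduced here for illustration) is
\[
\mathbf{w} = \text{softmax}\!\left( G\!\left([h^A \,;\, h^B \,;\, h^C]\right) \right) \in \mathbb{R}^{15}, \qquad \hat{p}_{\text{MoE}} = \sum_{k=1}^{15} w_k \, \hat{p}_k .
\]
The gating input has $3d = 96$ dimensions, and because every $w_k$ can be non-zero, all 15 experts are evaluated for each patient.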

\smallskip
\noindent\textbf{Dynamic Multimodal Fusion (DynMM).} We implement DynMM~\cite{xue2023dynamic}, which uses Gumbel-Softmax gating to make parallel modality selection decisions. Unlike AdaFuse's sequential formulation, DynMM commits to all selection decisions simultaneously based on the initial input features.
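For reference, the Gumbel-Softmax relaxation draws a differentiable approximate sample from a categorical distribution with logits $\alpha_1, \ldots, \alpha_K$ and temperature $\tau_g$ (generic notation introduced here, not taken from the DynMM implementation):
\[
y_k = \frac{\exp\!\left((\alpha_k + g_k)/\tau_g\right)}{\sum_{j=1}^{K} \exp\!\left((\alpha_j + g_j)/\tau_g\right)}, \qquad g_k = -\log(-\log u_k), \quad u_k \sim \text{Uniform}(0, 1).
\]
As $\tau_g \to 0$, the vector $y$ approaches a one-hot selection, while gradients can still flow through the relaxed sample during training.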

\subsection{Comparison with Baselines}
\label{sec:comparison}


\begin{table}[h]
\centering
\scriptsize
\caption{\textbf{Test AUC and computational cost comparison on NLST.} We compare AdaFuse against single-modality baselines ($A$: CT image, $B$: clinical variables, $C$: text reports), operation-based fusion baselines with three fusion strategies (concatenation, mean pooling, tensor fusion), and adaptive baselines including MoE and DynMM. Background colors indicate modality count: \colorbox{bestsingle!40}{single-modality}, \colorbox{bestdual!40}{dual-modality}, and \colorbox{besttriple!40}{triple-modality}. MFLOPs denotes million floating-point operations. The best AUC within each category and the overall best are shown in \textbf{bold}.}
\label{tab:comparison}
\begin{tabular}{p{0.3\linewidth} P{0.15\linewidth} P{0.2\linewidth}}
\toprule
Method & AUC & MFLOPs \\
\midrule
$A$ (CT) & \cellcolor{bestsingle!40}\textbf{0.732} & \cellcolor{bestsingle!40}0.543 \\
$B$ (Clinical) & \cellcolor{bestsingle!40}0.662 & \cellcolor{bestsingle!40}0.017 \\
$C$ (Text) & \cellcolor{bestsingle!40}0.576 & \cellcolor{bestsingle!40}1.067 \\
\midrule
$AB$-concat & \cellcolor{bestdual!40}\textbf{0.758} & \cellcolor{bestdual!40}0.559 \\
$AB$-mean & \cellcolor{bestdual!40}0.755 & \cellcolor{bestdual!40}0.557 \\
$AB$-tensor & \cellcolor{bestdual!40}0.735 & \cellcolor{bestdual!40}0.433 \\
$AC$-concat & \cellcolor{bestdual!40}0.733 & \cellcolor{bestdual!40}1.610 \\
$AC$-mean & \cellcolor{bestdual!40}0.745 & \cellcolor{bestdual!40}1.608 \\
$AC$-tensor & \cellcolor{bestdual!40}0.739 & \cellcolor{bestdual!40}1.477 \\
$BC$-concat & \cellcolor{bestdual!40}0.661 & \cellcolor{bestdual!40}1.084 \\
$BC$-mean & \cellcolor{bestdual!40}0.678 & \cellcolor{bestdual!40}1.082 \\
$BC$-tensor & \cellcolor{bestdual!40}0.685 & \cellcolor{bestdual!40}1.088 \\
\midrule
$ABC$-concat & \cellcolor{besttriple!40}0.735 & \cellcolor{besttriple!40}1.626\\
$ABC$-mean & \cellcolor{besttriple!40}0.748 & \cellcolor{besttriple!40}1.622 \\
$ABC$-tensor & \cellcolor{besttriple!40}\textbf{0.759} & \cellcolor{besttriple!40}1.790 \\
MoE & \cellcolor{besttriple!40}0.742 & \cellcolor{besttriple!40}3.492 \\
DynMM & \cellcolor{besttriple!40}0.754 & \cellcolor{besttriple!40}1.635\\
\midrule
\textbf{AdaFuse (Ours)} & \textbf{0.762} & 1.164 \\
\bottomrule
\end{tabular}
\end{table}



Table~\ref{tab:comparison} presents the test AUC of all methods. AdaFuse achieves the highest AUC (0.762), compared to the best fixed fusion baseline $ABC$-tensor (0.759) and adaptive baselines including DynMM (0.754) and MoE (0.742). Several observations emerge from these results.


\smallskip
\noindent\textbf{CT features dominate prediction.} The single-modality CT model ($A$, 0.732) is already competitive: it surpasses all three $BC$ fusion models (0.661--0.685) and comes within 0.003 of $AC$-concat (0.733), $AB$-tensor (0.735), and $ABC$-concat (0.735). This aligns with the fact that CT image features are extracted from Sybil, a model specifically trained for lung cancer risk prediction.


\smallskip
\noindent\textbf{Text reports provide limited information.} The text-only model ($C$, 0.576) performs near random, and combinations involving text without CT ($BC$, 0.661--0.685) yield lower AUC than every fixed model that includes CT. This is expected since our text reports are synthetically generated from structured clinical variables, which limits their informativeness compared to real radiology reports.


\smallskip
\noindent\textbf{Naive fusion can hurt performance.} Adding modalities does not guarantee improvement. For instance, $ABC$-concat (0.735) achieves lower AUC than $AB$-concat (0.758), suggesting that indiscriminately incorporating all modalities can introduce noise rather than complementary information.


\smallskip
\noindent\textbf{Sequential selection benefits from conditioning on observed modalities.} AdaFuse (0.762) achieves higher AUC than both DynMM (0.754), which uses parallel Gumbel-Softmax gating, and MoE (0.742), which learns soft weights over experts. We chose sequential over parallel selection because it reflects how physicians interpret initial test results before deciding whether to order additional diagnostics. The sequential formulation allows each decision to be conditioned on previously observed modalities, whereas parallel approaches must commit to all selection decisions simultaneously based solely on the initial input features.

\smallskip
\noindent\textbf{Adaptive selection reduces computational cost.} Among adaptive methods, AdaFuse (1.164 MFLOPs) uses 29\% fewer FLOPs than DynMM (1.635 MFLOPs) and 67\% fewer FLOPs than MoE (3.492 MFLOPs). AdaFuse also uses fewer FLOPs than all triple-modality fixed fusion methods (1.622--1.790 MFLOPs). By learning to skip uninformative modalities for individual patients, AdaFuse achieves the highest AUC while maintaining lower computational cost.
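These relative savings follow directly from the MFLOPs column of Table~\ref{tab:comparison}:
\[
1 - \frac{1.164}{1.635} \approx 0.288, \qquad 1 - \frac{1.164}{3.492} \approx 0.667 .
\]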

\smallskip
\noindent\textbf{AdaFuse learns to filter uninformative modalities.} Figure~\ref{fig:correlation} shows the prediction correlation matrix across baseline models. We observe high correlation among models that include CT features, while text-only predictions show near-zero correlation with others. AdaFuse learns to leverage this structure: it predominantly selects CT-based combinations while adaptively incorporating clinical variables when beneficial, effectively filtering out the less informative text modality for most patients. Detailed analysis of the learned policy behavior is provided in Appendix Figure~\ref{fig:policy_analysis}.




\begin{figure}[h]
    \centering
    \includegraphics[width=0.9\textwidth]{img/fig_ablation.png}
    \caption{\textbf{Ablation study on training configurations.} From left to right: (1) freezing classifiers while training modality encoders; (2) freezing modality encoders while training classifiers; (3) training both components; (4) freezing both components. The flame icon indicates which components receive gradients during RL training. Quantitative results are provided in Table~\ref{tab:ablation_training}.}
    \label{fig:ablation_training}
\end{figure}
\subsection{Ablation Studies}
\label{sec:ablation}

We conduct ablation studies to analyze the contribution of each component in AdaFuse. Section~\ref{sec:ablation_training} investigates training configurations, and Section~\ref{sec:ablation_objective} examines the learning objective design.

\subsubsection{Training Configuration}
\label{sec:ablation_training}

Figure~\ref{fig:ablation_training} illustrates the four training configurations, and Table~\ref{tab:ablation_training} summarizes the results for freezing versus training encoders and classifiers during RL training.

\begin{table}[h]
\centering
\caption{\textbf{Ablation study on training configurations.} $\Delta$ denotes relative change compared to the best configuration.}
\label{tab:ablation_training}
\scriptsize
\begin{tabular}{P{0.2\linewidth} P{0.2\linewidth} P{0.2\linewidth} P{0.2\linewidth}}
\toprule
Encoder & Classifier & Test AUC & $\Delta$ \\
\midrule
Train & Freeze & \textbf{0.762} & -- \\
Freeze & Train & 0.722 & $-5.25\%$ \\
Train & Train & 0.691 & $-9.32\%$ \\
Freeze & Freeze & 0.674 & $-11.55\%$ \\
\bottomrule
\end{tabular}
\end{table}

\smallskip\noindent\textbf{Freezing classifiers yields the best performance.} As shown in Table~\ref{tab:ablation_training}, training the classifiers while freezing the encoders lowers AUC from 0.762 to 0.722 ($-5.25\%$), and jointly training both components lowers it further to 0.691 ($-9.32\%$). This is because unfrozen classifiers only receive gradients when selected, causing undertrained combinations to produce unreliable rewards that further discourage their selection. Freezing pretrained classifiers provides stable reward signals and allows the policy to focus on selection without shifting decision boundaries.

\smallskip\noindent\textbf{Training encoders is essential for policy learning.} The freeze-both configuration yields the worst performance (0.674, $-11.55\%$); with the classifiers frozen, training the encoders raises AUC from 0.674 to 0.762, indicating that encoder adaptation is necessary. Unlike classifiers, which provide fixed decision boundaries, encoders must learn representations that help the policy distinguish when each modality combination is beneficial for a given patient.

\subsubsection{Learning Objective}
\label{sec:ablation_objective}

We ablate the reward function (Eq.~\ref{eq:reward}) and loss function (Eq.~\ref{eq:loss}) to understand the contribution of each component. Table~\ref{tab:ablation_objective} presents the results.

\begin{table}[h]
\centering
\caption{\textbf{Ablation study on learning objective.} We separately vary the loss composition (top) and reward design (bottom) while keeping the other fixed. $\mathcal{L}_{\text{PG}}$: policy gradient loss, $\mathcal{H}$: entropy regularization, $\mathcal{L}_{\text{sup}}$: supervised cross-entropy loss. $\Delta$ denotes relative change compared to the best configuration.}
\label{tab:ablation_objective}
\scriptsize
\begin{tabular}{p{0.12\linewidth} p{0.45\linewidth} P{0.12\linewidth} P{0.1\linewidth}}
\toprule
 & Configuration & Test AUC & $\Delta$ \\
\midrule
\textbf{Best} & $\mathcal{L}_{\text{PG}} + 0.1\mathcal{H} + 0.3\mathcal{L}_{\text{sup}}$, \; $r = 0.7 r_{\text{BCE}} + 0.3 r_{\text{AUC}}$ & \textbf{0.762} & -- \\
\midrule
\multirow{4}{*}{Same Reward} 
 & $\mathcal{L}_{\text{PG}} + 0.1\mathcal{H}$ (no supervision) & 0.696 & $-8.7\%$ \\
 & $\mathcal{L}_{\text{PG}} + 0.3\mathcal{L}_{\text{sup}}$ (no entropy) & 0.610 & $-20.0\%$ \\
 & $\mathcal{L}_{\text{PG}}$ (policy gradient only) & 0.689 & $-9.6\%$ \\
 & $\mathcal{L}_{\text{PG}} + 1.0\mathcal{L}_{\text{sup}}$ (over-weighted supervision) & 0.641 & $-15.9\%$ \\
\midrule
\multirow{2}{*}{Same Loss} 
 & $r = r_{\text{AUC}}$ (AUC reward only) & 0.647 & $-15.1\%$ \\
 & $r = r_{\text{BCE}}$ (BCE reward only) & 0.637 & $-16.4\%$ \\
\bottomrule
\end{tabular}
\end{table}

\smallskip\noindent\textbf{Both entropy and supervision are necessary for stable training.} Under the same reward, removing entropy regularization causes the largest drop, from 0.762 to 0.610 ($-20.0\%$), as the policy converges prematurely to a narrow set of modality combinations before exploring alternatives. Removing supervision lowers AUC to 0.696 ($-8.7\%$), since the supervised loss provides stable gradients that do not depend on stochastic action sampling. Over-weighting supervision ($\lambda_{\text{sup}} = 1.0$), however, lowers AUC to 0.641 ($-15.9\%$) by diminishing the influence of the reward signal.
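
As an illustrative sketch of why the entropy term counteracts premature convergence, assume $\mathcal{H}$ is the Shannon entropy of a categorical policy $\pi(\cdot \mid x)$ over $K$ candidate modality combinations for patient $x$ (the symbols $\pi$, $a$, $x$, and $K$ are introduced here for illustration and may differ from the notation of Eq.~\ref{eq:loss}):
\[
\mathcal{H}\bigl(\pi(\cdot \mid x)\bigr) = -\sum_{a=1}^{K} \pi(a \mid x) \log \pi(a \mid x), \qquad 0 \le \mathcal{H} \le \log K .
\]
Under this assumption, the upper bound is attained by the uniform policy and the lower bound by a deterministic one, so rewarding higher entropy keeps probability mass on several combinations while their rewards are still being estimated.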

\smallskip\noindent\textbf{Mixed reward outperforms single-objective alternatives.} Under the same loss, using either BCE or AUC reward alone degrades AUC to 0.637 and 0.647, respectively. BCE provides dense per-sample feedback but can be misleading in imbalanced settings. AUC rewards correct ranking and is essential when positive prevalence is only 6\%, but provides sparser feedback. The combination leverages the stability of BCE and the ranking awareness of AUC.
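
A small worked example, which is hypothetical and not taken from our experiments, illustrates how BCE alone can mislead under class imbalance. Consider a predictor that outputs the constant probability $p = 0.06$ for every patient in a cohort with 6\% positive prevalence. Its expected BCE (in nats) is
\[
-\bigl(0.06 \ln 0.06 + 0.94 \ln 0.94\bigr) \approx 0.227,
\]
well below the $\ln 2 \approx 0.693$ of an uninformed $p = 0.5$ predictor, even though a constant score carries no ranking information and has an AUC of 0.5 (counting ties as one half). A ranking-based term such as AUC exposes exactly this failure mode.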

\subsection{External Validation}
\label{sec:external}
To assess generalizability, we evaluate AdaFuse on the Vanderbilt Lung Screening Program (VLSP) dataset, a private external cohort with 858 patients (2.8\% positive rate). VLSP contains only two modalities (CT images and clinical variables) as it lacks the comprehensive variables required to generate synthetic text reports. Table~\ref{tab:vlsp} presents the results.

\begin{table}[h]
\centering
\scriptsize
\caption{\textbf{External validation on VLSP dataset.} VLSP is an independent cohort from the Vanderbilt Lung Screening Program with 858 patients. Only modalities A (CT) and B (clinical) are available.}
\label{tab:vlsp}

\begin{tabular}{p{0.4\linewidth} P{0.25\linewidth}}
\toprule
Method & VLSP AUC \\
\midrule
$A$ (CT) & 0.771 \\
$B$ (Clinical) & 0.471 \\
$AB$-concat & 0.725 \\
$AB$-mean & 0.706 \\
$AB$-tensor & 0.590 \\
\midrule
AdaFuse & \textbf{0.749} \\
\bottomrule
\end{tabular}

\end{table}

CT features generalize well to VLSP (0.771 AUC), whereas clinical features alone fall below chance (0.471 AUC). This hurts all fixed dual-modality fusion methods, which are forced to incorporate clinical information regardless of its quality. AdaFuse achieves 0.749 AUC, higher than all dual-modality fusion baselines (0.590 to 0.725) though below CT alone (0.771), suggesting that it can adaptively filter less informative modalities under distribution shift.


\section{Conclusion}
\label{sec:conclusion}

We presented AdaFuse, an adaptive multimodal fusion framework that addresses a fundamental question in multimodal learning: for a given patient, should certain modalities be used at all? By formulating modality selection as a sequential decision process, AdaFuse learns patient-specific fusion strategies with the flexibility to entirely exclude uninformative modalities rather than processing all inputs uniformly.

Our experiments on NLST reveal that the learned policy predominantly selects CT-based combinations while adaptively incorporating clinical variables, effectively filtering out the less informative text modality for most patients. This adaptive behavior outperforms fixed fusion strategies while demonstrating that uniform fusion is not always optimal. Ablation studies further show that freezing pretrained classifiers is essential for stable policy learning by providing reliable reward signals.

\smallskip\noindent\textbf{Potential Limitations.} Our evaluation is limited to three modalities on a single dataset, where text is synthetically generated. Future work could extend AdaFuse to real clinical text and more modalities, and investigate when multimodal integration provides the most benefit across patient subgroups.



\section*{Acknowledgements}
\begin{flushleft}
This research was supported by the National Institutes of Health (NIH) through grants F30CA275020, 2U01CA152662, R01CA253923 (Landman \& Maldonado), R01CA275015 (Maldonado \& Lenburg), U01CA152662 (Grogan), U01CA196405 (Maldonado), and P30CA068485-29S1, as well as the National Science Foundation (NSF) through CAREER 1452485 and grant 2040462. Additional support was provided by the Vanderbilt Institute for Surgery and Engineering through T32EB021937-07, the Vanderbilt Institute for Clinical and Translational Research via UL1TR002243-06, the Pierre Massion Directorship in Pulmonary Medicine, and the American College of Radiology Fund for Collaborative Research in Imaging (FCRI) Grant. This manuscript was polished using AI-assisted editing (ChatGPT) with a ``rephrase'' prompt; no scientific content was generated or altered.
\end{flushleft}


\bibliography{midl26_223}
\input{appendix}
\end{document}



