\section{Related Work}
\noindent \textbf{Handling missing modalities.} Missing modalities pose a persistent challenge in Multimodal Learning, particularly in medical imaging, due to privacy concerns or impractical data acquisition~\cite{liang2021multibench,azad2022medical}. Various strategies have been explored to address this challenge. \textit{Knowledge Distillation} \cite{hu2020knowledge,mallya2022deep,wang2023learnable} involves learning from a teacher network trained on complete modality data. \textit{Generative modeling} aims to impute missing inputs by generating synthetic data \cite{yoon2018GAIN,sharma2019missing}. Both approaches rely on complete modality at training, which can be insufficient for robust training. Another line of work is \emph{common space modeling}, which learns a shared latent space from partially available modalities \cite{ma2021smil,konwer2023enhancing,wang2023multi}. SMIL~\cite{ma2021smil} perturbs the latent feature space to approximate the embedding of missing modalities but is limited to bi-modal datasets, limiting its generalizability.
%
ShaSpec \cite{wang2023multi} addresses more than two modalities by learning shared and specific features, but its use of a shared encoder complicates generalization to heterogeneous modalities of different dimensions. 
%
Additionally, shared latent space modeling may introduce biases toward the most available modalities~\cite{konwer2023enhancing}.
%
Simultaneously, cross-modal contrastive learning has shown impressive results~\cite{zhang2022contrastive, milecki2023medimp, li2023frozen} by aligning multimodal data in a joint embedding space.
%
Recently, ImageBind~\cite{girdhar2023imagebind} aligned six modalities by relying on image-paired data and emphasized that aligning all pair combinations is unnecessary to bind more than two modalities together. 
Our proposed ADAPT advances this by training unimodal encoders solely with supervision from one modality, aligning them in a joint embedding space. It ensures that every modality contributes to the final representation, even if it is severely missing during training.

\paragraph{Multimodal transformer.} Transformers~\cite{vaswani2017attention} have become the de facto approach for multimodal tasks~\cite{recasens2023zorro,srivastava2024omnivec}. They rely on the attention mechanism to model long-range dependencies with the flexibility to account for incomplete samples. \citet{milecki2022contrastive,ma2022multimodal} efficiently handle missing data in sequences and bimodal datasets through masked attention.
ADAPT extends this to more than two modalities by leveraging attention to fuse them and exploring their inter- and intra-modal correlations while masking missing ones.
We also perform a systematic study of missing modalities during training and testing, showing the versatility and potential of ADAPT for real-life scenarios.