\section{Introduction}
% 1st paragraph
Monitoring physiological responses to external stimuli is crucial for assessing individuals' well-being, particularly in contexts with medical and safety implications.
%
Examples include stress, a response to emotional, mental, and physical challenges~\cite{schneiderman2005stress} that can trigger or aggravate various pathological conditions~\cite{dimsdale2008psychological}. In high-performance environments, exposure to $g$-forces in aircraft can lead to alterations in consciousness~\cite{morrissette2000further}. Similarly, drowsiness while driving is a physiological response with critical safety implications, contributing to road accidents and fatalities~\cite{stewart2023overview}.
%
Various sensors report physiological changes that may be detected visually (videos), acoustically (audio), or from biomedical signals (e.g., electrocardiograms). 
%
However, some modalities may be missing during training and testing.
Therefore, developing methods that handle missing modalities during both stages while balancing the modalities' contributions is crucial to ensure robustness, notably when modalities with strong unimodal performance are frequently missing.

Various methods address the challenge of missing modalities, each with notable limitations, including 
%
(1) bias towards the most available modalities, leading to sub-optimal performance~\cite{konwer2023enhancing},
%
(2) dependence on complete modalities during training~\cite{mallya2022deep,chen2021learning},
%
(3) limited generalizability to more than two modalities~\cite{ma2021smil,ma2022multimodal}, and 
%
(4) reliance on a shared encoder tailored to inputs of the same dimensions, which complicates extension to heterogeneous modalities such as imaging and biomedical signals~\cite{konwer2023enhancing}.

To address the above issues, we introduce the \textbf{A}nchore\textbf{D} multimod\textbf{A}l \textbf{P}hysiological \textbf{T}ransformer (ADAPT), which is designed to operate effectively under missing modalities during both training and inference, enabling robust real-life applicability. ADAPT consists of two key components.
%
First, we embed all modalities in the same feature space. Instead of optimizing one loss per modality pair, which would make training time grow quadratically with the number of modalities, we align each modality to a single frozen modality, called the \emph{anchor}.
This allows learning a joint embedding space that scales linearly with the number of modalities while balancing each modality's contribution. We call this step `anchoring'.
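To make the scaling argument concrete, suppose there are $M$ modalities (the symbol $M$ is introduced here purely for illustration). Aligning every unordered pair of modalities requires one loss term per pair, whereas aligning each non-anchor modality to the anchor requires one term per non-anchor modality:
\[
\underbrace{\frac{M(M-1)}{2}}_{\mbox{pairwise alignment}}
\qquad \mbox{versus} \qquad
\underbrace{M-1}_{\mbox{anchoring}} .
\]
For instance, with $M=4$ modalities, pairwise alignment involves $6$ loss terms, while anchoring involves $3$.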
%
Second, ADAPT comprises a Masked Multimodal Transformer that leverages inter- and intra-modality correlations to concatenate features from different modalities into a unified representation. Additionally, we leverage the masked attention mechanism of transformers~\cite{vaswani2017attention} to ensure flexibility in handling missing modalities, similarly to~\citet{ma2022multimodal,milecki2022contrastive}. When a modality is unavailable, its corresponding feature representation is masked.
The transformer is trained using two objectives: self-supervised learning and the objective of the downstream task. 
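As an illustration of how such masking can be realized, the following is the standard additive-mask form of scaled dot-product attention; it is given as a sketch under our own assumptions, and the exact masking used in ADAPT may differ. The symbols $Q$, $K$, and $V$ (queries, keys, and values), $d$ (key dimension), and $B$ (mask matrix) are notation introduced here:
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + B\right) V,
\qquad
B_{ij} = \left\{
\begin{array}{ll}
0 & \mbox{if token $j$ belongs to an available modality,} \\
-\infty & \mbox{otherwise.}
\end{array}
\right.
\]
Under this formulation, tokens of a missing modality receive zero attention weight, provided that at least one token remains unmasked, so the same network can process any subset of the modalities.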

ADAPT is applied to the challenging task of detecting physiological changes using multimodal medical data with missing modalities during training and inference. Specifically, we focus on detecting alterations in pilots' consciousness induced by $g$-forces in fighter jets and stress triggered in individuals by specific stimuli~\cite{chaptoukaev2023stressid}. 
We show that ADAPT outperforms the previous state of the art on both tasks and datasets while handling missing modalities. Extensive experiments demonstrate its robustness against missing modalities across various scenarios, highlighting its effectiveness for real-life applications. 

Our contributions are: (i) ADAPT, a modular framework that aligns multimodal representations to a common rich feature space; (ii) a modality-fusion strategy that handles missing modalities at both training and inference time; and (iii) new state-of-the-art results on two tasks and datasets, together with extensive evaluations highlighting ADAPT's superiority.