\section{Experiments \& Results}
\subsection{Experiment Setting}
%%%%%%%%%%%% Dataset %%%%%%%%%%%%

\noindent \textbf{Datasets.} \texttt{\textbf{StressID}}~\cite{chaptoukaev2023stressid} targets stress identification and contains physiological signals (electrocardiogram (ECG), electrodermal activity, and respiration) together with audio and video recordings. We use the training, validation, and test splits provided by~\citet{chaptoukaev2023stressid}.\\
\noindent \texttt{\textbf{LOC}}. We present the Loss Of Consciousness dataset, collected during aeromedical training of flight personnel by the French Ministry of Armed Forces (Appendix~\ref{appendix:loc}). It includes videos and biomedical sensor data: ECG, quadriceps electromyograms, acoustic breathing, pedal pressure, and self-reported visual field.
%
It comprises 1666 launches from 416 subjects, split into train, validation, and test sets in a 6:2:2 ratio based on patient ID. We employ 5-fold cross-validation and report the average. Each launch is labeled for consciousness alteration. The dataset is highly imbalanced, with a class ratio of 1:50.
%
In real-life conditions, video is impractical because of the pilots' equipment (helmets \& $\mathrm{O}_2$ masks), even though it is the primary modality used by doctors to monitor launches during aeromedical training (see Appendix~\ref{appendix:loc}).

\noindent \textbf{Missing modalities.} In \texttt{StressID}, 18\% of video and 46\% of audio recordings are missing. For \texttt{LOC}, videos are absent in 90\% of observations. We denote by $X_{\text{train}}$ and $X_{\text{test}}$ the entire training and test sets (samples with and without missing modalities), and by $X_{\text{train}}^*$ and $X_{\text{test}}^*$ the subsets of the training and test sets where all modalities are available.
%

\paragraph{Metrics.} We use balanced accuracy (ACC) and weighted F1 score (F1). For \texttt{LOC}, to ensure robustness to class imbalance~\cite{huang_2016,luque2019impact}, we also report the true positive rate (TPR), since a high TPR means that few fainting events are missed, and the true negative rate (TNR) for completeness. We report metrics in the format \texttt{mean(std)} in \%.
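
For reference, these metrics can be written out under the standard binary definitions, with the positive class denoting stress or loss of consciousness. The notation below ($\mathrm{TP}$, $\mathrm{FN}$, $\mathrm{TN}$, $\mathrm{FP}$ for the confusion-matrix counts, $n_c$ for the number of test samples in class $c$, $N$ for the total number of test samples) is introduced here for illustration only, and the weighted F1 is assumed to follow the usual support-weighted convention:
\[
\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad
\mathrm{TNR} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}, \qquad
\mathrm{ACC} = \frac{\mathrm{TPR} + \mathrm{TNR}}{2},
\]
\[
\mathrm{F1} = \sum_{c} \frac{n_c}{N}\, \mathrm{F1}_c, \qquad
\mathrm{F1}_c = \frac{2\, P_c R_c}{P_c + R_c},
\]
where $P_c$ and $R_c$ are the precision and recall of class $c$. As a worked example of why balanced accuracy is preferred here: with a 1:50 class ratio, a classifier that predicts every sample as negative obtains $\mathrm{TNR} = 100\%$ and $\mathrm{TPR} = 0\%$, hence $\mathrm{ACC} = 50\%$, whereas its plain accuracy would be $50/51 \approx 98\%$.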

\paragraph{Implementation details.}
The Anchoring and Masked Multimodal Transformer are trained on $X_{\text{train}}^*$ and $X_{\text{train}}$, respectively. A linear classifier is trained on the [CLS] token for the final task. We train for 70 epochs using the AdamW optimizer with a starting learning rate of $10^{-4}$, a linear warm-up of 4 epochs, and a cosine schedule.
Given the difference in dataset size, we set the batch size to 256 for \texttt{LOC} and 128 for \texttt{StressID}. To address the class imbalance in \texttt{LOC}, we use the Balanced Cross Entropy loss~\cite{huang_2016} (see Appendix~\ref{appendix:implementations} for details).
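
As an illustration of the learning-rate schedule, one common form of linear warm-up followed by cosine decay is sketched below. The per-epoch granularity, the decay to zero at the end of training, the reading of $10^{-4}$ as the rate reached after warm-up, and the symbols $\eta_{\max}$, $T_w$, $T$, and $t$ are all assumptions made here for illustration; the exact variant used may differ.
\[
\eta(t) =
\begin{cases}
\eta_{\max}\, \dfrac{t}{T_w}, & 0 \le t < T_w, \\[2mm]
\dfrac{\eta_{\max}}{2} \left( 1 + \cos\left( \pi\, \dfrac{t - T_w}{T - T_w} \right) \right), & T_w \le t \le T,
\end{cases}
\]
with $\eta_{\max} = 10^{-4}$, $T_w = 4$ warm-up epochs, and $T = 70$ total epochs. Under these assumptions the two branches agree at $t = T_w$, where $\eta = \eta_{\max}$, and the rate reaches $0$ at $t = T$.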

%%%%%%%%%%%% Results %%%%%%%%%%%%
\subsection{Results}

%%%%%%% SOTA %%%%%%%
\noindent \textbf{Comparison to the state of the art} 
(Table~\ref{tab:baseline}) in the presence of all modalities. 
%
We compare ADAPT against unimodal baselines for video, audio, and concatenated biomedical signals (rows 1, 2 \& 3), `feature fusion' and `decision-level fusion' (rows 4 \& 5)~\cite{chaptoukaev2023stressid}, and ShaSpec+~\cite{wang2023multi} (row 6); see Appendices~\ref{appendix:sota} and~\ref{appendix:additional-results} for details.

For \texttt{StressID}, we observe that ADAPT outperforms all methods from~\citet{chaptoukaev2023stressid} by a notable margin; for instance, it outperforms `decision-level fusion' by 4\% in ACC and 6\% in F1. Additionally, it remains highly competitive with ShaSpec+. 

For \texttt{LOC}, using only video (row 1) results in the best performance; however, video is unavailable in real-life scenarios. Instead, ADAPT handles missing modalities by leveraging representations from all modalities during training. 
Surprisingly, `feature fusion', `decision-level fusion', and ShaSpec+ lead to unbalanced metrics: they reach a high TNR while severely sacrificing TPR (29.5\%, 20.4\%, and 7.3\%, respectively), showing that they predict most samples as negative.
This reveals their unsuitability for real-life cases with highly imbalanced classes, where both TPR and TNR matter. Note that in our target scenarios TPR is the more important of the two, as it is critical to detect pilots losing consciousness. By contrast, ADAPT reaches a TPR of 69.5\% (at least +40\% over the fusion methods) while maintaining a balanced TNR of 65.3\%. This is further shown in Figure~\ref{fig:tnr-tpr}, where ADAPT (blue crosses) strikes the best balance between TPR and TNR compared to the other methods (red crosses).
This indicates that ADAPT is not misled by the high class imbalance.

\input{tables/01-sota-bis}

%%%%%%%%%%%% Missing Modalities %%%%%%%%%%%%
\paragraph{Robustness to missing modalities} (Table~\ref{tab:scenarios}). We first report baseline results (row 1) on the default test set $X_{\text{test}}$, i.e., no modality removed in \texttt{StressID} and 90\% of videos missing for \texttt{LOC}. 
Then, we completely remove one or two modalities from $X_{\text{test}}$ and report the difference ($\Delta$) with respect to the results obtained on the unmodified $X_{\text{test}}$.
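
One way to make this comparison explicit, assuming $\Delta$ is a signed difference in percentage points, is sketched below. The symbols $m$ (a reported metric), $s$ (a removal scenario), $X_{\text{test}}^{(s)}$ (the test set with the modalities of scenario $s$ removed), and $\mathcal{M}$ (the set of reported metrics) are notation introduced here for illustration, and averaging over metrics is an assumption about how the average $|\Delta|$ is obtained:
\[
\Delta_m(s) = m\big(X_{\text{test}}^{(s)}\big) - m\big(X_{\text{test}}\big), \qquad
\overline{|\Delta|}(s) = \frac{1}{|\mathcal{M}|} \sum_{m \in \mathcal{M}} \big|\Delta_m(s)\big| .
\]
Under this convention, a negative $\Delta_m(s)$ indicates that removing the modalities of scenario $s$ degrades metric $m$.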

\begin{wrapfigure}{r}{0.36\textwidth}
  \centering
  \vspace{-2mm}
  \includegraphics[width=0.35\textwidth]{figures/output-13.png}
  \captionsetup{font=footnotesize}
    \caption{\textbf{TPR vs TNR  for \texttt{LOC}.}
    \textsuperscript{\textdagger}Methods from~\cite{chaptoukaev2023stressid}.}
    \label{fig:tnr-tpr}
\end{wrapfigure}

For \texttt{LOC}, ADAPT is robust in all scenarios, with $|\Delta| {<} 8\%$ and an average $|\Delta| {=} 2.6\%$ compared to the baseline. This is further shown in Figure~\ref{fig:tnr-tpr}, where the balance between TNR and TPR (blue circles) remains consistent across all scenarios.
Interestingly, for \textit{no-video}, even though video-only provides strong unimodal performance (Table~\ref{tab:unimodal-loc}), ADAPT maintains high performance, indicating that it can align representations in the video (anchor) space.
Furthermore, for the \textit{real-life} scenario, where we remove both video and visual field (row 2), the results remain competitive with an average $|\Delta| {=} 2.72\%$, even though these two modalities individually perform the best (Table~\ref{tab:unimodal-loc}). Additionally, \textit{no-audio} yields consistent results, keeping the TNR and TPR balanced (see Appendix~\ref{appendix:additional-results}).

For \texttt{StressID}, we remove audio and/or video, the most cumbersome modalities to acquire, and examine the \textit{no-audio}, \textit{no-video}, and \textit{real-life} (i.e., no audio, no video) scenarios.
The variation remains consistent for both \textit{no-audio} and \textit{no-video}: $|\Delta| {<} 8.3\%$. However, it is larger for \textit{real-life}, with a significant drop in TPR for an equivalent TNR, which is expected since we remove the richest modalities.

Overall, even when modalities are removed, ADAPT detects stress or loss of consciousness with more than 60\% ACC and more than 50\% TPR, highlighting its ability to handle missing modalities, in contrast to the other methods, which cannot address this setting.

%%%%%%%%%% Ablations %%%%%%%%%%
\input{tables/02-ablations}
\paragraph{Ablations.}
\textbf{1. Impact of the anchoring before fusion and choice of anchor} (Table~\ref{tab:variant-adapt}). Anchoring with video is clearly beneficial, particularly on \texttt{LOC}, where ACC increases by 11.6\% while F1 scores remain consistent. Similarly, for \texttt{StressID}, anchoring improves both ACC and F1 by 3.7\%. Any modality may serve as the anchor; we explore using audio (row 3), but it leads to suboptimal performance. Overall, the anchor is selected for its robust unimodal performance, which remains effective despite a high rate of missing modalities. \\
\textbf{2. Impact of feature configurations and fusion methods} (Table~\ref{tab:stress-adapt-sota}). 
%Additionally, in Table~\ref{tab:stress-adapt-sota}, we explore the impact of various feature configurations and fusion methods as proposed in~\citet{chaptoukaev2023stressid} on \texttt{StressID}. 
Compared to `feature fusion' and `decision-level fusion' (rows 1 and 2, from~\citet{chaptoukaev2023stressid}), our features and fusion method (last row) significantly increase ACC and F1 by 5.7\% and 6.8\%, respectively, further highlighting the advantages of \textit{anchoring}. We also investigate applying \textit{anchoring} to the features from~\citet{chaptoukaev2023stressid} (row 3) by training only the projection head, as opposed to both the encoder and the projection head. Although this yields decent results, the inability to learn the features end to end is a drawback. Finally, the entire ADAPT pipeline (row 6) delivers competitive results while accommodating missing modalities.