%%%% NEW SECTION : IMPLEMENTATION DETAILS %%%%%
\section{Implementation details.}
\subsection{Architectural and training details.}\label{appendix:implementations}
\input{tables/hyperparams}
\noindent Table~\ref{tab:hyperparams} lists the architectural details used for each dataset. The seed is fixed to 1999.
In our experiments, we use the AdamW~\cite{loshchilov2018decoupled} optimizer for both stages, with a weight decay of 0.05 and a starting learning rate of $10^{-4}$ that follows a cosine schedule, preceded by a linear warm-up of 4 epochs. Models are trained on 4 NVIDIA Tesla V100 GPUs using PyTorch~\cite{paszke2019pytorch}. Gradient norms are clipped to 1 to ensure stability during training.
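
\noindent As an illustration, the schedule described above can be written in closed form. This is a sketch under assumptions not stated in the text: the schedule is indexed by epoch $t$, $T$ denotes the total number of training epochs, $\eta_{0} = 10^{-4}$ is the value reached at the end of the $T_{w} = 4$ warm-up epochs, and the cosine phase decays to zero.
\[
\eta(t) =
\left\{
\begin{array}{ll}
\eta_{0}\,\displaystyle\frac{t}{T_{w}}, & 0 \le t < T_{w},\\[2ex]
\displaystyle\frac{\eta_{0}}{2}\left(1 + \cos\left(\pi\,\frac{t - T_{w}}{T - T_{w}}\right)\right), & T_{w} \le t \le T.
\end{array}
\right.
\]
Both branches agree at $t = T_{w}$, where $\eta(T_{w}) = \eta_{0}$. Likewise, if clipping is applied to the global $\ell_{2}$ norm of the gradient $g$ (also an assumption), the clipped gradient is
\[
\tilde{g} = g \cdot \min\left(1, \frac{1}{\| g \|_{2}}\right).
\]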

\noindent \emph{Augmentations used for multi-view contrastive learning.} We apply two data augmentations sequentially, Gaussian noise and modality dropout, each with a probability of 0.5.
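
\noindent One possible formalization of this augmentation pipeline is sketched below. The noise scale $\sigma$, the choice of dropping a single uniformly sampled modality, and the choice of zeroing the dropped modality are assumptions made for illustration, not details given above. For a sample with modalities $x_{1}, \dots, x_{M}$, draw $b_{n}, b_{d} \sim \mathrm{Bernoulli}(0.5)$ independently, $\epsilon_{m} \sim \mathcal{N}(0, \sigma^{2} I)$ for each modality, and an index $k$ uniformly from $\{1, \dots, M\}$. The augmented view is then
\[
\tilde{x}_{m} = \left(1 - b_{d}\,\mathbf{1}[m = k]\right)\left(x_{m} + b_{n}\,\epsilon_{m}\right), \qquad m = 1, \dots, M,
\]
so that noise is added first (when $b_{n} = 1$) and modality $k$ is dropped afterwards (when $b_{d} = 1$).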


\subsection{Comparison to the state of the art.}\label{appendix:sota} 
We compare our approach directly with the methods used in~\citet{chaptoukaev2023stressid}, which established the state of the art for \texttt{StressID}:
\begin{enumerate}
    \item `feature-level fusion': unimodal features are combined into a single high-dimensional feature vector, which is used as input to an MLP. The MLP is trained with cross-entropy loss for \texttt{StressID} and with cost-sensitive cross-entropy~\cite{huang_2016} for \texttt{LOC}, both to tackle class imbalance and to ensure a fair comparison with ADAPT.
    \item `decision-level fusion': independent SVMs are trained for each modality using the unimodal features as input, and the results of the individual classifiers are integrated at the decision level, i.e., combined into a single decision using ensemble rules. Four decision rules are proposed in~\citet{chaptoukaev2023stressid}: sum rule fusion, average rule fusion, product rule fusion, and maximum rule fusion. We report the best of the four decision rules.
\end{enumerate}
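
\noindent For reference, the four decision rules can be written in the standard classifier-combination form below. The notation is ours and is an assumption about how the rules are instantiated: $p_{m}(c \mid x)$ denotes the score (e.g., a calibrated posterior) assigned to class $c$ by the SVM of modality $m$, with $m = 1, \dots, M$.
\[
\begin{array}{ll}
\mathrm{sum{:}} & \hat{y} = \arg\max_{c} \sum_{m=1}^{M} p_{m}(c \mid x),\\[1.5ex]
\mathrm{average{:}} & \hat{y} = \arg\max_{c} \frac{1}{M} \sum_{m=1}^{M} p_{m}(c \mid x),\\[1.5ex]
\mathrm{product{:}} & \hat{y} = \arg\max_{c} \prod_{m=1}^{M} p_{m}(c \mid x),\\[1.5ex]
\mathrm{maximum{:}} & \hat{y} = \arg\max_{c} \max_{m} \, p_{m}(c \mid x).
\end{array}
\]
Under this formulation, the sum and average rules yield the same prediction whenever all $M$ modalities are available, since they differ only by the constant factor $1/M$.
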
Additionally, we implemented ShaSpec~\cite{wang2023multi}. ShaSpec maximizes the utilization of all available input modalities during training and evaluation by learning shared and specific features for better data representation. However, because our input modalities have different dimensions (e.g., 3D video, 1D biomedical signals), employing a shared encoder on raw inputs is nontrivial. Hence, we use an adapted version, \emph{ShaSpec+}, in which encoded inputs, rather than raw inputs, are fed to the shared encoder. To ensure a fair comparison with ADAPT, we use identical settings, including the same encoders: Hiera~\cite{hiera} for video, BYOL-A~\cite{niizumi2021byol} for audio, and a 1D CNN~\cite{wang2023contrast} for biomedical signals. Additionally, to address class imbalance on the \texttt{LOC} dataset, we substitute the cross-entropy loss with cost-sensitive cross-entropy~\cite{huang_2016}.
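
\noindent To make the loss substitution concrete, a common class-weighted form of cost-sensitive cross-entropy is given below as an illustration. The specific weighting is an assumption; the exact costs used are not restated here. For a sample with one-hot label $y \in \{0, 1\}^{C}$ and predicted class probabilities $\hat{p} \in [0, 1]^{C}$,
\[
\mathcal{L}_{\mathrm{CS}} = - \sum_{c=1}^{C} w_{c}\, y_{c} \log \hat{p}_{c},
\]
where $w_{c} > 0$ is the cost attached to class $c$. One frequent choice is inverse class frequency, $w_{c} = N / (C\, N_{c})$, with $N_{c}$ the number of training samples of class $c$ and $N$ the total number of training samples. Setting $w_{c} = 1$ for all $c$ recovers the standard cross-entropy loss.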
