Abstract: The advent of low-cost sensors for measuring gaze, heart rate, EEG, and galvanic skin response has made it feasible to collect physiological data from human operators inexpensively. However, leveraging this data for machine learning requires an effective multimodal fusion architecture. When working with multimodal features, uncovering the correlations between modalities is as crucial as identifying effective unimodal features. This paper proposes a hybrid multimodal tensor fusion network that learns both unimodal and bimodal dynamics for cognitive workload modeling. Our architecture comprises two parts: (1) an intra-modality component that learns high-level representations of each signal modality, and (2) an inter-modality component that models bimodal interactions through a tensor fusion layer built from the Cartesian product of modality embeddings. We compare this architecture to a cross-modal transformer fusion module that learns an inter-modality embedding. Experimental results on the HP Omnicept Cognitive Load Database (HPO-CLD) show that both techniques outperform the approaches most commonly used for multimodal fusion of physiological data, and that the cross-modal transformer fusion module is especially effective.
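To make the fusion idea concrete, the sketch below shows one common way a bimodal tensor fusion layer can be realized: the outer (Cartesian) product of two modality embeddings, flattened and passed to a prediction head. This is a minimal illustration in PyTorch, not the paper's implementation; the class name, dimensions, and the appended constant 1 (which preserves unimodal terms alongside bimodal interactions, as in standard tensor fusion networks) are assumptions for exposition.

```python
import torch
import torch.nn as nn

class BimodalTensorFusion(nn.Module):
    """Illustrative bimodal tensor fusion: outer product of two modality
    embeddings, flattened and projected to a workload prediction."""

    def __init__(self, dim_a=32, dim_b=32, hidden=64, n_classes=3):
        super().__init__()
        # A constant 1 is appended to each embedding so the fused tensor
        # retains unimodal terms in addition to the bimodal interactions.
        self.post_fusion = nn.Sequential(
            nn.Linear((dim_a + 1) * (dim_b + 1), hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, z_a, z_b):
        # z_a: (B, dim_a), z_b: (B, dim_b) -- unimodal embeddings
        ones = torch.ones(z_a.size(0), 1, device=z_a.device)
        z_a = torch.cat([z_a, ones], dim=1)                     # (B, dim_a + 1)
        z_b = torch.cat([z_b, ones], dim=1)                     # (B, dim_b + 1)
        fused = torch.bmm(z_a.unsqueeze(2), z_b.unsqueeze(1))   # (B, dim_a+1, dim_b+1)
        return self.post_fusion(fused.flatten(1))
```

In practice each modality embedding would come from its own intra-modality encoder (e.g., a recurrent or convolutional network over the raw signal); only the fusion step is sketched here.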