Multimodal Learning from Egocentric Videos and Motion Data for Action Recognition

NeurIPS 2023 Workshop Gaze Meets ML Submission 12 Authors

06 Oct 2023 (modified: 27 Oct 2023) · Submitted to Gaze Meets ML 2023
Keywords: Multimodal Learning, Action Recognition, Deep Learning, Egocentric Videos, Eye Gaze, Hand Pose, Head Pose
TL;DR: We investigate whether eye gaze, hand pose, and head pose improve egocentric vision-based action recognition, which remains challenging due to partial visibility of the user and abrupt camera movements.
Abstract: Action recognition from egocentric videos remains challenging due to issues such as partial visibility of the user and abrupt camera movements. To address these challenges, we propose a multimodal approach that combines vision data from egocentric videos with motion data from head-mounted sensors to recognize everyday office activities such as typing on a keyboard, reading a document, or drinking from a mug. To evaluate our approach, we used a dataset of egocentric videos and sensor readings from 17 subjects performing these activities. Our multimodal model fuses image features, extracted from the videos with deep convolutional networks, with motion features from eye gaze, hand tracking, and head pose sensors. The fused representation is used to train a classifier that distinguishes between 14 activities. Our approach achieves an F1 score of 84.36%, outperforming the unimodal (vision-only and sensor-only) baselines by up to 33 percentage points. The results demonstrate that body tracking technology can partly compensate for the limitations of egocentric videos, improving activity recognition accuracy by 1–2 percentage points. The inclusion of eye gaze data further enhances classification accuracy for actions that involve precise eye movements, such as reading and using a phone.
Submission Type: Full Paper
Submission Number: 12
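Below is a minimal sketch of the kind of feature-level fusion the abstract describes: concatenating precomputed CNN image features with gaze/hand/head motion features before a 14-way classifier. The class name, layer sizes, and feature dimensions are assumptions for illustration, not the authors' actual architecture.

import torch
import torch.nn as nn

class MultimodalFusionClassifier(nn.Module):
    # Fuses CNN image features with gaze/hand/head motion features and
    # classifies among 14 office activities; all dimensions are hypothetical.
    def __init__(self, img_dim=2048, motion_dim=64, num_classes=14):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, 256)        # project image features
        self.motion_proj = nn.Linear(motion_dim, 256)  # project motion features
        self.classifier = nn.Sequential(
            nn.Linear(512, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, img_feat, motion_feat):
        # Late fusion: concatenate the projected modality embeddings.
        fused = torch.cat(
            [self.img_proj(img_feat), self.motion_proj(motion_feat)], dim=-1
        )
        return self.classifier(fused)

# Usage with random placeholder features for a batch of 8 clips.
model = MultimodalFusionClassifier()
logits = model(torch.randn(8, 2048), torch.randn(8, 64))
print(logits.shape)  # torch.Size([8, 14])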