Keywords: robustness, out-of-distribution generalization, human activity recognition, multimodal fusion
Abstract: Robustness to distribution shifts remains a challenge for vision-based recognition. While such failures have been documented in video classification [1,2], robustness evaluations in human activity recognition (HAR) are often limited to controlled benchmarks or synthetic corruptions. Less is known about robustness to natural variations in environment, embodiment, and camera viewpoint, or how it relates to the recognition task [3].
We study robustness using a multimodal dataset of daily activities comprising unscripted recordings across multiple environments, body positions, and camera views [5]. The DARai dataset provides natural shifts that approximate out-of-distribution conditions. We evaluate current vision models under three scenarios: cross-view, cross-body, and cross-environment recognition. Performance is analyzed at hierarchical annotation levels, from coarse (L1) to fine-grained (L3), following recent video activity benchmarks [4].
Submission Number: 239