Multi-level Multi-modal Feature Fusion for Action Recognition in Videos

HCMA@MM 2022 (modified: 24 Apr 2023)
Abstract: Several multi-modal feature fusion approaches have been proposed in recent years to improve action recognition in videos. These approaches do not take full advantage of the multi-modal information in videos, since they are biased towards a single modality or treat modalities separately. To address this problem, we propose Multi-Level Multi-modal feature Fusion (MLMF) for action recognition in videos. MLMF projects each modality into a shared feature space and a modality-specific feature space. Based on the similarity between the two modalities' shared-space features, we augment the features in the specific feature space. As a result, the fused features not only incorporate the unique characteristics of the two modalities but also explicitly emphasize their similarities. Moreover, action segments in a video differ in length, so the model must ensemble features at different levels for fine-grained action recognition. A unified multi-level action representation is obtained by aggregating features across these levels. Our approach is evaluated on the EPIC-KITCHENS-100 dataset and achieves encouraging action recognition results.
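Below is a minimal sketch of the shared/specific fusion step the abstract describes, for two modality feature vectors. The module name, dimensions, and the cosine-similarity gating are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLMFFusion(nn.Module):
    """Hypothetical sketch of MLMF-style fusion: project each modality
    into a shared space and a specific space, then weight the fusion by
    the similarity of the shared-space embeddings."""

    def __init__(self, dim_a: int, dim_b: int, dim: int = 256):
        super().__init__()
        # Shared-space and modality-specific projections (dimensions assumed).
        self.shared_a = nn.Linear(dim_a, dim)
        self.shared_b = nn.Linear(dim_b, dim)
        self.specific_a = nn.Linear(dim_a, dim)
        self.specific_b = nn.Linear(dim_b, dim)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # Embed both modalities in the shared feature space.
        s_a = self.shared_a(feat_a)
        s_b = self.shared_b(feat_b)
        # Cosine similarity between shared embeddings, rescaled to [0, 1].
        sim = 0.5 * (1.0 + F.cosine_similarity(s_a, s_b, dim=-1)).unsqueeze(-1)
        # Augment the modality-specific features with similarity-weighted
        # shared features, so the fused vector keeps unique cues from each
        # modality while emphasizing what the modalities agree on.
        p_a = self.specific_a(feat_a)
        p_b = self.specific_b(feat_b)
        return torch.cat([p_a + sim * s_a, p_b + sim * s_b], dim=-1)


if __name__ == "__main__":
    # Example: fuse hypothetical RGB (1024-d) and audio (512-d) clip features.
    fusion = MLMFFusion(dim_a=1024, dim_b=512)
    rgb = torch.randn(8, 1024)
    audio = torch.randn(8, 512)
    print(fusion(rgb, audio).shape)  # torch.Size([8, 512])
```

For the multi-level aggregation, one would apply a module like this at each temporal level (e.g., clip and segment features) and pool the resulting fused vectors; the exact levels and pooling are not specified in the abstract.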