AHFu-Net: Align, Hallucinate, and Fuse Network for Missing Multimodal Action Recognition

Published: 01 Jan 2023, Last Modified: 06 Dec 2024, VCIP 2023, CC BY-SA 4.0
Abstract: In this work, we explore the multimodal action recognition problem in the RGB-Depth setting, where a subset of the learning modalities is missing at inference time. To address this issue, we construct a hallucination network that generates the missing modality's information from the available modality at inference time. We propose the key components of an effective framework: a spatio-temporal encoder for strong unimodal performance, built from a Local Patch Temporal Transformer (LPTT) and a Spatial Encoder Transformer (SET); alignment of multimodal features; and a fusion strategy based on our Multimodal Bottleneck Transformer Fusion module (MMBTF). We incorporate these ideas into a novel framework named AHFu-Net (Align, Hallucinate, and Fuse Network) for RGB-Depth action recognition. Our experiments demonstrate that AHFu-Net achieves state-of-the-art performance while maintaining high accuracy under missing modalities on the multimodal NTU-RGB+D and NW-UCLA datasets.
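To make the "hallucinate, then fuse" pipeline concrete, the sketch below illustrates the idea in PyTorch. It is a minimal illustration under assumed shapes and layer choices (a two-layer MLP standing in for the hallucination network, and a single bottleneck-attention layer standing in for MMBTF), not the authors' released implementation; all class names, dimensions, and token counts here are hypothetical.

    # Minimal sketch: when depth is missing at inference time, a hallucination
    # network predicts pseudo-depth tokens from RGB tokens, and the two streams
    # are fused through a small set of shared bottleneck tokens. Everything
    # here is an illustrative assumption, not the paper's actual architecture.
    import torch
    import torch.nn as nn


    class Hallucinator(nn.Module):
        """Maps available-modality (RGB) tokens to pseudo-depth tokens."""
        def __init__(self, dim: int = 256):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

        def forward(self, rgb_tokens: torch.Tensor) -> torch.Tensor:
            return self.net(rgb_tokens)


    class BottleneckFusion(nn.Module):
        """Two modality streams exchange information only via bottleneck tokens."""
        def __init__(self, dim: int = 256, heads: int = 4, n_bottleneck: int = 4):
            super().__init__()
            # Learned bottleneck tokens shared across both modalities.
            self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, dim) * 0.02)
            self.layer_rgb = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.layer_depth = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.n = n_bottleneck

        def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
            # rgb, depth: (batch, tokens, dim) features from per-modality encoders.
            z = self.bottleneck.expand(rgb.size(0), -1, -1)
            # Each stream attends over its own tokens plus the shared bottleneck,
            # so cross-modal information flows only through the bottleneck tokens.
            out = self.layer_rgb(torch.cat([rgb, z], dim=1))
            rgb, z = out[:, :-self.n], out[:, -self.n:]
            out = self.layer_depth(torch.cat([depth, z], dim=1))
            return rgb, out[:, :-self.n]


    # Inference with the depth stream missing: hallucinate it from RGB.
    rgb = torch.randn(2, 32, 256)                    # tokens from the RGB encoder
    depth = Hallucinator()(rgb)                      # pseudo-depth tokens
    rgb_fused, depth_fused = BottleneckFusion()(rgb, depth)
    print(rgb_fused.shape, depth_fused.shape)        # both (2, 32, 256)

The bottleneck restricts cross-modal exchange to a few shared tokens, which keeps fusion cheap and forces each stream to distill what it passes to the other; the hallucinator lets the same fusion path run even when only RGB is observed.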