Multiscale Multimodal Transformer for Multimodal Action Recognition

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Keywords: audio and video classification, multimodal action recognition
Abstract: While action recognition has been an active research area for several years, most existing approaches leverage only the video modality, whereas humans process video and audio cues simultaneously and efficiently. This limits recent models to applications where actions are visually well-defined. Moreover, both audio and video can be perceived hierarchically: in audio classification, for example, from the raw signal at each sampling point, to audio activities, to the overall category. In this work, we develop a multiscale multimodal Transformer (MMT) that employs hierarchical representation learning. In particular, MMT comprises a novel multiscale audio Transformer (MAT) and a multiscale video Transformer. Furthermore, we propose a set of multimodal supervised contrastive objectives, an audio-video contrastive loss (AVC) and an intra-modal contrastive loss (IMC), that explicitly align the two modalities for robust multimodal representation fusion. Without external training data, MMT surpasses previous state-of-the-art approaches by 7.3%, 1.6%, and 2.1% top-1 accuracy on Kinetics-Sounds, Epic-Kitchens-100, and VGGSound, respectively. Moreover, our MAT significantly outperforms AST by 22.2%, 4.4%, and 4.7% on the same three public benchmarks while being 3x more efficient in terms of FLOPs. Through extensive ablation studies and visualizations, we demonstrate that the proposed MMT captures semantically more separable feature representations from combined video and audio signals.
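The abstract names two supervised contrastive objectives, AVC (cross-modal) and IMC (intra-modal), without giving their form. As intuition only, here is a minimal PyTorch sketch of what such objectives typically look like; it is not the authors' implementation. The function names (`supervised_contrastive`, `avc_imc_loss`), the temperature value, the equal AVC/IMC weighting, and the handling of self-pairs in the intra-modal term are all assumptions.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive(anchor, keys, labels, temperature=0.07):
    """Supervised contrastive loss: embeddings whose samples share an
    action label are pulled together; all other pairs are pushed apart.
    anchor, keys: (B, D) embeddings; labels: (B,) class indices."""
    anchor = F.normalize(anchor, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = anchor @ keys.t() / temperature              # (B, B) similarities
    # mask[i, j] = 1 where samples i and j share the same action label
    mask = (labels[:, None] == labels[None, :]).float()
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # mean log-likelihood over the positives of each anchor
    return -(mask * log_prob).sum(1).div(mask.sum(1)).mean()

def avc_imc_loss(audio_emb, video_emb, labels):
    # AVC (assumed form): symmetric cross-modal alignment, audio anchors
    # against video keys and vice versa
    avc = 0.5 * (supervised_contrastive(audio_emb, video_emb, labels)
                 + supervised_contrastive(video_emb, audio_emb, labels))
    # IMC (assumed form): each modality contrasted against itself
    # (self-pairs kept for simplicity; the paper's exact formulation may differ)
    imc = 0.5 * (supervised_contrastive(audio_emb, audio_emb, labels)
                 + supervised_contrastive(video_emb, video_emb, labels))
    return avc + imc
```

In the symmetric cross-modal term, the matching audio-video pair of each clip is always a positive, while other clips of the same action class supply additional positives; using the class labels this way is what makes the objective supervised rather than purely instance-discriminative.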
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (e.g., speech processing, computer vision, NLP)