Abstract: Multi-view action recognition is a critical task in computer vision, with broad applications in surveillance, robotics, and video-content analysis. Traditional single-view approaches suffer from a limited field of view and occlusion, leading to an incomplete understanding of actions and a higher likelihood of misclassification. Moreover, most existing methods rely on constrained environments with strong label annotations, in which the onset and offset times of each action are meticulously labeled at the frame level. However, annotating such strong labels for multi-view video sequences in real-world scenarios is time-consuming and labor-intensive. In many cases, only a weak video-level (sequence-level) label is available, providing just the action class for the entire video sequence, which limits recognition accuracy. To overcome this limitation, we propose Multi-modal Multi-view Action Selection Learning (MMASL), which integrates audio and video data to perform frame-level action recognition in large-area environments using sequence-level weak labels. The key components of MMASL are a modality-specific Shared Audio Encoder and Shared Video Encoder, together with an Action Selection Learning (ASL) mechanism. The encoders process input data from multiple views by extracting and unifying features from the audio and video modalities, while ASL dynamically selects relevant frames across views, filtering out irrelevant information and focusing on critical action segments to enhance recognition accuracy. By incorporating audio alongside video, MMASL improves recognition of visually ambiguous actions that are distinguishable through sound. Experiments in a real-world office environment on the MM-Office dataset demonstrate that MMASL outperforms state-of-the-art methods, achieving improvements of up to 8.81% in class-wise mean Average Precision (mAPC) and 8.43% in sample-wise mean Average Precision (mAPS), highlighting the significance of multi-modal multi-view action recognition with ASL in real-world scenarios.
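To make the pipeline concrete, below is a minimal PyTorch sketch of the architecture the abstract describes: one shared encoder per modality applied across all views, frame-level class scores gated by a learned selection score, and top-k pooling so the model can be trained with only weak sequence-level labels. All module names, dimensions, and the top-k pooling rule are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of an MMASL-style pipeline; every design detail below
# (layer sizes, fusion by concatenation, top-k pooling) is an assumption.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """One encoder reused across all views of a single modality."""
    def __init__(self, in_dim: int, hid_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())

    def forward(self, x):            # x: (batch, views, frames, in_dim)
        return self.proj(x)          # (batch, views, frames, hid_dim)

class MMASL(nn.Module):
    def __init__(self, audio_dim: int, video_dim: int, num_classes: int,
                 hid_dim: int = 256, topk: int = 8):
        super().__init__()
        self.audio_enc = SharedEncoder(audio_dim, hid_dim)   # shared over views
        self.video_enc = SharedEncoder(video_dim, hid_dim)
        self.classifier = nn.Linear(2 * hid_dim, num_classes)  # frame scores
        self.selector = nn.Linear(2 * hid_dim, 1)              # "actionness"
        self.topk = topk

    def forward(self, audio, video):
        # audio/video: (batch, views, frames, dim); fuse modalities per frame.
        feats = torch.cat([self.audio_enc(audio), self.video_enc(video)], dim=-1)
        cls_scores = self.classifier(feats)        # (B, V, T, C)
        sel = torch.sigmoid(self.selector(feats))  # (B, V, T, 1)
        frame_logits = cls_scores * sel            # suppress irrelevant frames

        # Pool over views and frames via the top-k selected frames per class,
        # yielding video-level logits trainable with sequence-level labels.
        flat = frame_logits.flatten(1, 2)          # (B, V*T, C)
        top = flat.topk(self.topk, dim=1).values   # (B, k, C)
        return top.mean(dim=1), frame_logits       # video-level, frame-level

# Usage: weak video-level labels drive a multi-label BCE loss.
model = MMASL(audio_dim=128, video_dim=512, num_classes=12)
audio = torch.randn(2, 4, 32, 128)   # 2 clips, 4 views, 32 frames
video = torch.randn(2, 4, 32, 512)
labels = torch.zeros(2, 12)
labels[0, 3] = 1.0
video_logits, _ = model(audio, video)
loss = nn.functional.binary_cross_entropy_with_logits(video_logits, labels)
```

The top-k pooling step is one common way to bridge frame-level scores and a sequence-level loss in weakly supervised settings; at inference, the gated frame-level logits would provide the frame-level predictions.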
External IDs: doi:10.1145/3744742