2M-AF: A Strong Multi-Modality Framework For Human Action Quality Assessment with Self-supervised Representation Learning

Yuning Ding; Sifan Zhang; Shenglan Liu; Jinrong Zhang; Wenyue Chen; Duan Haifei; Bingcheng Dong; Tao Sun

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, Venue: MM2024 Poster, License: CC BY 4.0

Abstract: Human Action Quality Assessment (AQA) is a prominent area of research in human action analysis. Current mainstream methods consider only the RGB modality, which limits feature representation and yields insufficient performance given the complexity of the AQA task. In this paper, we propose a simple and modular framework called the Two-Modality Assessment Framework (2M-AF), which comprises a skeleton stream, an RGB stream, and a regression module. For the skeleton stream, we develop the Self-supervised Mask Encoder Graph Convolution Network (SME-GCN) to learn representations and then perform score assessment. Additionally, we propose a Preference Fusion Module (PFM) to fuse features, which can effectively avoid the disadvantages of the individual modalities. Our experimental results demonstrate the superiority of the proposed 2M-AF over current state-of-the-art methods on three publicly available datasets: AQA-7, UNLV-Diving, and MMFS-63.

Primary Subject Area: [Content] Multimodal Fusion

Relevance To Conference: Current Action Quality Assessment methods consider only the RGB modality, which limits feature representation and yields insufficient performance given the complexity of the AQA task. In this paper, we propose a simple and modular framework called the Two-Modality Assessment Framework (2M-AF), which comprises a skeleton stream, an RGB stream, and a regression module. For the skeleton stream, we develop the Self-supervised Mask Encoder Graph Convolution Network (SME-GCN) to learn representations and then perform score assessment. Additionally, we propose a Preference Fusion Module (PFM) to fuse features, which can effectively avoid the disadvantages of the individual modalities. Our experimental results demonstrate the superiority of the proposed 2M-AF over current state-of-the-art methods on three publicly available datasets: AQA-7, UNLV-Diving, and MMFS-63.

Supplementary Material: zip

Submission Number: 2477
