Attention-driven multimodal alignment for long-term action quality assessment

Published: 10 Jul 2025 · Last Modified: 13 Nov 2025 · OpenReview Archive Direct Upload · License: CC BY-NC-ND 4.0
Abstract: Long-term action quality assessment (AQA) focuses on evaluating the quality of human activities in videos lasting up to several minutes. This task plays an important role in the automated evaluation of artistic sports such as rhythmic gymnastics and figure skating, where both accurate motion execution and temporal synchronization with background music are essential for performance assessment. However, existing methods predominantly fall into two categories: unimodal approaches that rely solely on visual features, which are inadequate for modeling multimodal cues like music; and multimodal approaches that typically employ simple feature-level contrastive fusion, overlooking deep cross-modal collaboration and temporal dynamics. As a result, they struggle to capture complex interactions between modalities and fail to accurately track critical performance changes throughout extended sequences. To address these challenges, we propose the Long-term Multimodal Attention Consistency Network (LMAC-Net). LMAC-Net introduces a multimodal attention consistency mechanism that explicitly aligns features across different modalities, enabling stable integration of complementary multimodal information and significantly enhancing feature representation capabilities. Specifically, a multimodal local query encoder module with learnable queries is designed to automatically capture temporal semantics within each modality while dynamically modeling complementary relationships across modalities. To ensure interpretable evaluation results, we adopt a two-level score evaluation module, where stage-wise scores are first computed and then aggregated into a final overall score. Additionally, we apply attention-based feature-level and regression-based result-level losses to jointly optimize multimodal alignment and decision-layer fusion.
Experiments conducted on the RG and Fis-V datasets demonstrate that LMAC-Net significantly outperforms existing methods, validating the effectiveness of our proposed approach.
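The pipeline described in the abstract (learnable queries attending over per-modality features, cross-modal fusion, an attention-based feature-level consistency term, and two-level stage-to-overall scoring) can be illustrated with a minimal NumPy sketch. All dimensions, the MSE form of the consistency loss, and the softmax-weighted stage aggregation are illustrative assumptions for this sketch, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn_map(queries, features):
    # Scaled dot-product attention weights: (Q, T)
    return softmax(queries @ features.T / np.sqrt(features.shape[-1]))

def query_attention(queries, features):
    # Learnable queries pool the (T, d) feature sequence into (Q, d)
    return attn_map(queries, features) @ features

# Hypothetical sizes: feature dim, frames, number of stages (queries)
rng = np.random.default_rng(0)
d, T, Q = 16, 32, 4
visual = rng.normal(size=(T, d))   # stand-in for visual backbone features
audio = rng.normal(size=(T, d))    # stand-in for audio/music features
queries = rng.normal(size=(Q, d))  # learnable stage queries (shared across modalities)

# Per-modality query pooling, then simple concatenation fusion
v_ctx = query_attention(queries, visual)
a_ctx = query_attention(queries, audio)
fused = np.concatenate([v_ctx, a_ctx], axis=-1)  # (Q, 2d)

# Feature-level consistency: encourage the two modalities' attention
# maps (over time, per stage query) to agree; MSE is an assumed choice.
consistency_loss = float(
    np.mean((attn_map(queries, visual) - attn_map(queries, audio)) ** 2)
)

# Two-level scoring: per-stage scores, then a softmax-weighted
# aggregation into one overall score (illustrative regression head).
w_stage = rng.normal(size=(2 * d,))
stage_scores = fused @ w_stage          # (Q,) interpretable stage scores
stage_weights = softmax(stage_scores)
final_score = float(stage_weights @ stage_scores)
```

In a trained model the queries, attention projections, and regression head would be learned end-to-end, with the consistency loss applied alongside the score-regression loss; here they are random arrays purely to show the data flow.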