Attention-driven multimodal alignment for long-term action quality assessment
Abstract: Long-term action quality assessment (AQA) focuses on evaluating the quality of human activities in videos
lasting up to several minutes. This task plays an important role in the automated evaluation of artistic
sports such as rhythmic gymnastics and figure skating, where both accurate motion execution and temporal
synchronization with background music are essential for performance assessment. However, existing methods
predominantly fall into two categories: unimodal approaches that rely solely on visual features, which are
inadequate for modeling multimodal cues like music; and multimodal approaches that typically employ simple
feature-level contrastive fusion, overlooking deep cross-modal collaboration and temporal dynamics. As a
result, they struggle to capture complex interactions between modalities and fail to accurately track critical
performance changes throughout extended sequences. To address these challenges, we propose the
Long-term Multimodal Attention Consistency Network (LMAC-Net). LMAC-Net introduces a multimodal attention
consistency mechanism that explicitly aligns features across different modalities, enabling stable integration
of complementary multimodal information and significantly enhancing feature representation capabilities.
Specifically, a multimodal local query encoder module with learnable queries is designed to automatically
capture temporal semantics within each modality while dynamically modeling complementary relationships
across modalities. To ensure interpretable evaluation results, we adopt a two-level score evaluation module,
where stage-wise scores are computed first and then used to produce the final overall score. Additionally, we
apply an attention-based feature-level loss and a regression-based result-level loss to jointly optimize
multimodal alignment and
decision-layer fusion. Experiments conducted on the RG and Fis-V datasets demonstrate that LMAC-Net
significantly outperforms existing methods, validating the effectiveness of our proposed approach.
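
The two-level scoring and the joint objective described above can be illustrated with a minimal sketch. It is not the LMAC-Net implementation: the abstract does not specify how stage-wise scores are aggregated or how the two losses are defined, so the plain sum, the mean-squared attention difference, the squared-error regression term, and the names `overall_score`, `attention_consistency`, `joint_loss`, and `weight` are all assumptions made for illustration.

```python
from typing import Sequence


def overall_score(stage_scores: Sequence[float]) -> float:
    """Aggregate stage-wise scores into one overall score.

    Assumption: a plain sum; the abstract does not state the aggregation rule.
    """
    return float(sum(stage_scores))


def attention_consistency(
    attn_a: Sequence[Sequence[float]],
    attn_b: Sequence[Sequence[float]],
) -> float:
    """Mean squared difference between two attention maps of equal shape.

    Assumption: a stand-in for the attention-based feature-level loss.
    """
    total, count = 0.0, 0
    for row_a, row_b in zip(attn_a, attn_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count


def joint_loss(
    stage_scores: Sequence[float],
    target: float,
    attn_visual: Sequence[Sequence[float]],
    attn_audio: Sequence[Sequence[float]],
    weight: float = 1.0,
) -> float:
    """Result-level regression error plus a weighted feature-level alignment term."""
    regression = (overall_score(stage_scores) - target) ** 2  # assumed squared error
    alignment = attention_consistency(attn_visual, attn_audio)
    return regression + weight * alignment


if __name__ == "__main__":
    # Hypothetical values: two stage scores and fully disagreeing attention maps.
    print(joint_loss([1.0, 2.0], 3.0, [[1.0, 0.0]], [[0.0, 1.0]], weight=0.5))
```

In this toy setup the regression term vanishes when the summed stage scores match the target, so the remaining loss comes only from disagreement between the visual and audio attention maps.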