Abstract: Long-term action quality assessment is a challenging visual task, since it requires assessing technical actions performed at different skill levels across a long video. Recent state-of-the-art methods incorporate additional modality information to aid in understanding action semantics, which incurs extra annotation costs and imposes stronger constraints on action scenes and datasets. To address this issue, we propose a Quality-Guided Vision-Language Learning (QGVL) method that maps visual features into appropriate fine-grained intervals of quality scores. Specifically, we use a set of quality-related textual prompts as quality prototypes to guide the discrimination and aggregation of specific visual actions. To avoid ambiguous interval mapping, we further propose a progressive semantic learning strategy with a Granularity-Adaptive Semantic Learning Module (GSLM) that refines score intervals from coarse to fine at the clip, grade, and score levels. The quality-related semantics we design generalize to all types of action scenarios without any additional annotations. Extensive experiments show that our approach outperforms previous work by a significant margin and establishes new state-of-the-art results on four public AQA benchmarks: Rhythmic Gymnastics, Fis-V, FS1000, and FineFS.
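To make the prototype-guided mapping concrete, below is a minimal sketch of how quality-related textual prompts could serve as quality prototypes for score prediction. It is an illustrative assumption, not the paper's actual implementation: the prompt wording, the number of prototypes, the five-way interval partition, and the mean-pooled aggregation are all hypothetical, and the text embeddings would in practice come from a pretrained vision-language text encoder (e.g., CLIP) rather than random tensors.

```python
import torch
import torch.nn.functional as F

# Hypothetical dimensions; the paper does not specify these values.
num_clips, feat_dim, num_prototypes = 16, 512, 5

# Visual features for each clip of a long video (e.g., from a video backbone).
clip_feats = torch.randn(num_clips, feat_dim)

# Quality prototypes: embeddings of quality-related textual prompts such as
# "a {poor/fair/good/excellent/outstanding} execution of the action",
# assumed to be encoded by a CLIP-style text encoder.
prototypes = torch.randn(num_prototypes, feat_dim)

# Representative score of each prototype's interval on a 0-100 scale
# (illustrative interval centers, not the paper's actual partition).
interval_centers = torch.linspace(10.0, 90.0, num_prototypes)

# Cosine similarity between every clip and every quality prototype.
sim = F.normalize(clip_feats, dim=-1) @ F.normalize(prototypes, dim=-1).T

# Soft assignment of clips to prototypes guides discrimination/aggregation:
# each clip contributes to the quality levels it most resembles.
assign = sim.softmax(dim=-1)              # (num_clips, num_prototypes)
video_quality_dist = assign.mean(dim=0)   # video-level quality distribution

# Predicted score as the expectation over the interval centers.
score = (video_quality_dist * interval_centers).sum()
print(f"predicted quality score: {score.item():.2f}")
```

In this reading, the coarse-to-fine strategy would repeat such an assignment at successively finer partitions (clip, grade, then score level), with each stage narrowing the candidate interval produced by the previous one.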