Keywords: action difference reasoning
TL;DR: We introduce a novel task that jointly provides quantitative and qualitative evaluations of athlete performance, and we propose a keypoint-guided Monte Carlo Tree Search framework to model the reasoning process behind performance differences.
Abstract: Analyzing fine-grained differences in skilled activities, such as sports or surgery, poses a significant challenge for computer vision, demanding both precise action understanding and domain-specific reasoning. While prior work has made progress in evaluating individual performance, existing methods fall short when comparing two similar actions (e.g., a penalty kick in soccer) executed by different performers and explaining \textit{how} those actions differ. To address this gap, we introduce \textbf{Action Difference Reasoning (ADR)}, a novel task that jointly provides \textit{quantitative} performance scores and \textit{qualitative} explanations of inter-performer differences, enabling actionable feedback for improvement. To support this task, we construct the ADR dataset, built upon the Ego-Exo4D dataset, comprising paired videos annotated with both performance scores and natural language descriptions of action differences. We further propose \textbf{KEPT}, a \textit{\textbf{ke}y\textbf{p}oint-guided \textbf{t}ree search} framework that explicitly models the reasoning process behind performance differences by capturing fine-grained kinematic cues. Experiments on the ADR dataset show that KEPT significantly outperforms existing baselines, including large vision-language models, on both score prediction and action difference explanation. Moreover, our framework generalizes effectively to traditional Action Quality Assessment (AQA) settings, surpassing state-of-the-art approaches on benchmarks including JIGSAWS and FitnessAQA. Code, model, and dataset will be released after the review process.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5430