Abstract: This study introduces an automated system for fine-grained stroke recognition in broadcast table tennis videos, designed to reduce the burden of manual annotation and support tactical analysis in international competitions. The proposed framework, an Adaptive Temporal Difference Model with a Transformer Encoder (ATDT), combines Temporal Difference Networks (TDN) and Temporal Adaptive Modules (TAM) to strengthen spatial and temporal feature extraction. Supervised contrastive learning is used to make the learned features more discriminative for fine-grained action recognition. The system consists of two primary modules: the Action Segmentation Module (ASM) and the Action Recognition Module (ARM). ASM determines the start and end times of each stroke by using ball trajectory analysis to locate hit timings and placements. This segmentation allows ARM to apply a three-stage recognition process: forehand and backhand classification, group-based classification, and intra-group action classification. The hierarchical approach improves the system’s ability to differentiate between subtle stroke variations, even under the constraints of low-resolution broadcast footage. To validate the framework, we collected the MISTT dataset, comprising 3,618 stroke action clips from 18 international matches, with professional player annotations. The proposed ATDT model outperformed existing methods, improving top-1 accuracy by 18% for forehand strokes and 25.58% for backhand strokes compared to baseline models. Moreover, the automatic annotation system takes only 1/30 of the time required by manual annotation, demonstrating its efficiency.
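The abstract does not state which form of supervised contrastive learning is used. As a rough illustration only, the following sketch implements the commonly used supervised contrastive (SupCon) loss in plain Python. The function name, the temperature value, and the toy embeddings are assumptions made for this example and are not taken from the paper.

```python
import math


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Mean SupCon loss over anchors that have at least one positive.

    Illustrative sketch: the name and default temperature are assumptions,
    not the authors' implementation.
    """
    # L2-normalise so that dot products become cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        # Positives are the other samples sharing the anchor's label.
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # no same-class sample to pull towards
        # Temperature-scaled similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n) if j != i
        }
        # Log of the softmax denominator, computed in a numerically stable way.
        peak = max(sims.values())
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in sims.values()))
        # Average negative log-probability of the positives for this anchor.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy 2-D embeddings (hypothetical values, not data from the MISTT dataset).
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_clustered = supervised_contrastive_loss(toy, [0, 0, 1, 1])
loss_mixed = supervised_contrastive_loss(toy, [0, 1, 0, 1])
```

In this toy case the loss is lower when same-label embeddings lie close together than when labels are assigned across the two clusters, which is the behaviour that encourages more discriminative features for visually similar stroke classes.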
External IDs: doi:10.1145/3769299