Audio-Visual Self-Supervision for Frame-Level Player-wise Offensive Shot Detection in Table Tennis Matches

Published: 01 Jan 2024, Last Modified: 10 Nov 2024, Venue: MMSports@MM 2024, License: CC BY-SA 4.0
Abstract: Understanding decision-making processes is informative for strategic planning. To understand human risk-taking behavior in decision-making, we investigate whether shots in table tennis videos can be classified as offensive or not. We define the problem in a multi-task setting: detecting shots with frame-level precision while classifying shot offensiveness and, as an optional task, predicting which player made the shot. We use commercial table tennis videos as the target of analysis and propose audio-visual self-supervised training that leverages web videos with similar camera views. Our local contrastive loss encourages the model to learn frame-wise action locality and is used together with the traditional segment-wise contrastive loss, which we call the global contrastive loss. Experimental results show that combining the two contrastive losses improves prediction performance.
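As a rough illustration of how a segment-wise (global) and a frame-wise (local) audio-visual contrastive objective could be combined, the following is a minimal NumPy sketch. The abstract does not give the loss formulas, so everything here is an assumption rather than the authors' method: the InfoNCE form, mean pooling over frames for the segment embedding, the use of other frames in the same segment as local negatives, the `temperature` and `local_weight` parameters, and all function names are hypothetical.

```python
import numpy as np


def info_nce(a, b, temperature=0.1):
    """One-directional InfoNCE: row i of `a` should match row i of `b`."""
    # L2-normalise so the dot product is a cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal.
    return float(-np.mean(np.diag(log_prob)))


def global_loss(audio, video):
    """Segment-wise loss (assumed): pool frames, contrast across segments."""
    # audio, video: arrays of shape (segments, frames, dim)
    return info_nce(audio.mean(axis=1), video.mean(axis=1))


def local_loss(audio, video):
    """Frame-wise loss (assumed): within each segment, audio frame t should
    match video frame t; the other frames of that segment are negatives."""
    return float(np.mean([info_nce(a, v) for a, v in zip(audio, video)]))


def combined_loss(audio, video, local_weight=1.0):
    """Weighted sum of the two losses; `local_weight` is a hypothetical knob."""
    return global_loss(audio, video) + local_weight * local_loss(audio, video)


# Toy data: 4 segments, 8 frames each, 16-dimensional embeddings.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8, 16))
video = audio + 0.01 * rng.normal(size=audio.shape)  # temporally aligned
misaligned = video[:, ::-1]                          # frame order reversed
```

In this toy setup the local term is sensitive to temporal alignment (it increases when the video frames are reversed), whereas the mean-pooled global term cannot see frame order at all. That is one plausible reading of why a frame-level term might complement a segment-level one for frame-precise shot detection.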