Abstract: Temporal action localization is a classic computer vision problem in video understanding with a wide range of
applications. In the context of sports videos, it is integrated into most of the current solutions used by coaches,
broadcasters and game specialists to support performance analysis, strategy development, and an enhanced
viewing experience. This work presents an application study on temporal action localization for tennis
broadcast videos. We study and evaluate a foundational video understanding model for identifying tennis
actions in match footage. We explore its architecture, specifically the state space model, from video input to the
prediction of temporal segments and classification labels. Our experiments provide findings and interpretations
of the model’s performance on tennis data. The model achieves a mean Average Precision (mAP), averaged
over all thresholds, of 66.14% on the TenniSet dataset, surpassing the other methods, and 96.16% on our
private French Open dataset.