Abstract: In the rapidly advancing field of computer vision, multimodal models, and vision-language frameworks in particular, have shown substantial promise for complex tasks such as video-based action spotting. This paper introduces Soccer-CLIP, a vision-language model designed specifically for soccer action spotting. Soccer-CLIP incorporates a domain-specific prompt engineering strategy that leverages large language models (LLMs) to refine textual representations for precise alignment with soccer-specific actions. Our model integrates visual and textual features to improve the recognition accuracy of critical soccer events, and it applies temporal augmentation techniques to input videos, building on existing methodologies to address the inherent challenge of temporally sparse event annotations within long video sequences. Evaluations on the SoccerNet Action Spotting benchmark show that Soccer-CLIP outperforms previous state-of-the-art models, demonstrating its capacity to capture domain-specific contextual nuances. This work represents a significant advance in automated sports analysis, providing a robust and adaptable framework for broader applications in video recognition and temporal action localization.
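The core mechanism the abstract describes, scoring video content against a bank of soccer-specific text prompts in a shared embedding space, can be pictured with the minimal sketch below. This is an illustration under assumptions, not the authors' implementation: the prompt templates, the action list, and the encoders (`embed_text`, `embed_frame`, here random placeholders standing in for frozen CLIP text and visual towers) are all hypothetical names introduced for exposition.

```python
# Minimal CLIP-style action-spotting sketch (assumptions labeled below).
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 512

# Hypothetical action vocabulary; the paper's actual classes come from SoccerNet.
SOCCER_ACTIONS = ["goal", "penalty", "corner kick", "yellow card", "substitution"]
# Domain-specific prompt templates of the kind an LLM might refine.
PROMPTS = [f"a video frame of a {a} during a soccer match" for a in SOCCER_ACTIONS]

def embed_text(prompt: str) -> np.ndarray:
    """Placeholder for a CLIP text encoder (random here, L2-normalized)."""
    v = rng.standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)

def embed_frame(frame: np.ndarray) -> np.ndarray:
    """Placeholder for a CLIP visual encoder (ignores `frame`; random here)."""
    v = rng.standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)

def spot_action(frame: np.ndarray, text_bank: np.ndarray) -> tuple[str, float]:
    """Score one frame against every action prompt via cosine similarity."""
    f = embed_frame(frame)
    sims = text_bank @ f  # dot product == cosine, since both sides are normalized
    best = int(np.argmax(sims))
    return SOCCER_ACTIONS[best], float(sims[best])

text_bank = np.stack([embed_text(p) for p in PROMPTS])
dummy_frame = np.zeros((224, 224, 3), dtype=np.float32)
print(spot_action(dummy_frame, text_bank))
```

In an actual system, per-frame scores like these would be aggregated over time to localize events, which is where the temporal handling the abstract mentions becomes essential.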
External IDs: dblp:journals/access/ShinPHJLK25