TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation

Ling You, Wenxuan Huang, Xinni Xie, Xiangyi Wei, Bangyan Li, Shaohui Lin, Yang Li, Changbo Wang

Published: 27 Oct 2025 · Last Modified: 05 Nov 2025 · License: CC BY-SA 4.0
Abstract: Soccer is a globally popular sport, typically characterized by long matches and distinctive highlight moments. Recent advances in Multimodal Large Language Models (MLLMs) show promising capabilities in temporal grounding and video understanding. However, generating soccer commentary requires both precise temporal localization and semantically rich descriptions over long-form videos. Existing soccer MLLMs often rely on temporal priors for caption generation, which limits their ability to process the entire video end-to-end. Traditional approaches, on the other hand, follow a complex two-step paradigm that fails to capture the global context, leading to suboptimal performance. To address these issues, we present TimeSoccer, the first end-to-end soccer MLLM for Single-anchor Dense Video Captioning (SDVC) in full-match soccer videos. TimeSoccer jointly predicts timestamps and generates captions in a single pass, enabling global context modeling across 45-minute matches. To support long video understanding of soccer matches, we introduce MoFA-Select, a training-free, motion-aware frame compression module that adaptively selects representative frames via a coarse-to-fine strategy, and we incorporate complementary training paradigms to strengthen the model's ability to handle long temporal sequences. Extensive experiments demonstrate that TimeSoccer achieves state-of-the-art (SoTA) performance on the SDVC task in an end-to-end manner, generating high-quality commentary with accurate temporal alignment and strong semantic relevance. For more information, please visit: https://vpx-ecnu.github.io/TimeSoccer-Website/.
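The abstract describes MoFA-Select only at a high level: a training-free module that scores motion and selects representative frames coarse-to-fine under a fixed frame budget. The sketch below is a hypothetical illustration of that idea, not the paper's actual algorithm; the function name `mofa_select`, the mean-absolute-difference motion score, and the uniform-grid coarse stage are all assumptions for demonstration.

```python
import numpy as np

def mofa_select(frames, budget, coarse_stride=8):
    """Hypothetical coarse-to-fine, motion-aware frame selection sketch.

    frames: (T, H, W) grayscale video array.
    budget: number of frames to keep.
    Returns sorted indices of the selected frames.
    """
    T = len(frames)
    if T <= budget:
        return list(range(T))
    # Motion score: mean absolute difference between consecutive frames
    # (an assumed proxy; the paper does not specify its scoring function).
    diffs = np.abs(frames[1:].astype(np.float32) - frames[:-1].astype(np.float32))
    motion = np.concatenate([[0.0], diffs.mean(axis=(1, 2))])
    # Coarse stage: candidate frames on a uniform temporal grid.
    candidates = set(range(0, T, coarse_stride))
    # Fine stage: add the highest-motion frames until the budget is filled.
    for idx in np.argsort(motion)[::-1]:
        if len(candidates) >= budget:
            break
        candidates.add(int(idx))
    # If the grid alone exceeds the budget, keep its highest-motion members.
    selected = sorted(candidates, key=lambda i: motion[i], reverse=True)[:budget]
    return sorted(selected)
```

The key property this sketch preserves from the abstract's description is adaptivity: uniform sampling guarantees temporal coverage, while the motion-ranked refinement concentrates the remaining budget on highlight-like segments.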