Abstract: The application of Automatic Speech Recognition (ASR) technology in soccer enables sports analytics by transcribing audio commentaries, providing insights into game events and facilitating automatic game understanding. This paper presents SoccerNet-Echoes, an extension of the SoccerNet dataset with automatically generated transcriptions of soccer game broadcasts. Generated using the Whisper model and translated into English with Google Translate when needed, these transcriptions enrich the video content with textual information derived from the game audio. SoccerNet-Echoes serves as a comprehensive resource for developing algorithms in action spotting, caption generation, and game summarization. Through a series of experiments, we demonstrate that combining modalities (audio, video, and text) yields mixed results on classification tasks: the combination of audio and video improves performance over individual modalities, while the addition of ASR text does not significantly enhance results. Additionally, our summarization baselines indicate that ASR content enriches summaries, offering insights beyond event information alone. This multimodal dataset supports diverse applications, broadening the scope of research in sports analytics. The dataset is available at: https://github.com/SoccerNet/sn-echoes.
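As a rough illustration of the transcription step described above, the following minimal sketch uses the openai-whisper package to transcribe a broadcast audio file into timestamped text segments. The model size, file path, and output format shown here are assumptions for illustration; the abstract does not specify the exact configuration used to build SoccerNet-Echoes.

```python
# Minimal sketch: transcribing broadcast audio with Whisper.
# Assumptions: the openai-whisper package is installed, "large-v2" is used
# as the model size, and "game_audio.mp3" is a placeholder path.
import whisper

model = whisper.load_model("large-v2")

# transcribe() returns the full text, per-segment timestamps, and the
# detected language; non-English results could then be passed to a
# translation step (e.g., Google Translate), as described in the abstract.
result = model.transcribe("game_audio.mp3")

print(result["language"])
for segment in result["segments"]:
    print(f'{segment["start"]:.1f}-{segment["end"]:.1f}: {segment["text"]}')
```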