Hoop-MSSL: Multi-Task Self-supervised Representation Learning on Basketball Spatio-Temporal Data

Published: 13 Oct 2024, Last Modified: 02 Dec 2024. Venue: NeurIPS 2024 Workshop SSL. License: CC BY 4.0
Keywords: Self-supervised Learning, Contrastive learning, Representation learning, Sports analysis, Multi-agent behavior
TL;DR: Hoop-MSSL is a multi-task self-supervised learning framework that uses masking augmentation and three pre-training tasks to effectively capture spatio-temporal features and role relationships on a basketball court.
Abstract: Observing and identifying the on-court behaviors of basketball players, who engage in intricate spatial-temporal interactions with teammates and opponents, have long been considered challenging tasks for machines. Early approaches relied on supervised learning to capture spatial-temporal information and the role relationships between players; these frameworks required labeled data and could not be generalized to other tasks. To address these limitations, some recent works have drawn inspiration from the field of autonomous driving to develop self-supervised learning frameworks for trajectory data. However, these frameworks mainly focus on single tasks such as trajectory reconstruction or prediction and do not take basketball domain knowledge into account. In this work, we propose Hoop-MSSL, a multi-task self-supervised representation learning framework that handles the complex interactions and dependencies in spatial-temporal data on the basketball court. Specifically, Hoop-MSSL integrates masking augmentation with three pre-training tasks, (i) motion reconstruction, (ii) player-role identification, and (iii) contrastive learning, to capture spatial-temporal features and role relationships across multiple dimensions. To evaluate the efficacy of Hoop-MSSL, we conducted extensive linear-probing experiments on three downstream tasks. Our results demonstrate that the synergistic interaction among all of the Hoop-MSSL components helps the model learn more general spatial-temporal representations, allowing it to outperform variants built from only subsets of the components on all downstream tasks. Finally, a high masking ratio (80\%) further significantly enhances the model's ability to learn useful representations.
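The abstract's central augmentation, randomly masking a high fraction (80%) of the spatio-temporal input before the pre-training tasks reconstruct it, can be illustrated with a minimal sketch. The function name, tensor layout (players × timesteps × coordinates), and zero-fill convention below are assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def mask_trajectories(traj, mask_ratio=0.8, seed=None):
    """Randomly mask time steps of multi-player trajectory data.

    traj: array of shape (players, timesteps, coords), e.g. (10, T, 2)
          for the (x, y) positions of ten players on court.
    Returns the masked copy and the boolean mask (True = masked),
    which a reconstruction head would then try to fill back in.
    """
    rng = np.random.default_rng(seed)
    players, timesteps, _ = traj.shape
    # Independently mask each (player, timestep) cell with prob. mask_ratio.
    mask = rng.random((players, timesteps)) < mask_ratio
    masked = traj.copy()
    masked[mask] = 0.0  # zero out masked positions (one common convention)
    return masked, mask

# Example: 10 players, 50 time steps, (x, y) coordinates.
traj = np.random.rand(10, 50, 2)
masked, mask = mask_trajectories(traj, mask_ratio=0.8, seed=0)
```

A high ratio like 0.8 leaves the encoder very little visible context, which, as the abstract reports, pushes it to learn more useful spatial-temporal representations rather than interpolating locally.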
Submission Number: 7