Abstract: Due to the complexity of football matches, high-quality player policies must exhibit diverse behaviors to collaborate effectively. However, in football tasks, where online interactions are time-consuming and costly, it is difficult to balance training efficiency against the emergence of diversity when learning policies from scratch. This paper therefore proposes a novel diversity-driven offline-to-online (DOTO) multi-player policy learning method for football matches. Specifically, we design a transformer-actor-critic module that applies uniformly to both the offline and online stages, enabling seamless adaptation: it is pre-trained offline on expert data, and the pre-trained weights initialize online fine-tuning. We then introduce an adaptive intention-clustering guidance module that operates across both the offline and online stages; it learns intention representations of the agents and performs intention-based clustering to construct adaptive loss functions that guide the policies toward diversity. In addition, to mitigate the bootstrapping errors caused by the distribution shift from offline to online learning, we implement an online partial random initialization mechanism, balancing the inherent conservativeness of offline learning with the flexible exploration of online learning. Extensive experiments in the Google Research Football environment show that DOTO not only raises the win rate to 50% in the hardest 11 vs. 11 task, surpassing all state-of-the-art methods, but also produces richer policy diversity.
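To make the intention-clustering guidance concrete, the following is a minimal PyTorch sketch of how such a module could look. The abstract does not specify the architecture or loss, so the IntentionEncoder design, the use of k-means for intention clustering, and the cosine-similarity cross-cluster penalty are all assumptions about one plausible realization, not the paper's actual method.

import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

class IntentionEncoder(nn.Module):
    """Hypothetical encoder: maps an agent's recent trajectory to a latent
    intention vector (architecture assumed, not from the paper)."""
    def __init__(self, obs_dim: int, hidden_dim: int = 64, intent_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, intent_dim),
        )

    def forward(self, traj_obs: torch.Tensor) -> torch.Tensor:
        # traj_obs: (num_agents, traj_len, obs_dim); mean-pool over time steps.
        return self.net(traj_obs).mean(dim=1)

def diversity_loss(intents: torch.Tensor, n_clusters: int = 3) -> torch.Tensor:
    """Cluster agent intentions, then penalize cosine similarity between
    agents in *different* clusters, pushing clusters to stay behaviorally
    distinct (one plausible form of an adaptive diversity loss)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        intents.detach().cpu().numpy()
    )
    labels = torch.as_tensor(labels, device=intents.device)
    # Pairwise cosine similarity matrix of shape (num_agents, num_agents).
    sim = F.cosine_similarity(intents.unsqueeze(0), intents.unsqueeze(1), dim=-1)
    cross = labels.unsqueeze(0) != labels.unsqueeze(1)  # cross-cluster pairs
    # Higher cross-cluster similarity -> higher loss -> more diversity pressure.
    return sim[cross].mean() if cross.any() else sim.new_zeros(())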
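The online partial random initialization mechanism can be sketched in the same spirit. Which parameters to reset (here, modules whose names contain a key such as "head") and the orthogonal initializer are hypothetical choices for illustration; the abstract only states that part of the network is randomly re-initialized before online fine-tuning to offset offline conservativeness.

import torch.nn as nn

def partial_random_init(model: nn.Module, reset_keys=("head",)) -> nn.Module:
    """Re-initialize linear layers whose names contain any of `reset_keys`,
    leaving the remaining pretrained weights untouched, so online training
    keeps most offline knowledge but regains exploration capacity."""
    for name, module in model.named_modules():
        if any(key in name for key in reset_keys) and isinstance(module, nn.Linear):
            nn.init.orthogonal_(module.weight)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
    return model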
External IDs: dblp:conf/ijcnn/WangPHMWY25