Towards Offline Opponent Modeling with In-context Learning

Published: 16 Jan 2024, Last Modified: 07 Mar 2024
Venue: ICLR 2024 poster
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Opponent Modeling, Offline, Transformer
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Opponent modeling aims at learning the opponent's behaviors, goals, or beliefs in order to reduce the uncertainty of the competitive environment and assist decision-making. Existing work has mostly focused on learning opponent models online, which is impractical and inefficient in real-world scenarios. To this end, we formalize an Offline Opponent Modeling (OOM) problem with the objective of utilizing pre-collected offline datasets to learn opponent models that characterize the opponent from the viewpoint of the controlled agent, thereby aiding adaptation to the opponent's unknown fixed policies. Drawing on the promise of Transformers for decision-making, we introduce a general approach, Transformer Against Opponent (TAO), for OOM. Essentially, TAO tackles the problem by harnessing the in-context learning capabilities of supervised pre-trained Transformers. TAO rests on three stages: an innovative offline policy embedding learning stage, an offline opponent-aware response policy training stage, and a deployment stage for opponent adaptation with in-context learning. Theoretical analysis establishes TAO's equivalence to Bayesian posterior sampling in opponent modeling and guarantees TAO's convergence in opponent policy recognition. Extensive experiments and ablation studies on competitive environments with sparse and dense rewards demonstrate the impressive performance of TAO. Our approach adapts remarkably quickly, especially in the face of unseen opponent policies, confirming its in-context learning potency.
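To make the three-stage pipeline described in the abstract concrete, below is a minimal Python sketch of how the stages might fit together. The component names (PolicyEncoder, ResponsePolicy, deploy) are hypothetical stand-ins for the paper's policy embedding model, opponent-aware response policy, and in-context deployment loop; this illustrates the structure only and is not the authors' implementation.

```python
# Hypothetical sketch of TAO's three-stage pipeline, based only on the abstract.
# All class and function names are illustrative, not the authors' code.
import numpy as np


class PolicyEncoder:
    """Stage 1: map offline opponent trajectories to policy embeddings."""

    def __init__(self, embed_dim: int = 8, seed: int = 0):
        self.rng = np.random.default_rng(seed)
        self.embed_dim = embed_dim
        self.proj = None  # fixed random projection, created lazily

    def embed(self, trajectory: list[tuple[np.ndarray, int]]) -> np.ndarray:
        # Placeholder: average a fixed random projection of (obs, action) pairs.
        feats = np.asarray(
            [np.append(obs, act) for obs, act in trajectory], dtype=float
        )
        if self.proj is None:
            self.proj = self.rng.standard_normal((feats.shape[1], self.embed_dim))
        return feats.mean(axis=0) @ self.proj


class ResponsePolicy:
    """Stage 2: opponent-aware response policy conditioned on the embedding."""

    def act(self, obs: np.ndarray, opponent_embedding: np.ndarray) -> int:
        # Placeholder decision rule; a Transformer policy would go here.
        return int((obs.sum() + opponent_embedding.sum()) > 0)


def deploy(encoder: PolicyEncoder, policy: ResponsePolicy, env_rollouts):
    """Stage 3: in-context adaptation -- refresh the embedding from the
    growing interaction history and act against the unknown opponent."""
    history = []
    for obs, opponent_action in env_rollouts:
        history.append((obs, opponent_action))
        z = encoder.embed(history)   # belief over the opponent, updated in context
        yield policy.act(obs, z)     # response conditioned on that belief
```

Re-embedding the accumulated interaction history at each deployment step is what gives the third stage its in-context, posterior-sampling flavor noted in the abstract.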
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Primary Area: reinforcement learning
Submission Number: 40