Offline Opponent Modeling with Truncated Q-driven Instant Policy Refinement

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose TIPR, a plug-and-play framework that instantly refines policies during test time using an in-context horizon-truncated Q function, effectively improving offline opponent modeling algorithms trained on suboptimal datasets.
Abstract: Offline Opponent Modeling (OOM) aims to learn an adaptive autonomous agent policy that dynamically adapts to opponents using an offline dataset from multi-agent games. Previous work assumes that the dataset is optimal. However, this assumption is difficult to satisfy in the real world, and when the dataset is suboptimal, existing approaches perform poorly. To tackle this issue, we propose a simple and general algorithmic improvement framework, Truncated Q-driven Instant Policy Refinement (TIPR), to mitigate the suboptimality of OOM algorithms induced by such datasets. The TIPR framework is plug-and-play in nature. Compared to original OOM algorithms, it requires only two extra steps: (1) Learn a horizon-truncated in-context action-value function, namely Truncated Q, using the offline dataset. The Truncated Q estimates the expected return within a fixed, truncated horizon and is conditioned on opponent information. (2) During testing, use the learned Truncated Q to instantly decide whether to perform policy refinement and to generate the refined policy. Theoretically, we analyze the rationale of Truncated Q from the perspective of No Maximization Bias probability. Empirically, we conduct extensive comparison and ablation experiments in four representative competitive environments. TIPR effectively improves various OOM algorithms pretrained with suboptimal datasets.
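The two extra steps described in the abstract lend themselves to a compact sketch. The following is an illustrative outline only, not the paper's implementation: the interfaces `q_net`, `dataset`, `base_policy`, and `candidate_actions` are hypothetical names assumed for this example, and the actual TIPR conditioning and refinement rule may differ.

```python
# Minimal sketch of the two TIPR steps, under assumed interfaces:
#   q_net(opponent_context, state, action) -> scalar Truncated Q estimate
#   dataset: offline trajectories carrying states, actions, rewards, opponent info
#   base_policy(state, opponent_context) -> action from a pretrained OOM policy
# None of these names come from the paper; they are placeholders.

import torch


def truncated_return(rewards, t, horizon, gamma=0.99):
    """Discounted return over a fixed, truncated horizon starting at step t."""
    return sum(gamma ** k * rewards[t + k]
               for k in range(min(horizon, len(rewards) - t)))


def train_truncated_q(q_net, dataset, horizon, epochs=10):
    """Step (1): regress Q(s, a | opponent context) onto truncated returns."""
    optim = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    for _ in range(epochs):
        for traj in dataset:
            for t in range(len(traj.states)):
                target = truncated_return(traj.rewards, t, horizon)
                pred = q_net(traj.opponent_context[t], traj.states[t], traj.actions[t])
                loss = (pred - torch.as_tensor(target, dtype=pred.dtype)) ** 2
                optim.zero_grad()
                loss.backward()
                optim.step()


def tipr_act(q_net, base_policy, state, opponent_context, candidate_actions):
    """Step (2): at test time, keep the base action unless Truncated Q
    indicates a candidate refinement is better."""
    with torch.no_grad():
        best_action = base_policy(state, opponent_context)
        best_value = q_net(opponent_context, state, best_action)
        for a in candidate_actions:
            v = q_net(opponent_context, state, a)
            if v > best_value:          # instant refinement decision
                best_action, best_value = a, v
    return best_action
```

The key design point this sketch tries to convey is that refinement happens only at test time and only when the in-context, horizon-truncated value estimate favors a different action, so the pretrained OOM policy is left untouched otherwise.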
Lay Summary: In competitive multi-agent games, AI agents often learn from past interactions to anticipate and respond to opponents. However, when the available data is imperfect or suboptimal, existing learning approaches can falter. Our research introduces a plug-and-play framework called Truncated Q-driven Instant Policy Refinement (TIPR), which enhances these learning approaches by enabling agents to refine their policies in real-time during testing, even when trained on less-than-ideal data. We demonstrate that TIPR significantly improves agent performance across various competitive scenarios, making AI more adaptable and effective in real-world applications where perfect data is rarely available.
Primary Area: Reinforcement Learning->Everything Else
Keywords: Opponent Modeling, Offline, In-context Learning
Submission Number: 747