Keywords: Game Theory, Imperfect Information Games, Test-time Learning, Reinforcement Learning, Two-player Zero-sum Games, Gadget Games, Policy-Gradient Algorithms
Abstract: Recent advances in artificial intelligence have demonstrated that using additional computation during inference improves performance across domains ranging from games to language models. In this work, we explore applying additional test-time training in imperfect-information games. We focus on two-player zero-sum games and show that modern policy-gradient algorithms, which converge to equilibria when applied to the full game, can produce highly exploitable strategies when applied locally at test time. This phenomenon, previously observed in tabular settings, persists in the function-approximation regime. We extend safe subgame-solving techniques based on gadget games from the tabular setting to reinforcement learning and show that they can prevent this degradation. Scaling these methods to more complex domains may require learned generative models of the environment, as test-time training demands the ability to generate states and trajectories on the fly, and restarting a simulator from the current position may not always be feasible. More broadly, our findings are relevant beyond games: LLM-based agents are increasingly trained via reinforcement learning, and similar safeguards may be necessary when such agents are trained in adversarial settings.
Submission Number: 71