LLM Coaching LLM in Self-Play Training

11 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: LLM, Post-training, Game, Self-play
Abstract: Large language models (LLMs) have achieved impressive progress in domains such as mathematics and programming, supported by verifiable outputs and robust benchmarks. However, their potential in games, longstanding benchmarks for reinforcement learning (RL) research, remains underexplored. The most prominent challenge is the absence of ground truth combined with high payoff variance, especially in complex, strategic games like Texas Hold'em, where naively optimizing for the highest observed payoff risks trapping training in dead loops, while accurate payoff estimation is itself highly resource-intensive. To address these issues, we propose the LLM Coach, which transforms raw self-play (SP) payoffs into class-wise reward functions by leveraging payoff data, state information, and the current policy. This design stabilizes training and accelerates learning. Within an RL+SP framework, our Qwen2.5-32B agent significantly outperforms strong baselines (e.g., Grok4, GPT-o3) in Texas Hold'em Poker, while also exhibiting improvements in broader capabilities.
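The core idea of converting noisy self-play payoffs into class-wise rewards can be sketched as follows. This is a minimal illustration under the assumption that trajectories are grouped into coarse classes and rewarded by their class-average payoff; all names (`class_wise_rewards`, the hand classes) are hypothetical and not the paper's actual implementation.

```python
from collections import defaultdict
from statistics import mean

def class_wise_rewards(trajectories):
    """Replace each raw self-play payoff with its class-mean payoff.

    `trajectories` is a list of (state_class, raw_payoff) pairs. Averaging
    within a class reduces the variance of the training signal, which is the
    stabilizing effect the abstract attributes to class-wise rewards.
    """
    by_class = defaultdict(list)
    for cls, payoff in trajectories:
        by_class[cls].append(payoff)
    class_value = {c: mean(payoffs) for c, payoffs in by_class.items()}
    # Each trajectory is now rewarded with its class average, not its raw payoff.
    return [(cls, class_value[cls]) for cls, _ in trajectories]

# Toy example: two hand classes with noisy observed payoffs.
samples = [("strong_hand", 10.0), ("strong_hand", -2.0),
           ("weak_hand", -4.0), ("weak_hand", -6.0)]
smoothed = class_wise_rewards(samples)
```

In this toy run the two `strong_hand` trajectories both receive the class mean 4.0 rather than the high-variance raw payoffs 10.0 and -2.0, which is the kind of signal smoothing that avoids chasing a single lucky outcome.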
Primary Area: reinforcement learning
Submission Number: 4068