Keywords: Game, LLM, Multi-Turn Environments, Prompt Sensitivity
TL;DR: Multi-LLM game benchmarks are brittle; COPER stabilizes play by evolving context and replaying experience, nearly doubling win rates without tuning model weights.
Abstract: Recent multi-player game benchmarks can be sensitive: modest changes to role, system, or judge prompts often flip win-rate rankings under identical decoding, and static, read-once game descriptions fail to impart the game-specific priors (rules, legality, action→transition effects) needed for consistent play. We document this context-induced instability and argue that evaluation should be agentic: let interaction surface and solidify priors, then evaluate models for both their strength (performance) and reliability (consistency under perturbations). To establish more reliable baselines, we present COPER, a backbone-agnostic, tuning-free self-play recipe that (i) evolves prompts using a conservative TrueSkill lower-confidence bound, (ii) writes structured reflections into a persistent experience bank retrieved across turns to supply rule-aware priors, and (iii) uses prioritized replay to revisit rare, informative states for sample-efficient stabilization. Across five text games, COPER raises mean win rate from 24.9% → 49.5% (GPT-4o-mini) and 21.7% → 44.3% (Qwen-2.5-7B-Instruct) with a small budget (5×400 self-play games per task), and stabilizes agent performance under evaluation-time perturbations. These results show that much of today's LLM game headroom can be unlocked by context rather than weight updates, with COPER yielding strong improvements in negotiation games, competitive results in some imperfect-information settings, and RL remaining more effective in perfect-information games.
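The following is a minimal sketch (not the authors' code) of step (i), selecting prompt variants by a conservative TrueSkill lower-confidence bound. The `PromptPool` class, the LCB multiplier k=3, and the helper names are illustrative assumptions; COPER's actual prompt-evolution procedure may differ.

```python
# Illustrative sketch of conservative prompt selection via a TrueSkill
# lower-confidence bound; assumes the open-source `trueskill` package.
import random
import trueskill


class PromptPool:
    """Keeps a pool of candidate prompts, each with a TrueSkill rating."""

    def __init__(self, prompts):
        self.ratings = {p: trueskill.Rating() for p in prompts}

    def lcb(self, prompt, k=3.0):
        # Conservative score: mean skill minus k standard deviations
        # (k=3 is an assumed value, not taken from the paper).
        r = self.ratings[prompt]
        return r.mu - k * r.sigma

    def record_game(self, winner_prompt, loser_prompt, drawn=False):
        # Update both prompts' ratings from one self-play outcome.
        w, l = trueskill.rate_1vs1(
            self.ratings[winner_prompt], self.ratings[loser_prompt], drawn=drawn
        )
        self.ratings[winner_prompt], self.ratings[loser_prompt] = w, l

    def best(self):
        # Promote the prompt with the highest lower-confidence bound.
        return max(self.ratings, key=self.lcb)


if __name__ == "__main__":
    pool = PromptPool(["prompt_a", "prompt_b", "prompt_c"])
    for _ in range(50):
        a, b = random.sample(list(pool.ratings), 2)
        winner, loser = (a, b) if random.random() < 0.5 else (b, a)
        pool.record_game(winner, loser)
    print("selected:", pool.best())
```

Ranking by the lower bound rather than the mean penalizes prompts whose skill estimate is still uncertain, so only variants with consistently demonstrated strength survive evolution.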
Primary Area: applications to robotics, autonomy, planning
Submission Number: 12803