Opponent Simulation as Inference-time Scaling for Self-improving Agents: A Case Study of Repeated Negotiations
Keywords: strategic reasoning, llm negotiations, inference-time techniques
TL;DR: We develop a framework for improving LLM agents on strategic reasoning and decision-making tasks
Abstract: Large language models (LLMs) have recently emerged as powerful decision-makers across a wide range of reasoning-intensive tasks. While prior work has made great progress in single-agent environments, less effort has been devoted to settings where LLMs must engage in \emph{repeated} and \emph{strategic} interactions without prior knowledge of their opponents. In such settings, traditional self-play or offline training, though robust against worst-case adversaries, does not fully leverage the flexibility of LLMs to continually self-improve from interaction feedback. To address this, we introduce a general inference-time framework called best-of-$N$ (BoN) sampling with opponent simulation (\ours), with a case study in repeated negotiation games. The framework scales inference-time computation by embedding the principles of a classical game-theoretic learning dynamic, \emph{fictitious play (FP)}, into practical LLM implementations: (i) for the belief formation step, we introduce a separate LLM as an opponent model that learns in context to imitate the \emph{time-averaged} behavior of the opponent from past interactions; (ii) for the best response step, we perform BoN sampling by simulating future outcomes with the opponent model, where candidates are generated through a structured strategic brainstorming process. Empirical evaluations on two repeated negotiation games, buyer-seller negotiation and resource exchange negotiation, demonstrate that our method achieves significant self-improvement over repeated interactions compared with various baselines, offering a lightweight and scalable approach to strategic reasoning and decision-making.
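For concreteness, here is a minimal Python sketch of the two-step loop described in the abstract. It rests on assumed interfaces: `llm_generate`, `evaluate`, the `Episode` record, and the prompt wording are hypothetical placeholders, not the authors' implementation; only the structure (an in-context opponent model for belief formation, then best-of-$N$ selection over brainstormed candidates scored by simulated outcomes) mirrors the abstract's description.

```python
# Minimal sketch of best-of-N sampling with opponent simulation.
# All interfaces below are illustrative assumptions, not the paper's code.
from dataclasses import dataclass


@dataclass
class Episode:
    """One past negotiation round: our move, the opponent's reply, our payoff."""
    our_move: str
    opponent_reply: str
    payoff: float


def llm_generate(prompt: str) -> str:
    """Placeholder for a call to any LLM completion API."""
    raise NotImplementedError


def evaluate(candidate: str, simulated_reply: str) -> float:
    """Domain-specific scoring of a simulated outcome (e.g., estimated payoff)."""
    raise NotImplementedError


def simulate_opponent(history: list[Episode], candidate: str) -> str:
    """Belief-formation step: a separate LLM imitates, in context, the
    opponent's time-averaged behavior from all past interactions."""
    transcript = "\n".join(
        f"Us: {e.our_move}\nThem: {e.opponent_reply}" for e in history
    )
    prompt = (
        "Role-play the opponent below, replying as they typically would,\n"
        "averaged over their past behavior.\n"
        f"Past interactions:\n{transcript}\n"
        f"Our next offer: {candidate}\n"
        "Opponent's reply:"
    )
    return llm_generate(prompt)


def best_of_n_with_opponent_simulation(history: list[Episode], n: int = 8) -> str:
    """Best-response step: brainstorm n candidate offers, simulate each against
    the opponent model, and play the one with the best simulated outcome."""
    brainstorm_prompt = (
        f"Given the negotiation so far, propose {n} diverse candidate offers,\n"
        "one per line, each following a distinct strategic rationale."
    )
    candidates = [c for c in llm_generate(brainstorm_prompt).splitlines() if c][:n]
    scored = [(evaluate(c, simulate_opponent(history, c)), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]
```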
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14946