FoPO: Foresight Policy Optimization Incentivizes Strategic Reasoning in LLMs

16 Sept 2025 (modified: 03 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: NLP; strategic reasoning
Abstract: Recent breakthroughs in Large Language Models (LLMs) have promoted the widespread use of AI agents in diverse social scenarios (e.g., Werewolf, Diplomacy), owing to their remarkable reasoning ability. In particular, strategic reasoning plays a pivotal role, enabling agents to communicate, cooperate, and compete with counterparts, and thereby to make more foresighted decisions. Existing approaches to strategic reasoning in LLMs have devoted limited attention to this vital aspect of foresight. In this paper, we introduce a novel method, termed \textbf{Fo}resight \textbf{P}olicy \textbf{O}ptimization (FoPO), which extends proximal policy optimization (PPO) with a correction term that guides the policy toward foresighted strategies. Our method encourages agents to consider both self-oriented outcomes and the potential behaviors and rewards of their counterparts, ultimately fostering genuine strategic foresight. To this end, we further curate a new dataset that requires AI agents to forecast the possible actions of their counterparts, comprising two game-theoretic tasks covering cooperation and competition. Applying FoPO in a self-play fashion, we evaluate LLMs of different sources and sizes on our dataset in multi-agent interaction environments. Experimental results confirm that our method effectively enhances strategic reasoning in AI agents, underscoring the importance of foresight in multi-agent environments.
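The abstract does not specify the exact form of the correction term. The following is a minimal sketch, assuming the correction enters as a weighted counterpart-advantage term added to the standard PPO clipped surrogate; the function name `fopo_loss` and the inputs `counterpart_advantages` and `foresight_coef` are illustrative assumptions, not the authors' implementation.

```python
import torch

def fopo_loss(logp_new, logp_old, self_advantages, counterpart_advantages,
              clip_eps=0.2, foresight_coef=0.1):
    """PPO-style clipped surrogate with a hypothetical foresight correction.

    Hypothetical reading of FoPO: the agent's own advantages drive the usual
    PPO objective, while a weighted term over the counterpart's (estimated)
    advantages rewards actions that account for the counterpart's likely
    behaviors and payoffs.
    """
    # Importance ratio between the updated and the behavior policy.
    ratio = torch.exp(logp_new - logp_old)

    # Standard PPO clipped surrogate on the agent's own advantages.
    unclipped = ratio * self_advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * self_advantages
    ppo_term = torch.min(unclipped, clipped)

    # Assumed foresight correction: blend in counterpart advantages so that
    # foresighted (counterpart-aware) actions receive extra reinforcement.
    foresight_term = ratio * counterpart_advantages

    # Negate for gradient descent on a loss.
    return -(ppo_term + foresight_coef * foresight_term).mean()
```

Under this reading, setting `foresight_coef = 0` recovers vanilla PPO, so the correction can be ablated directly; how the counterpart advantages are estimated (e.g., from the forecasting dataset described above) is left open here.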
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7918