Prompted Policy Search: Reinforcement Learning through Linguistic and Numerical Reasoning in LLMs

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY-NC 4.0
Keywords: Reinforcement Learning, LLM, In-Context Learning, Linguistic and Numerical Optimization
TL;DR: Prompted Policy Search uses LLMs for RL policy search, outperforming common methods across diverse tasks.
Abstract: Reinforcement Learning (RL) traditionally relies on scalar reward signals, limiting its ability to leverage the rich semantic knowledge often available in real-world tasks. In contrast, humans learn efficiently by combining numerical feedback with language, prior knowledge, and common sense. We introduce Prompted Policy Search (ProPS), a novel RL method that unifies numerical and linguistic reasoning within a single framework. Unlike prior work that augments existing RL components with language, ProPS places a large language model (LLM) at the center of the policy optimization loop, directly proposing policy updates based on both reward feedback and natural language input. We show that LLMs can perform numerical optimization in-context, and that incorporating semantic signals, such as goals, constraints, and strategy hints, can lead to more informed exploration and sample-efficient learning. ProPS is evaluated across 15 Gymnasium tasks, spanning classic control, Atari games, and MuJoCo environments, and compared to seven widely adopted RL algorithms (e.g., PPO, SAC, TRPO). It outperforms all baselines on 8 out of 15 tasks and demonstrates substantial gains when provided with domain knowledge. These results highlight the potential of unifying semantics and numerics for transparent, generalizable, and human-aligned reinforcement learning.
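The sketch below illustrates the kind of loop the abstract describes: an LLM is shown the numerical history of (policy parameters, episodic return) pairs together with natural-language hints, and is asked to propose the next parameter vector to evaluate. This is a minimal illustration, not the authors' implementation; the linear CartPole policy, the `propose_params` stub (which here substitutes a random perturbation for the actual LLM call), and all names are assumptions made for demonstration.

```python
# Illustrative sketch of a prompted policy-search loop (hypothetical, not ProPS itself).
import json
import numpy as np
import gymnasium as gym


def evaluate(env, params, episodes=5):
    """Average return of a linear threshold policy on CartPole (assumed policy class)."""
    returns = []
    for _ in range(episodes):
        obs, _ = env.reset()
        total, done = 0.0, False
        while not done:
            action = int(np.dot(params, obs) > 0.0)  # simple linear policy
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    return float(np.mean(returns))


def propose_params(history, hints):
    """Stand-in for the LLM call: the prompt would contain the numerical history and
    the semantic hints, and the model's reply would be parsed into a new parameter
    vector. Here a random perturbation of the best candidate keeps the loop runnable."""
    prompt = json.dumps({"history": history, "hints": hints})  # what the LLM would see
    best = max(history, key=lambda h: h["return"])["params"] if history else [0.0] * 4
    return (np.array(best) + np.random.normal(0.0, 0.5, size=4)).tolist()


env = gym.make("CartPole-v1")
hints = ["Keep the pole upright; pole angle and angular velocity matter most."]
history = []
for step in range(20):
    params = propose_params(history, hints)
    ret = evaluate(env, params)
    history.append({"params": params, "return": ret})
    print(f"iteration {step}: average return {ret:.1f}")
```

In the full method described by the abstract, the proposal step would be handled by the LLM reasoning jointly over the reward history and the linguistic hints, rather than by the random perturbation used here to keep the example self-contained.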
Supplementary Material: zip
Primary Area: Reinforcement learning (e.g., decision and control, planning, hierarchical RL, robotics)
Submission Number: 26193