The Road Not Taken: Hindsight Exploration for LLMs in Multi-Turn RL

Published: 12 Jun 2025, Last Modified: 21 Jun 2025, EXAIT@ICML 2025 Poster, CC BY 4.0
Track: Language Modeling
Keywords: Large Language Models, Multi-Turn Reinforcement Learning, Exploration in RL, Actor-Critic RL
TL;DR: HOPE (Hindsight Off-Policy Exploration) guides LLM agents' exploration in multi-turn RL through hindsight reasoning.
Abstract: Multi-turn reinforcement learning provides a principled framework for training LLM agents, but exploration remains a key bottleneck. Classical exploration strategies such as $\epsilon$-greedy and upper confidence bounds select random actions, failing to efficiently explore the combinatorial space of multi-turn token sequences. Our key insight is that LLMs can use hindsight to guide exploration: by analyzing completed trajectories, they can propose counterfactual actions that could have led to higher returns. We propose HOPE (Hindsight Off-Policy Exploration), which integrates hindsight-guided exploration into both the actor and critic stages of multi-turn RL. HOPE improves the critic's state-action coverage by generating rollouts from counterfactual actions, and steers the actor's exploration by using a learned counterfactual generator to propose alternative actions. Experimental results show that HOPE outperforms strong multi-turn RL baselines on the task-oriented dialogue tasks TwentyQuestions (success: $0.82 \rightarrow 0.97$) and GuessMyCity (success: $0.68 \rightarrow 0.75$), and on the tool-use dialogue task CarDealer (success: $0.72 \rightarrow 0.77$).
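The abstract describes HOPE at a high level; the following is a minimal Python sketch of how such a training loop might look, reconstructed only from that description. All names here (`HopeTrainer`, `update_actor`, `update_critic`, `propose_counterfactuals`, `rollout_return`) are hypothetical stand-ins for the paper's components, not the authors' actual implementation.

```python
"""Hypothetical sketch of a HOPE-style loop, inferred from the abstract alone.

The four callables stand in for: the LLM actor's update rule, the critic's
update rule, the hindsight counterfactual generator, and environment rollouts.
"""
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = str
Action = str
Step = Tuple[State, Action, float]  # (state, action, reward) for one dialogue turn


@dataclass
class HopeTrainer:
    update_actor: Callable[[State, Action], None]            # push policy toward (s, a)
    update_critic: Callable[[State, Action, float], None]    # fit Q(s, a) to a return
    propose_counterfactuals: Callable[[List[Step]], List[Tuple[State, Action]]]
    rollout_return: Callable[[State, Action], float]         # return after forcing (s, a)

    def train_on_episode(self, trajectory: List[Step]) -> None:
        episode_return = sum(r for _, _, r in trajectory)

        # Standard on-policy actor-critic updates on the executed trajectory.
        for state, action, _ in trajectory:
            self.update_critic(state, action, episode_return)
            self.update_actor(state, action)

        # Hindsight step: analyze the completed trajectory and propose
        # counterfactual actions that could have led to higher returns.
        for state, cf_action in self.propose_counterfactuals(trajectory):
            # Critic stage: roll out the counterfactual action to widen the
            # critic's state-action coverage with off-policy data.
            cf_return = self.rollout_return(state, cf_action)
            self.update_critic(state, cf_action, cf_return)

            # Actor stage: steer exploration toward counterfactuals that beat
            # the executed return (an assumed filter, not stated in the abstract).
            if cf_return > episode_return:
                self.update_actor(state, cf_action)
```

In this sketch, the counterfactual rollouts serve the critic (coverage) while the return-filtered updates serve the actor (directed exploration), mirroring the two stages the abstract names; how the paper actually gates or weights these updates is not specified here.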
Serve As Reviewer: ~Huaxiaoyue_Wang1, ~Sanjiban_Choudhury3
Submission Number: 71