Track: Language Modeling
Keywords: Large Language Models, Multi-Turn Reinforcement Learning, Exploration in RL, Actor-Critic RL
TL;DR: HOPE (Hindsight Off-Policy Exploration) guides the exploration of LLM agents in multi-turn RL through hindsight reasoning.
Abstract: Multi-turn reinforcement learning provides a principled framework for training LLM agents, but exploration remains a key bottleneck.
Classical exploration strategies such as $\epsilon$-greedy and upper confidence bounds perturb individual actions, failing to efficiently explore the combinatorial space of multi-turn token sequences.
Our key insight is that LLMs can use hindsight to guide exploration: they can analyze completed trajectories and propose counterfactual actions that could have led to higher returns.
We propose HOPE (Hindsight Off-Policy Exploration), which integrates hindsight-guided exploration into both the actor and critic stages of multi-turn RL.
HOPE improves the critic's state-action coverage by generating rollouts from counterfactual actions, and steers the actor's exploration in RL by using a learned counterfactual generator to propose alternative actions.
Experimental results show that HOPE outperforms strong multi-turn RL baselines on the task-oriented dialogue tasks TwentyQuestions (success: $0.82 \rightarrow 0.97$) and GuessMyCity (success: $0.68 \rightarrow 0.75$), and on the tool-use dialogue task CarDealer (success: $0.72 \rightarrow 0.77$).
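As a reading aid, here is a minimal sketch of the exploration loop the abstract describes: collect an on-policy trajectory, ask a hindsight LLM for a counterfactual action at some turn, and roll out from that action to broaden the critic's state-action coverage. Every function name and interface below is an assumption for illustration, not the paper's actual implementation.

```python
# Minimal sketch of hindsight off-policy exploration, assuming hypothetical
# interfaces (rollout, policy, counterfactual); the paper's actual method
# details are not specified on this page.
import random
from typing import Callable, List, Tuple

Action = str
State = str
Trajectory = List[Tuple[State, Action, float]]  # (state, action, reward) per turn


def hope_explore(
    rollout: Callable[[State, Action], Trajectory],       # env rollout from a state-action pair
    policy: Callable[[State], Action],                    # current actor
    counterfactual: Callable[[Trajectory, int], Action],  # hindsight LLM: alternative action at turn t
    initial_state: State,
    replay_buffer: List[Trajectory],
) -> None:
    """One round of hindsight-guided exploration (sketch)."""
    # 1. Collect an on-policy trajectory with the current actor.
    base_traj = rollout(initial_state, policy(initial_state))
    replay_buffer.append(base_traj)

    # 2. In hindsight, pick a turn and ask the LLM for a counterfactual
    #    action that could have led to a higher return.
    t = random.randrange(len(base_traj))
    state_t, _, _ = base_traj[t]
    alt_action = counterfactual(base_traj, t)

    # 3. Roll out from the counterfactual action; the resulting off-policy
    #    trajectory broadens the critic's state-action coverage.
    cf_traj = base_traj[:t] + rollout(state_t, alt_action)
    replay_buffer.append(cf_traj)
```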
Serve As Reviewer: ~Huaxiaoyue_Wang1, ~Sanjiban_Choudhury3
Submission Number: 71