Track: Language Modeling
Keywords: Large Language Models, Multi-Turn Reinforcement Learning, Exploration in RL, Actor-Critic RL
TL;DR: HOPE (Hindsight Off-Policy Exploration) guides the exploration of LLM agents in multi-turn RL through hindsight reasoning.
Abstract: Multi-turn reinforcement learning provides a principled framework for training LLM agents, but exploration remains a key bottleneck.
Classical exploration strategies such as $\epsilon$-greedy and upper confidence bounds perturb individual actions, failing to efficiently explore the combinatorial space of multi-turn token sequences.
Our key insight is that LLMs can use hindsight to guide exploration: they can analyze completed trajectories and propose counterfactual actions that could have led to higher returns.
We propose HOPE (Hindsight Off-Policy Exploration), which integrates hindsight-guided exploration into both the actor and critic stages of multi-turn RL.
HOPE improves the critic's state-action coverage by generating rollouts from counterfactual actions, and steers the actor's exploration in RL by using a learned counterfactual generator to propose alternative actions.
Experimental results show that HOPE outperforms strong multi-turn RL baselines on the task-oriented dialogue tasks TwentyQuestions (success: $0.82 \rightarrow 0.97$) and GuessMyCity (success: $0.68 \rightarrow 0.75$), and on the tool-use dialogue task CarDealer (success: $0.72 \rightarrow 0.77$).
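As a reading aid, here is a minimal sketch of the exploration loop the abstract describes: collect an on-policy trajectory, ask a hindsight LLM for a counterfactual action at some turn, and roll out from that action to broaden the critic's state-action coverage. Every function name and interface below is an assumption for illustration, not the paper's actual implementation.

```python
# Minimal sketch of hindsight off-policy exploration, assuming hypothetical
# interfaces (rollout, policy, counterfactual); the paper's actual method
# details are not specified on this page.
import random
from typing import Callable, List, Tuple

Action = str
State = str
Trajectory = List[Tuple[State, Action, float]]  # (state, action, reward) per turn


def hope_explore(
    rollout: Callable[[State, Action], Trajectory],       # env rollout from a state-action pair
    policy: Callable[[State], Action],                    # current actor
    counterfactual: Callable[[Trajectory, int], Action],  # hindsight LLM: alternative action at turn t
    initial_state: State,
    replay_buffer: List[Trajectory],
) -> None:
    """One round of hindsight-guided exploration (sketch)."""
    # 1. Collect an on-policy trajectory with the current actor.
    base_traj = rollout(initial_state, policy(initial_state))
    replay_buffer.append(base_traj)

    # 2. In hindsight, pick a turn and ask the LLM for a counterfactual
    #    action that could have led to a higher return.
    t = random.randrange(len(base_traj))
    state_t, _, _ = base_traj[t]
    alt_action = counterfactual(base_traj, t)

    # 3. Roll out from the counterfactual action; the resulting off-policy
    #    trajectory broadens the critic's state-action coverage.
    cf_traj = base_traj[:t] + rollout(state_t, alt_action)
    replay_buffer.append(cf_traj)
```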
Serve As Reviewer: ~Huaxiaoyue_Wang1, ~Sanjiban_Choudhury3
Submission Number: 71