Keywords: LLM-based agent, learn to explore, reinforcement learning
TL;DR: We propose a framework that enables agents to explore an environment's implicit rules at test time, and train a specialized thinker to improve their performance.
Abstract: With the continuous advancement of Large Language Models (LLMs), intelligent agents are becoming increasingly vital. However, these agents often fail in environments governed by implicit rules—hidden constraints that cannot be observed directly and must be inferred through interaction.
This causes agents to fall into repetitive trial-and-error loops, ultimately leading to task failure.
To address this challenge, we propose **Test-Time Exploration (TTExplore)**, a framework in which a thinker component analyzes interaction history to infer these implicit rules and guide an actor. Because training a thinker is difficult under sparse task rewards, we introduce a novel training pipeline for stable reinforcement learning that incorporates techniques such as task decomposition and difficulty filtering. Using this pipeline, we train a specialized 7B model, **Exp-Thinker**. Evaluated on five text-based embodied tasks, TTExplore with our trained Exp-Thinker significantly improves baseline agent scores by an average of $14$-$19$ points, demonstrating the effectiveness of explicitly reasoning about implicit rules.
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 9201