Keywords: Reinforcement learning, Vision-language-action models, Exploration
Abstract: Reinforcement learning has become the dominant post-training paradigm for foundation models, but self-improvement is bottlenecked by exploration. Standard action-space perturbations induce only local exploration, whereas many tasks demand strategies globally different from the current policy — especially when fine-tuning generalist policies on hard exploration tasks with near-zero initial success. We propose perturbing task descriptions instead of actions: prompt perturbation changes what task the model is told to do, steering it toward qualitatively distinct strategies. We cast this as posterior sampling in prompt space, where a distribution over prompts implicitly defines a distribution over policies through the pretrained language prior. To update this distribution without gradient training, we use a vision-language model as both a sampler of plausible prompts and a reasoner that shifts probability toward prompts that elicit success, given observed trajectories. We call the resulting algorithm Prompt-Driven Exploration (PDE). On hard exploration tasks with near-zero initial success, PDE attains higher success rates with far fewer environment interactions than action-space exploration.
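To make the idea of posterior sampling in prompt space concrete, the following is a minimal sketch and not the PDE algorithm itself. It swaps the vision-language model for a much simpler stand-in: Beta-Bernoulli Thompson sampling over a fixed, hand-written list of candidate prompts. The names `thompson_prompt_search`, `toy_rollout`, and `TOY_SUCCESS`, along with the prompts and success probabilities, are all hypothetical and chosen only for illustration; in the setting the abstract describes, the rollout would run the prompted policy in the environment, and a VLM would both propose prompts and update the distribution from observed trajectories.

```python
import random

# Hypothetical per-prompt success probabilities for a toy environment.
# In a real system these are unknown and only observed through rollouts.
TOY_SUCCESS = {
    "pick up the block": 0.02,
    "push the block to the edge, then grasp it": 0.60,
    "tilt the tray so the block slides out": 0.10,
}


def toy_rollout(prompt, rng):
    """Stand-in for running the prompted policy once; returns True on success."""
    return rng.random() < TOY_SUCCESS[prompt]


def thompson_prompt_search(prompts, rollout, n_rounds, seed=0):
    """Beta-Bernoulli Thompson sampling over a fixed list of candidate prompts.

    This is an assumed simplification: a Beta posterior per prompt replaces
    the VLM-based sampler and reasoner described in the abstract.
    """
    rng = random.Random(seed)
    # Beta(1, 1) (uniform) prior on each prompt's success probability.
    alpha = [1.0] * len(prompts)
    beta = [1.0] * len(prompts)
    for _ in range(n_rounds):
        # Draw one plausible success rate per prompt from its posterior,
        # then act with the prompt whose draw is highest.
        draws = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
        i = max(range(len(prompts)), key=draws.__getitem__)
        # Update the chosen prompt's posterior with the observed outcome.
        if rollout(prompts[i], rng):
            alpha[i] += 1.0
        else:
            beta[i] += 1.0
    # Posterior mean success estimate for each prompt.
    means = [a / (a + b) for a, b in zip(alpha, beta)]
    return means, alpha, beta


if __name__ == "__main__":
    prompts = list(TOY_SUCCESS)
    means, alpha, beta = thompson_prompt_search(prompts, toy_rollout, n_rounds=200)
    for prompt, mean in zip(prompts, means):
        print(f"{mean:.2f}  {prompt}")
```

The design point this illustrates is that exploration happens over which task description the policy receives, so probability mass can shift toward a qualitatively different strategy after only a handful of rollouts, rather than through small perturbations around the current policy's actions.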
Submission Number: 49