TL;DR: We propose an exploration technique that meaningfully improves the quality of online experience collection for the reinforcement learning finetuning of generative control policies, leading to increases in finetuning sample efficiency.
Abstract: A natural recipe for intelligent robotic decision-making is initializing from pretrained generative control policies, which have summarized offline experience, and adapting them to self-collected online experience. We present DF-ExpEnse, an exploration technique that improves the quality of online experience collection, thus increasing finetuning sample efficiency. DF-ExpEnse leverages the multimodal modeling capabilities of the generative control policy to create an expressive candidate set of actions that remains tractable to evaluate. It then uses an ensemble of critics to identify the action that best balances quality with high exploration interest. In fleet settings, DF-ExpEnse further enables cross-agent communication to facilitate collaborative exploration as a group. DF-ExpEnse can be seamlessly integrated with existing strategies that finetune pretrained generative control policies via reinforcement learning. We experimentally validate consistent sample-efficiency benefits from DF-ExpEnse across a variety of manipulation and locomotion tasks, compared to default finetuning and alternative action selection schemes. The project page can be found at [df-expense.github.io](https://df-expense.github.io).
Lay Summary: Just as humans can continuously refine their behavior over multiple trials at a particular task of interest, we wish to design intelligent decision-making agents that can dynamically adjust their behaviors and beliefs in response to feedback on self-collected experience. Central to this goal is exploration: by attempting new or uncertain actions, an agent can collect valuable experience from which to quickly learn how best to interact with the environment.
In this work, we propose an exploration strategy for a decision-making agent that has some initial priors over how to behave. First, the agent generates a small yet diverse set of actions that it considers reasonable to execute given its current situation. Then, it quantitatively evaluates the exploration interest of each candidate action by weighing the value it believes the candidate to have against how uncertain it is about that value. Our technique also enables individual agents to collaborate with other agents in parallel, such that exploration can be performed as a group. Thus, our method enables collaborative exploration that can quickly update an agent towards proficient decision-making on tasks of interest.
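The two-step selection described above (propose candidates, then score them with an ensemble of critics) can be illustrated with a minimal sketch. The summary does not state the exact scoring rule, so the sketch assumes a common optimism-style score, the ensemble mean plus `beta` times the ensemble standard deviation. The function name `select_exploratory_action`, the parameter `beta`, and the array layout are hypothetical and are not taken from the paper.

```python
import numpy as np


def select_exploratory_action(candidates, q_values, beta=1.0):
    """Pick one action from a candidate set using an ensemble of critics.

    candidates: array of shape (K, action_dim), e.g. K samples drawn from
        the pretrained generative policy (sampling itself is not shown).
    q_values: array of shape (E, K); entry [e, k] is critic e's value
        estimate for candidate k.
    beta: ASSUMED trade-off weight between estimated value and
        ensemble disagreement; beta = 0 reduces to greedy selection.
    """
    candidates = np.asarray(candidates, dtype=float)
    q_values = np.asarray(q_values, dtype=float)

    mean_q = q_values.mean(axis=0)  # believed value of each candidate
    std_q = q_values.std(axis=0)    # disagreement as an uncertainty proxy

    # ASSUMED scoring rule (mean + beta * std); the paper may use another.
    scores = mean_q + beta * std_q
    best = int(np.argmax(scores))
    return candidates[best], best, scores
```

With `beta = 0` the rule picks the candidate with the highest mean estimate; larger values of `beta` favor candidates on which the critics disagree, which is one way to express "exploration interest".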
Exploration techniques such as the one proposed in this work can help agents learn from their own self-collected experience in a more sample-efficient way, resulting in more adaptable decision-making agents. Although we validate our idea on agents in robotic decision-making settings, the ideas proposed here can be extended to advance general self-improvement strategies for artificially intelligent agents.
Primary Area: Reinforcement Learning->Everything Else
Keywords: Exploration, Reinforcement Learning, Robotic Fleets, Fleet Learning
Submission Number: 24073