Optimism via Intrinsic Rewards: Scalable and Principled Exploration for Model-based Reinforcement Learning
Track: full paper
Keywords: Reinforcement Learning (RL) Theory, Deep RL, Regret bounds for RL, Robotics
TL;DR: A simple, scalable, and efficient RL algorithm with regret bounds for general RL settings, evaluated on state-based and visual-control tasks and in hardware experiments.
Abstract: We address the challenge of efficient exploration in model-based reinforcement learning (MBRL), where the system dynamics are unknown and the RL agent must learn directly from online interactions. We propose **O**ptimistic-**MBRL** (OMBRL), an approach based on the principle of optimism in the face of uncertainty. OMBRL learns an uncertainty-aware dynamics model and greedily maximizes a weighted sum of the extrinsic reward and the agent's epistemic uncertainty. Under common regularity assumptions on the system, we show that OMBRL has sublinear regret for nonlinear dynamics in the (i) finite-horizon, (ii) discounted infinite-horizon, and (iii) non-episodic settings. Additionally, OMBRL offers a flexible and scalable solution for principled exploration. We evaluate OMBRL on state-based and visual-control environments, where it performs favorably against the baselines across all tasks. In hardware experiments on a dynamic RC car, OMBRL outperforms the state-of-the-art, illustrating the benefits of principled exploration for MBRL.
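The objective described in the abstract, a weighted sum of the extrinsic reward and the agent's epistemic uncertainty, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the uncertainty is approximated by disagreement among an ensemble of dynamics models, and the names `epistemic_uncertainty`, `optimistic_reward`, and the weight `lam` are hypothetical, introduced only for illustration.

```python
import math
from typing import Sequence


def epistemic_uncertainty(predictions: Sequence[Sequence[float]]) -> float:
    """Proxy for epistemic uncertainty (an assumption, not the paper's estimator).

    `predictions` holds one next-state prediction per ensemble member.
    Returns the square root of the summed per-dimension population variance,
    which is zero when all members agree.
    """
    n = len(predictions)
    dim = len(predictions[0])
    total_var = 0.0
    for d in range(dim):
        values = [p[d] for p in predictions]
        mean = sum(values) / n
        total_var += sum((v - mean) ** 2 for v in values) / n
    return math.sqrt(total_var)


def optimistic_reward(
    extrinsic: float,
    predictions: Sequence[Sequence[float]],
    lam: float = 1.0,
) -> float:
    """Extrinsic reward plus a weighted uncertainty bonus.

    `lam` is a hypothetical weight trading off exploitation and exploration.
    """
    return extrinsic + lam * epistemic_uncertainty(predictions)


# Two ensemble members that agree yield no exploration bonus.
agree = optimistic_reward(1.0, [[0.0, 0.0], [0.0, 0.0]], lam=2.0)
# Two members that disagree yield a bonus proportional to their spread.
disagree = optimistic_reward(0.5, [[0.0], [2.0]], lam=2.0)
```

Under these assumptions, a planner or policy optimizer would maximize `optimistic_reward` instead of the extrinsic reward alone, so that state-action pairs where the learned model is uncertain look more attractive until data resolves the disagreement.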
Supplementary Material: zip
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Bhavya_Sukhija1
Format: Yes, the presenting author will definitely attend in person because they are attending ICLR for other complementary reasons.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding availability would significantly influence their ability to attend the workshop in person.
Submission Number: 39