Programmatic Reinforcement Learning without Oracles

Wenjie Qiu; He Zhu

Programmatic Reinforcement Learning without Oracles

Wenjie Qiu, He Zhu

Published: 28 Jan 2022, Last Modified: 13 Feb 2023ICLR 2022 SpotlightReaders: Everyone

Keywords: Reinforcement Learning, Programmatic Reinforcement Learning, Compositional Reinforcement Learning, Program Synthesis, Differentiable Architecture Search

Abstract: Deep reinforcement learning (RL) has led to encouraging successes in many challenging control tasks. However, a deep RL model lacks interpretability due to the difficulty of identifying how the model's control logic relates to its network structure. Programmatic policies structured in more interpretable representations emerge as a promising solution. Yet two shortcomings remain: First, synthesizing programmatic policies requires optimizing over the discrete and non-differentiable search space of program architectures. Previous works are suboptimal because they only enumerate program architectures greedily guided by a pretrained RL oracle. Second, these works do not exploit compositionality, an important programming concept, to reuse and compose primitive functions to form a complex function for new tasks. Our first contribution is a programmatically interpretable RL framework that conducts program architecture search on top of a continuous relaxation of the architecture space defined by programming language grammar rules. Our algorithm allows policy architectures to be learned with policy parameters via bilevel optimization using efficient policy-gradient methods, and thus does not require a pretrained oracle. Our second contribution is improving programmatic policies to support compositionality by integrating primitive functions learned to grasp task-agnostic skills as a composite program to solve novel RL problems. Experiment results demonstrate that our algorithm excels in discovering optimal programmatic policies that are highly interpretable. The code of this work is available at https://github.com/RU-Automated-Reasoning-Group/pi-PRL.

One-sentence Summary: We present a differentiable program architecture search framework to synthesize interpretable, generalizable, and compositional programs for controlling reinforcement learning applications.

19 Replies

Loading