Keywords: online policy learning, behavior prior distillation, bidirectional knowledge flow
TL;DR: We propose a Bidirectional Behavior Prior Distillation (B2PD) algorit
Abstract: Existing behavior prior reinforcement learning (BPRL) algorithms predominantly rely on offline pre-training, where a behavior cloning model is learned from offline datasets and the resulting policy priors are used to guide the online fine-tuning of the agent. However, the limited quality of offline datasets often prevents these priors from providing high-value policies that can effectively guide policy updates. The absence of expert trajectories significantly impairs online policy learning, leading to low sample efficiency and suboptimal performance. To address these challenges, we depart from conventional behavior prior approaches and propose a Bidirectional Behavior Prior Distillation (B2PD) algorithm. B2PD leverages action-value priors to guide a conditional variational autoencoder (CVAE) in generating a high-value behavior support set. The resulting expert behavior priors are then distilled into the agent, reducing inefficient exploration and enabling stable policy optimization, while establishing a bidirectional knowledge flow mechanism. Experimental results across both state- and pixel-based environments demonstrate that B2PD significantly improves both sample efficiency and overall performance.
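As a rough illustration of the idea of a high-value behavior support set, the following toy sketch samples candidate actions from a stand-in generator, keeps those ranked highest by an action-value function, and nudges a policy output toward them. It is not the B2PD implementation: the abstract does not specify the architecture or losses, so every name here (`q_value`, `sample_candidates`, `high_value_support`, `distill_step`), the one-dimensional action space, and the top-k filtering rule are assumptions made for illustration. The reverse direction of the bidirectional knowledge flow is not modeled.

```python
import random

def q_value(state, action):
    # Hypothetical action-value prior for a 1-D toy problem:
    # the value peaks when the action equals the state.
    return -(action - state) ** 2

def sample_candidates(state, n, rng):
    # Stand-in for a CVAE decoder (assumption): draw a latent z ~ N(0, 1)
    # and map it to an action conditioned on the state.
    return [state + 2.0 * rng.gauss(0.0, 1.0) for _ in range(n)]

def high_value_support(state, n=64, k=8, seed=0):
    # Keep the k candidates with the highest Q-value as the support set.
    rng = random.Random(seed)
    candidates = sample_candidates(state, n, rng)
    ranked = sorted(candidates, key=lambda a: q_value(state, a), reverse=True)
    return ranked[:k]

def distill_step(policy_action, support, lr=0.5):
    # Toy distillation: move the policy's action toward the mean of the
    # support set (a real agent would minimize a distillation loss instead).
    target = sum(support) / len(support)
    return policy_action + lr * (target - policy_action)

support = high_value_support(state=1.0)
new_action = distill_step(policy_action=0.0, support=support)
```

In this toy setting the filtered support set concentrates around the high-value region, so the distilled action moves toward it without the policy having to explore low-value actions itself.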
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 6051