XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies

Published: 25 May 2026, Last Modified: 27 May 2026
Venue: DEMO 2026 Poster
License: CC BY 4.0
Keywords: reinforcement learning from demonstrations, behavioral cloning, offline-to-online
TL;DR: We perform off-policy RL from demonstrations with BC initialization WITHOUT unlearning the BC policy!
Abstract: A common practice in robotic reinforcement learning is to incorporate additional data to improve sample efficiency. Expert demonstration data is often crucial for solving hard exploration tasks with sparse rewards. While prior data is used to augment experience and pre-train models, we show that existing algorithms fall short of the sample efficiency possible in this setting because they do not use pre-trained policies effectively. We propose XQCfD, which extends the sample-efficient XQC actor-critic to learn from demonstrations using augmented replay buffers, pre-trained policies and stationary policy architectures. Unlike prior works, the algorithm is designed to avoid rapidly 'unlearning' the strong initial policy. We show that our stationary network architecture enables better out-of-distribution policy improvement than standard network architectures due to its higher-entropy predictions. XQCfD achieves state-of-the-art performance across a range of complex manipulation tasks with sparse rewards from the popular Adroit, Robomimic and MimicGen benchmarks, notably with a low update-to-data ratio and no ensemble networks.
Submission Number: 144