Fine-tuning Behavioral Cloning Policies with Preference-Based Reinforcement Learning

Published: 01 Jul 2025, Last Modified: 01 Jul 2025
Venue: RLBrew: Ingredients for Developing Generalist Agents workshop (RLC 2025)
License: CC BY 4.0
Keywords: Behavioral Cloning, Preference-Based Reinforcement Learning, Reinforcement Learning
TL;DR: We provide theoretical guarantees for a novel two-stage reinforcement learning method that first learns an estimate of the optimal policy from an offline expert dataset and then refines that estimate via online preference-based human feedback.
Abstract: Deploying reinforcement-learning (RL) controllers in robotics, industry, and health care is blocked by two coupled obstacles: reward misspecification (informal goals are hard to encode as a safe numeric signal) and data-hungry exploration. We tackle these issues with a two-stage framework that starts from a reward-free dataset of expert demonstrations and refines the policy online using preference-based human feedback. We give the first principled analysis of this two-stage paradigm: we formulate a unified algorithm that (i) clones demonstrations offline to obtain a safe warm-start policy and (ii) fine-tunes it online with preference-based RL, integrating the two signals through an uncertainty-weighted objective. We then derive regret bounds that shrink with the number of demonstrations, reflecting the reduced uncertainty.
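For readers unfamiliar with the two-stage paradigm described in the abstract, the sketch below is a minimal, illustrative rendering in PyTorch of its generic ingredients: a behavioral-cloning warm-start, a preference (reward) model trained from pairwise comparisons, and an uncertainty-weighted combination of the two signals. It is not the paper's algorithm; the function names, the Bradley-Terry preference loss, and the inverse-variance weighting are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def behavioral_cloning(policy, expert_loader, optimizer, epochs=10):
    """Stage 1 (sketch): maximize log-likelihood of expert actions on a
    reward-free demonstration dataset to obtain a warm-start policy."""
    for _ in range(epochs):
        for states, actions in expert_loader:
            logits = policy(states)                    # action logits, shape (B, |A|)
            loss = F.cross_entropy(logits, actions)    # negative log-likelihood of expert actions
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy


def preference_loss(reward_model, seg_a, seg_b, prefs):
    """Stage 2 (sketch): fit a reward model from pairwise human preferences.
    prefs[i] = 1 if segment A is preferred over segment B, else 0.
    Assumes reward_model returns per-step rewards of shape (B, T)."""
    ret_a = reward_model(seg_a).sum(dim=1)   # predicted return of segment A
    ret_b = reward_model(seg_b).sum(dim=1)   # predicted return of segment B
    # Bradley-Terry model: P(A preferred over B) = sigmoid(ret_a - ret_b)
    return F.binary_cross_entropy_with_logits(ret_a - ret_b, prefs.float())


def uncertainty_weighted_objective(bc_loss, pref_rl_loss, bc_var, pref_var, eps=1e-8):
    """Illustrative uncertainty-weighted combination of the offline (cloning)
    and online (preference-based) signals, using inverse-variance weights."""
    w_bc = 1.0 / (bc_var + eps)
    w_rl = 1.0 / (pref_var + eps)
    return (w_bc * bc_loss + w_rl * pref_rl_loss) / (w_bc + w_rl)
```

Under these assumptions, the warm-start policy from `behavioral_cloning` would be fine-tuned against the learned preference reward, with `uncertainty_weighted_objective` down-weighting whichever signal is currently noisier.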
Submission Number: 28