Residual Off-Policy RL for Finetuning Behavior Cloning Policies

Lars Ankile; Zhenyu Jiang; Rocky Duan; Guanya Shi; Pieter Abbeel; Anusha Nagabandi

Published: 05 Mar 2026, Last Modified: 17 Mar 2026

Venue: ICLR 2026 Workshop RSI Poster

Keywords: reinforcement learning, behavior cloning, real-world RL, residual learning, robot manipulation, sample efficiency

TL;DR: We finetune behavior cloning policies with off-policy residual RL, demonstrating real-world RL on a dexterous bimanual humanoid using only sparse rewards

Abstract: Recent advances in behavior cloning (BC) have enabled impressive visuomotor control policies. However, these approaches are limited by the quality of human demonstrations, the manual effort required for data collection, and the diminishing returns from offline data. In comparison, reinforcement learning (RL) trains an agent through autonomous interaction with the environment and has shown remarkable success in various domains. Still, training RL policies directly on real-world robots remains challenging due to sample inefficiency, safety concerns, and the difficulty of learning from sparse rewards for long-horizon tasks, especially for high-degree-of-freedom (DoF) systems. We present a recipe that combines the benefits of BC and RL through a residual learning framework. Our approach treats BC policies as black-box bases and learns lightweight per-step residual corrections via sample-efficient off-policy RL. We show that our method requires only sparse binary reward signals and can effectively improve manipulation policies on high-DoF systems in both simulation and the real world. In particular, we report what is, to the best of our knowledge, the first successful real-world RL training on a humanoid robot with dexterous hands. Our results show state-of-the-art performance in various vision-based tasks, pointing towards a practical pathway for deploying RL in the real world.
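
The residual idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `compose_action` and `sparse_binary_reward`, the `residual_scale` factor, the action bounds, and the choice to pass the base action to the residual policy are all assumptions made for illustration. The sketch only shows a frozen, black-box base policy whose output is adjusted by a small per-step correction, together with a binary success reward.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def compose_action(
    obs: Sequence[float],
    base_policy: Callable[[Sequence[float]], Sequence[float]],
    residual_policy: Callable[[Sequence[float], Sequence[float]], Sequence[float]],
    residual_scale: float = 0.1,   # assumed: keeps corrections small
    action_low: float = -1.0,      # assumed action bounds
    action_high: float = 1.0,
) -> Vector:
    """Return the base action plus a scaled per-step residual, clipped to bounds."""
    # The base policy is only queried, never updated (black-box use).
    base_action = list(base_policy(obs))
    # Assumption: the residual policy sees both the observation and the base action.
    correction = list(residual_policy(obs, base_action))
    if len(correction) != len(base_action):
        raise ValueError("residual and base action dimensions differ")
    combined = [b + residual_scale * c for b, c in zip(base_action, correction)]
    return [min(max(a, action_low), action_high) for a in combined]


def sparse_binary_reward(task_succeeded: bool) -> float:
    """Sparse binary reward: 1.0 on task success, 0.0 otherwise."""
    return 1.0 if task_succeeded else 0.0


# Toy stand-ins for the learned policies (hypothetical, for demonstration only).
def toy_base_policy(obs: Sequence[float]) -> Vector:
    return [0.5, -0.95]


def toy_residual_policy(obs: Sequence[float], base_action: Sequence[float]) -> Vector:
    return [1.0, -1.0]


def zero_residual_policy(obs: Sequence[float], base_action: Sequence[float]) -> Vector:
    return [0.0 for _ in base_action]


corrected = compose_action([0.0], toy_base_policy, toy_residual_policy)
unchanged = compose_action([0.0], toy_base_policy, zero_residual_policy)
```

With a zero residual the composed action equals the base action, so under these assumptions a freshly initialized residual that outputs near-zero corrections would start from the BC policy's behavior. In an off-policy setup, transitions of observation, composed action, and sparse reward would be stored in a replay buffer and used to update only the residual policy.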

Submission Number: 92