Coherent Off-Policy Improvement of Large Behaviour Models with Learned Rewards

Published: 08 May 2026 · Last Modified: 11 May 2026 · ICRA 2026 Workshop RL4IL Poster · CC BY 4.0
Keywords: inverse reinforcement learning, behavioural cloning, large behavioural models
TL;DR: Use IRL to fine-tune VLAs
Abstract: Distilling expert demonstration data into large generative models via behavioural cloning (BC) is a scalable approach to learning capable policies for robotic control, particularly for dexterous manipulation. Reinforcement learning (RL) can then be used to fine-tune these policies with additional experience. An open question is whether RL is more sample-efficient than collecting more human demonstrations. Prior work has fine-tuned large pre-trained policies in a scalable fashion by applying RL to a smaller residual policy that corrects the pre-trained model. However, on typical sparse-reward tasks, RL algorithms can struggle to optimize the behaviour in a sample-efficient manner. We instead turn to inverse reinforcement learning (IRL), where a dense reward function is learned from the expert demonstrations, potentially easing RL fine-tuning. We specifically consider coherent imitation learning, an IRL method that facilitates improvement over the BC policy by using a specific reward formulation with theoretical guarantees. We show that our IRL method improves the performance of $\pi_{0.5}$ on all 6 sparse-reward manipulation tasks and achieves a $\geq 90\%$ success rate on 5 out of 6 complex manipulation tasks, outperforming RL baselines that use sparse rewards. By ensuring that the pre-trained policy is optimal with respect to our initial reward and critic, our method circumvents the initial performance drop commonly seen in RL fine-tuning and enables faster improvement.
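
The abstract does not spell out the coherent reward, so the sketch below is only a rough illustration of the general idea, assuming a coherent-soft-imitation-learning-style log-ratio reward $r(s,a) = \alpha\,(\log \pi_{\mathrm{BC}}(a\mid s) - \log \pi_{\mathrm{prior}}(a\mid s))$, under which the BC policy is already soft-optimal before any RL fine-tuning. The `GaussianPolicy` class, network sizes, and scale `alpha` are illustrative assumptions, not the authors' implementation; the actual $\pi_{0.5}$ model is a large vision-language-action policy, not a small Gaussian MLP.

```python
# Minimal sketch (not the authors' code): a "coherent" dense reward computed
# as the scaled log-ratio between a BC-fitted policy and a reference prior.
# Because the BC policy maximizes this reward by construction, fine-tuning
# starts from an (approximately) optimal policy and avoids the initial drop.
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy head over continuous actions (illustrative)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * act_dim),
        )

    def dist(self, obs: torch.Tensor) -> torch.distributions.Normal:
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5.0, 2.0).exp())


def coherent_reward(pi_bc: GaussianPolicy,
                    pi_prior: GaussianPolicy,
                    obs: torch.Tensor,
                    act: torch.Tensor,
                    alpha: float = 1.0) -> torch.Tensor:
    """Dense reward r(s, a) = alpha * (log pi_bc(a|s) - log pi_prior(a|s))."""
    with torch.no_grad():
        logp_bc = pi_bc.dist(obs).log_prob(act).sum(-1)
        logp_prior = pi_prior.dist(obs).log_prob(act).sum(-1)
    return alpha * (logp_bc - logp_prior)


# Usage: score a batch of (observation, action) pairs collected online,
# then feed these rewards to any off-policy actor-critic update.
obs = torch.randn(32, 10)
act = torch.randn(32, 4)
pi_bc, pi_prior = GaussianPolicy(10, 4), GaussianPolicy(10, 4)
rewards = coherent_reward(pi_bc, pi_prior, obs, act)  # shape: (32,)
```

In this sketch the reward is dense (defined at every state-action pair) rather than sparse task success, which is the property the abstract argues makes RL fine-tuning more sample-efficient.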
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 9