FLORA: Generalizable Motion-Flow-Based Reward Shaping for Scalable Real-World Robot Learning

20 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Reward Shaping; Reinforcement Learning; Real-World RL; Reward Model; LLM; Motion Flow; PBRS
TL;DR: FLORA is a generalizable motion-flow-based reward shaping framework for scalable real-world robot learning that requires only a few demonstrations while guaranteeing optimal policy invariance.
Abstract: Reward design is a long-standing challenge in Reinforcement Learning (RL) for robotics, particularly when scaling to real-world robot learning tasks. Existing reward design approaches in real-world RL rely either on sparse rewards, which provide little feedback and commonly lead to inefficient learning, or on pre-trained vision-based reward models, which typically lack theoretical guarantees and often fail to generalize to new tasks. To address these challenges, we introduce $\textbf{F}$low-based $\textbf{L}$anguage-driven $\textbf{O}$ffline $\textbf{R}$eward $\textbf{A}$daptation ($\textbf{FLORA}$), a framework that combines strong generalization capability with a theoretical guarantee of optimal policy invariance. FLORA uses large language models (LLMs) to automatically generate analytical reward functions for new tasks, leveraging their generalization ability across diverse tasks. Unlike end-to-end neural reward models, these analytical reward functions encode task-relevant priors, enabling efficient few-shot adaptation. With only $\textbf{3–5}$ demonstrations, our offline reward improvement procedure optimizes both the structure and the parameters of the reward functions, producing reliable reward signals for new tasks. To operate directly on raw visual inputs and eliminate reliance on privileged states, we extract motion flows from images as inputs to the analytical reward functions. Furthermore, we propose a PBRS-Milestone reward shaping structure that reformulates the reward signals, improving practicality while preserving the optimal policy invariance guarantee. Extensive experiments show that FLORA enables sample-efficient RL on new tasks, outperforming strong baselines by more than $\textbf{2×}$ in simulation, and solving complex real-world manipulation tasks in $\textbf{$\sim$20 minutes}$, where existing baselines fail even after $\textbf{60}$ minutes of training. These results establish our method as a critical step toward scalable real-world robot learning.
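For readers unfamiliar with the invariance claim, the abstract does not spell out the shaping form; a minimal sketch, assuming the standard potential-based reward shaping (PBRS) formulation of Ng et al. (1999) that the PBRS acronym refers to, is shown below. Here $r$, $\Phi$, and $\gamma$ are generic symbols (base reward, potential function, discount factor), not the paper's own notation:
$$
\tilde{r}(s, a, s') \;=\; r(s, a, s') \;+\; \gamma\,\Phi(s') \;-\; \Phi(s)
$$
Any shaping term of this potential-based form leaves the set of optimal policies of the underlying MDP unchanged, which is the optimal policy invariance guarantee the abstract invokes for the PBRS-Milestone structure.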
Primary Area: reinforcement learning
Submission Number: 24618