Subgoal-Guided Reward Shaping: Improving Preference-Based Offline Reinforcement Learning via Conditional VAEs
Keywords: Preference-based reinforcement learning, Reinforcement learning
Abstract: Offline preference-based reinforcement learning (PbRL) learns complex behaviors from human feedback without environment interaction, but it suffers from reward model extrapolation errors when the policy encounters out-of-distribution regions during optimization. These errors arise from distributional shift between preference-labeled training trajectories and unlabeled inference data, leading to reward misestimation and suboptimal policies. We introduce SPOT (Subgoal-based Preference Optimization Through Attention Weight), which mitigates extrapolation errors by leveraging attention-derived subgoals extracted from preference data. SPOT regularizes the policy toward subgoals observed in preferred trajectories, constraining learning to the training distribution and thereby reducing reward model extrapolation errors. Through comprehensive experiments, we demonstrate that our subgoal-guided approach outperforms existing methods while reducing extrapolation errors. Our approach preserves fine-grained credit assignment information while improving query efficiency, suggesting promising directions for reliable and practical offline preference-based learning.
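The abstract gives no implementation details, so the following is only a minimal, hypothetical sketch of how a subgoal-guided regularizer of this kind might be structured, assuming that (a) subgoals are states receiving high attention weight from the learned reward model within preferred trajectories, (b) a conditional VAE models subgoal states given the current state, and (c) the policy objective adds an attention-weighted regularization term. All module, function, and variable names below are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SubgoalCVAE(nn.Module):
    """Conditional VAE that proposes subgoal states given the current state."""

    def __init__(self, state_dim, latent_dim=16, hidden=256):
        super().__init__()
        self.latent_dim = latent_dim
        self.encoder = nn.Sequential(
            nn.Linear(state_dim * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim * 2),
        )
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, subgoal):
        # Encode (state, subgoal) pairs drawn from preferred trajectories.
        mu, log_std = self.encoder(torch.cat([state, subgoal], dim=-1)).chunk(2, dim=-1)
        z = mu + log_std.exp() * torch.randn_like(mu)  # reparameterization trick
        recon = self.decoder(torch.cat([state, z], dim=-1))
        kl = 0.5 * (mu.pow(2) + (2 * log_std).exp() - 2 * log_std - 1).sum(-1).mean()
        return recon, kl

    @torch.no_grad()
    def sample(self, state):
        # Draw a candidate subgoal for each current state.
        z = torch.randn(state.shape[0], self.latent_dim, device=state.device)
        return self.decoder(torch.cat([state, z], dim=-1))


def cvae_loss(cvae, state, subgoal, beta=0.5):
    """Train the CVAE to reconstruct high-attention subgoal states."""
    recon, kl = cvae(state, subgoal)
    return F.mse_loss(recon, subgoal) + beta * kl


def subgoal_regularized_actor_loss(actor, base_actor_loss, state, pref_action,
                                   attn_weight, lam=1.0):
    """Hypothetical actor objective: any base offline RL loss plus an
    attention-weighted term pulling the policy toward actions taken in
    preferred trajectories near subgoal states."""
    pred_action = actor(state)  # deterministic actor assumed for simplicity
    bc_term = (attn_weight * (pred_action - pref_action).pow(2).sum(-1)).mean()
    return base_actor_loss + lam * bc_term
```

Here `base_actor_loss` stands in for an off-the-shelf offline RL actor objective, and `attn_weight` denotes normalized per-timestep attention weights from the learned reward model; both are assumptions about how the described components could fit together, not a description of the authors' method.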
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 16068