Improving Reward Model Generalization from Adversarial Process Enhanced Preferences

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: In sequential decision-making, the reward function serves as the primary supervision signal, guiding agents to acquire the desired behaviors. Traditional reward modeling methods rely heavily on human expertise, limiting their scalability. Automated preference generation from suboptimal demonstrations has emerged as a promising alternative to address this limitation: preference data are first generated from suboptimal demonstrations, and reward models are then trained on these preferences. Despite its potential, existing methods often struggle to generate preference data with sufficient coverage, limiting the accuracy and generalizability of the resulting reward models. To overcome this limitation, we propose APEC (Automated Preference generation with Enhanced Coverage), a novel method that improves the coverage of preference data. APEC achieves this by selecting policy pairs with significantly different iteration indices from the whole adversarial imitation learning process. We provide a theoretical analysis validating that the selected policy pairs provably satisfy preference relationships. Experimental results demonstrate that APEC consistently outperforms baseline methods in generating preferences with broader coverage across both vector-based and pixel-based control tasks. Consequently, the reward models trained with APEC align more closely with ground-truth rewards, yielding improved policy performance.
Lay Summary: In reinforcement learning (RL), designing effective reward functions is a major challenge. Poorly designed rewards can lead to agents "hacking" the reward, and refining them requires extensive human expertise. Traditional approaches such as manual reward design, imitation learning, and preference-based learning often rely on perfect expert demonstrations or excessive human feedback, limiting their use for complex tasks. Our study addresses a key question: how can we automatically generate diverse, high-quality preferences, without human input, to train more robust reward models?

We introduce APEC (Automated Preference Generation with Enhanced Coverage), inspired by adversarial imitation learning (AIL). In AIL, policies naturally improve over training iterations. APEC leverages this insight: by selecting policy pairs from sufficiently different training stages, we can automatically generate preferences. Unlike prior methods that add random noise to a single policy, APEC's selection ensures the preferences cover a wide range of behaviors, making the resulting reward models more accurate and robust. Testing APEC on 8 continuous control tasks (spanning both vector-based and pixel-based robot control), we found that it outperforms the baselines. By harnessing the natural progression of AIL, APEC generates diverse preference data that aligns reward models more closely with the true task objectives. This reduces reliance on human expertise, speeds up learning, and enables agents to master complex skills with minimal handholding.

**Why does this matter?** As RL advances in robotics, large language models, and autonomous systems, minimizing human involvement in reward design is critical for scalability. APEC offers a path to more independent, reliable AI that learns meaningful behaviors through automated preference generation, bringing us closer to practical, real-world applications that require less constant human guidance.
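To make the core idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation; see Link To Code) of how preference pairs could be formed from policy checkpoints saved at widely separated AIL iterations. The names `checkpoints`, `min_gap`, and `collect_trajectory` are illustrative assumptions.

```python
import random

def generate_preferences(checkpoints, num_pairs, min_gap, collect_trajectory):
    """Sketch: build preference pairs from AIL policy checkpoints.

    checkpoints: list of policies saved at increasing AIL iterations.
    min_gap: minimum difference in iteration index between the two sampled
             policies, so the later policy is expected to be preferred.
    collect_trajectory: function that rolls out a policy and returns a trajectory.
    """
    preferences = []
    n = len(checkpoints)
    for _ in range(num_pairs):
        # Sample two iteration indices that are far enough apart.
        i = random.randrange(0, n - min_gap)
        j = random.randrange(i + min_gap, n)
        worse = collect_trajectory(checkpoints[i])   # earlier policy: less preferred
        better = collect_trajectory(checkpoints[j])  # later policy: preferred
        preferences.append((worse, better))          # label: second segment preferred
    return preferences
```

The resulting labeled pairs could then be used to train a reward model with a standard preference-based loss (e.g., Bradley-Terry); the exact pair-selection criterion used by APEC is described in the paper.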
Link To Code: https://github.com/Zzl35/APEC
Primary Area: Reinforcement Learning
Keywords: reward modeling, preference generation, preference-based reinforcement learning, adversarial imitation learning
Submission Number: 5802