A Roadmap for Human-Agent Moral Alignment: Integrating Pre-defined Intrinsic Rewards and Learned Reward Models
Keywords: roadmap, moral alignment, reward models, intrinsic rewards
TL;DR: We propose an integration of intrinsic rewards and learned reward models to create more dynamic and user-driven alignment of foundation models.
Abstract: The prevailing practice in alignment often relies on human preference data (e.g., in RLHF or DPO), in which values are implicit and are essentially deduced from relative preferences over different model outputs. This approach suffers from low transparency, low controllability, and high cost. More recently, researchers have introduced the design of intrinsic reward functions that explicitly encode core human moral values for Reinforcement Learning-based fine-tuning of foundation agent models. This approach offers a way to explicitly define transparent values for agents, while also being cost-effective due to automated agent fine-tuning. However, its weaknesses include over-simplification, lack of flexibility, and the inability to dynamically adapt to the needs or preferences of (potentially diverse) users. In this position paper, we argue that a combination of intrinsic rewards and learned reward models may provide an effective way forward for alignment research that enables human agency and control. Integrating intrinsic rewards and learned reward models in post-training can allow models to respect specific users' moral preferences while also relying on a transparent foundation of pre-defined values.
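To make the proposed integration concrete, below is a minimal sketch (not from the paper) of one possible way to combine a pre-defined intrinsic moral reward with a learned, user-specific reward model into a single reward signal for RL-based post-training. All names (`CombinedReward`, `intrinsic_reward`, `learned_reward`, the weight `lam`) and the simple linear combination are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: a combined reward for RL-based post-training that mixes
# a transparent, pre-defined intrinsic reward with a learned reward model.
# All names and the linear-combination form are assumptions for illustration.

from dataclasses import dataclass
from typing import Callable

@dataclass
class CombinedReward:
    intrinsic_reward: Callable[[str, str], float]  # pre-defined moral reward r_I(prompt, response)
    learned_reward: Callable[[str, str], float]    # learned user-preference reward r_L(prompt, response)
    lam: float = 0.5                               # trade-off between fixed values and user preferences

    def __call__(self, prompt: str, response: str) -> float:
        # r(x, y) = lam * r_I(x, y) + (1 - lam) * r_L(x, y)
        return (self.lam * self.intrinsic_reward(prompt, response)
                + (1.0 - self.lam) * self.learned_reward(prompt, response))

# Toy usage: stand-ins for a hand-coded moral check and a trained preference model.
toy_intrinsic = lambda x, y: 1.0 if "harm" not in y.lower() else -1.0
toy_learned = lambda x, y: 0.3
reward_fn = CombinedReward(toy_intrinsic, toy_learned, lam=0.7)
print(reward_fn("How should I respond?", "Here is a safe, helpful answer."))
```

The weight `lam` stands in for the user-driven control the abstract describes: a higher value keeps behavior anchored to the transparent pre-defined values, while a lower value gives more influence to the learned, user-specific preferences.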
Submission Type: Short Paper (4 Pages)
Archival Option: This is an archival submission
Presentation Venue Preference: ICLR 2025
Submission Number: 96