Keywords: Preference Alignment, Reward Model, Diffusion Model
TL;DR: This paper presents a novel data generation method for large-scale agents' decision preference alignment.
Abstract: This paper presents a novel data generation method for large-scale agents' decision preference alignment. Despite the recent increase in attention to AI alignment, machine learning approaches to AI alignment remain challenging due to the lack of data. Trajectory data representing agent behavior is essential for various alignment methods such as Reinforcement Learning from Human Feedback (RLHF) and Inverse Reinforcement Learning (IRL). In this paper, we significantly reduce the dependence on trajectory data. Our method uses a generative approach to shift the focus from learning a reward model for alignment to learning how to generate sample data for alignment. This broadens the scope of data usable for alignment to include both microscopic and macroscopic information. Using detailed macroscopic and microscopic metrics, we verify that simulation results of the passenger boarding process based on generated decision preferences match well with those guided by ground-truth decision preferences.
Submission Type: Long Paper (9 Pages)
Archival Option: This is an archival submission
Presentation Venue Preference: ICLR 2025
Submission Number: 60