Keywords: Preference Alignment, Reward Model, Diffusion Model
TL;DR: This paper presents a novel data generation method for large-scale agents' decision preference alignment.
Abstract: This paper presents a novel data generation method for large-scale agents' decision preference alignment. Despite the recent increase in attention to AI alignment, machine learning approaches to AI alignment remain challenging due to the lack of data. Trajectory data representing agent behavior is essential for various alignment methods such as Reinforcement Learning from Human Feedback (RLHF) and Inverse Reinforcement Learning (IRL). In this paper, we significantly reduce the dependence on trajectory data. Our method uses a generative approach to shift the focus from learning a reward model for alignment to learning how to generate sample data for alignment. This broadens the scope of data usable for alignment to include both microscopic and macroscopic information. Using detailed macroscopic and microscopic metrics, we verify that simulation results of the passenger boarding process based on generated decision preferences match well with those guided by ground-truth decision preferences.
Submission Type: Long Paper (9 Pages)
Archival Option: This is an archival submission
Presentation Venue Preference: ICLR 2025
Submission Number: 60