Direct Multi-agent Motion Generation Preference Alignment with Implicit Feedback from Demonstrations
Keywords: Alignment from demonstrations, Alignment from human feedback
TL;DR: We propose a method to align motion generation models with human preferences using implicit feedback from expert demonstrations, eliminating the need for costly human annotations.
Abstract: Recent advancements in Large Language Models (LLMs) have transformed motion generation models in embodied applications such as autonomous driving and robotic manipulation. While LLM-type motion models benefit from scalability and efficient formulation, a discrepancy remains between their token-prediction imitation objectives and human preferences. This often results in behaviors that deviate from human-preferred demonstrations, making post-training behavior alignment crucial for generating human-preferred motions. Post-training alignment requires a large number of preference rankings over model generations, which are costly and time-consuming to annotate in multi-agent motion generation settings. Recently, there has been growing interest in using expert demonstrations to scalably build preference data for alignment. However, these methods often adopt a worst-case assumption, treating all samples generated by the reference model as unpreferred and relying on expert demonstrations to directly or indirectly construct preferred generations. This approach overlooks the rich signal provided by preference rankings among the model's own generations. In this work, instead of treating all generated samples as equally unpreferred, we propose a principled approach that leverages the implicit preferences encoded in expert demonstrations to construct preference rankings among the generations produced by the reference model, offering more nuanced guidance at low cost. We present the first investigation of direct preference alignment for multi-agent motion token-prediction models using implicit preference feedback from demonstrations. We apply our approach to large-scale traffic simulation and demonstrate its effectiveness in improving the realism of generated behaviors involving up to 128 agents, making a 1M-parameter token-prediction model comparable to state-of-the-art large models by relying solely on implicit feedback from demonstrations, without requiring additional human annotations or high computational costs. Furthermore, we provide an in-depth analysis of preference data scaling laws and their effects on over-optimization, offering valuable insights for future investigations.
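Method sketch (illustrative only): the snippet below is a minimal, hypothetical sketch of the core idea stated in the abstract, not the authors' implementation. It assumes that generations from the reference model are ranked by a user-supplied distance to the expert demonstration (distance_fn is a placeholder, not a function from the paper), and that the resulting best/worst pair is fed to a standard Direct Preference Optimization (DPO) loss over sequence-level log-probabilities.

import torch
import torch.nn.functional as F

def demo_preference_pair(generations, demonstration, distance_fn):
    # Rank the reference model's own generations by closeness to the expert
    # demonstration: the closest becomes "chosen", the farthest "rejected".
    # distance_fn is a hypothetical trajectory-distance function.
    scores = torch.tensor([distance_fn(g, demonstration) for g in generations])
    chosen = generations[int(scores.argmin())]
    rejected = generations[int(scores.argmax())]
    return chosen, rejected

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Standard DPO objective on sequence-level log-probabilities of the
    # policy and the frozen reference model.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(beta * margin).mean()

The key departure from worst-case-style alignment is that both the chosen and the rejected trajectories come from the model's own generations; the expert demonstration supplies only the ranking signal.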
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13515