AMPO: Active Multi-Preference Optimization for Self-play Preference Selection

Published: 08 Mar 2025, Last Modified: 08 Mar 2025 · SSI-FM Poster · CC BY 4.0
Keywords: Self Play Preference Optimization, Active Learning, Group Contrastive Loss, Multi-Preference Optimization, RLHF
TL;DR: Active Selection of on-policy generated data for improved multi-preference optimization using k-medoids
Abstract: Multi-preference optimization improves DPO-style alignment beyond pairwise preferences by contrasting entire sets of helpful and undesired responses, enabling richer training signals for large language models. During self-play alignment, these models often produce numerous candidate answers per query, making it computationally infeasible to include all of them in the training objective. We propose Active Multi-Preference Optimization (AMPO), which combines on-policy generation, a multi-preference group-contrastive loss, and active subset selection. Specifically, we score and embed large candidate pools of responses, then pick a small but informative subset that covers reward extremes and distinct semantic clusters for preference optimization. The resulting contrastive-training scheme identifies not only the best and worst answers but also subtle, underexplored modes that are crucial for robust alignment. Theoretically, we provide guarantees of expected reward maximization under our active selection method. Empirically, AMPO achieves state-of-the-art results on AlpacaEval with Llama 8B, reaching a 52% win-rate over GPT-4o. We release our datasets (anonymously) at [huggingface/MPO](https://huggingface.co/datasets/Multi-preference-Optimization/).
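The abstract describes scoring and embedding a large on-policy candidate pool, then keeping a small subset that spans the reward extremes and distinct semantic clusters (the TL;DR names k-medoids as the clustering step). The sketch below illustrates that selection idea only; it is not the paper's released code, and names such as `select_subset` and `k_medoids`, the cosine-distance choice, and the pool/subset sizes are illustrative assumptions.

```python
# Minimal sketch of AMPO-style active subset selection (illustrative, not the authors' code):
# keep the best- and worst-reward responses, then fill remaining slots with k-medoids
# over response embeddings to cover distinct semantic clusters.
import numpy as np


def k_medoids(dist: np.ndarray, k: int, n_iter: int = 50, seed: int = 0) -> np.ndarray:
    """Basic alternating (Voronoi-iteration) k-medoids on a precomputed distance matrix."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every point to its nearest current medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            # New medoid = cluster member minimizing total distance to the cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids


def select_subset(embeddings: np.ndarray, rewards: np.ndarray, k: int) -> np.ndarray:
    """Pick k responses: the reward extremes plus semantically diverse medoids."""
    # Cosine distance between L2-normalized response embeddings.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - x @ x.T
    best, worst = int(np.argmax(rewards)), int(np.argmin(rewards))
    chosen = {best, worst}
    # Fill remaining slots with cluster medoids (may overlap with the extremes,
    # in which case slightly fewer than k indices are returned).
    for m in k_medoids(dist, k=k):
        if len(chosen) >= k:
            break
        chosen.add(int(m))
    return np.array(sorted(chosen))


if __name__ == "__main__":
    # Toy usage: 32 on-policy candidates, keep 6 for the group-contrastive loss.
    rng = np.random.default_rng(1)
    emb = rng.normal(size=(32, 768))   # stand-in for response embeddings
    rew = rng.normal(size=32)          # stand-in for reward-model scores
    print(select_subset(emb, rew, k=6))
```

The chosen indices would then form the helpful/undesired sets fed to the group-contrastive loss; the actual distance metric, pool size, and subset size in the paper may differ.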
Submission Number: 48