TL;DR: Active selection of on-policy generations for enhanced group-based preference optimization
Abstract: Multi-preference optimization enriches language-model alignment beyond pairwise preferences by contrasting entire sets of helpful and undesired responses, enabling richer training signals for large language models. During self-play alignment, these models often produce numerous candidate answers per query, making it computationally infeasible to include all of them in the training objective. We propose Active Multi-Preference Optimization (AMPO), which combines on-policy generation, a multi-preference group-contrastive loss, and active subset selection. Specifically, we score and embed large candidate pools of responses, then pick a small but informative subset, covering reward extremes and distinct semantic clusters, for preference optimization. The resulting contrastive-training scheme identifies not only the best and worst answers but also subtle, underexplored modes that are crucial for robust alignment. Theoretically, we provide guarantees of expected reward maximization under our active selection method. Empirically, AMPO achieves state-of-the-art results on AlpacaEval with Llama 8B and Mistral 7B. We release our datasets [here](https://huggingface.co/Multi-preference-Optimization).
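To make the selection step concrete, here is a minimal illustrative sketch (not the authors' released code) of how one might pick a subset that covers reward extremes plus distinct semantic clusters, as described in the abstract. The choice of k-means, the cluster count `k`, and the per-cluster representative rule are assumptions for illustration only.

```python
# Illustrative sketch (assumptions, not the authors' implementation): given
# per-response rewards and embeddings for one prompt, select a small subset
# covering the reward extremes and one representative per semantic cluster.
import numpy as np
from sklearn.cluster import KMeans


def select_active_subset(rewards: np.ndarray, embeddings: np.ndarray, k: int = 4) -> list:
    """Return indices of a small, informative subset of candidate responses."""
    n = len(rewards)
    # Always keep the reward extremes (best and worst answers).
    chosen = {int(np.argmax(rewards)), int(np.argmin(rewards))}

    # Cluster the response embeddings to surface distinct semantic modes
    # (k-means is an assumed stand-in for whatever clustering is used).
    labels = KMeans(n_clusters=min(k, n), n_init=10, random_state=0).fit_predict(embeddings)
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        # Take the highest-reward member of each cluster as its representative.
        chosen.add(int(members[np.argmax(rewards[members])]))
    return sorted(chosen)


# Toy usage: 16 candidate responses with scalar rewards and 8-dim embeddings.
rng = np.random.default_rng(0)
subset = select_active_subset(rng.normal(size=16), rng.normal(size=(16, 8)))
print(subset)  # indices of the few responses fed to the group-contrastive loss
```

The selected subset would then serve as the positive and negative groups in the group-contrastive preference objective; the exact loss and scoring model are described in the paper itself.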
Lay Summary: Large language models (LLMs), like those used in chatbots or virtual assistants, often generate multiple possible answers to a question. But teaching these models to consistently choose the best response is tricky — especially when most training methods compare only two answers at a time. This limited approach misses valuable signals from the many other responses the model could have considered.
Our research introduces Active Multi-Preference Optimization (AMPO) — a new way to train language models that looks at groups of good and bad answers instead of just pairs. We let the model generate its own possible answers, then carefully select a few that are diverse and informative. These selected responses help the model learn not only what a great answer looks like, but also what makes an answer unclear, vague, or subtly misleading.
This smarter training method makes language models more accurate and aligned with human expectations. In tests on popular benchmarks, our method outperforms existing techniques — and we’ve made our data and code publicly available to help others build more reliable AI systems.
Primary Area: Deep Learning->Large Language Models
Keywords: Preference Optimization, Active Learning, Multi-Preference Optimization, RLHF
Submission Number: 4475