Keywords: Post-training; Multi-reward; Agent
Abstract: Post-training preference optimization is a central approach for aligning text-to-image models with human preferences.
Recent work attempts to mitigate reward hacking by jointly optimizing multiple reward models, under the assumption that richer objectives provide more constrained guidance.
However, we find that multi-reward optimization does not inherently prevent reward hacking, and its effectiveness critically depends on the reliability of the underlying reward models. Moreover, the inherent trade-offs among multiple rewards call for principled multi-objective optimization algorithms.
To address this challenge, we propose a Pareto-frontier-guided optimal transport framework for robust multi-reward optimization. Our method dynamically constructs Pareto frontiers during training and maps dominated samples toward the frontier via distribution-aware optimal transport; it can be applied to arbitrary sets of reward models. To provide a more rigorous assessment, we introduce the Joint Domination Rate (JDR) and Joint Collapse Rate (JCR) as principled metrics to quantify multi-reward synergy and reward hacking. Experimental results show that our approach outperforms baselines with an 11% gain in JDR and achieves a win rate of nearly 80% in human evaluations.
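As a rough illustration of the two steps named above (building a Pareto frontier and transporting dominated samples toward it), the following sketch works on a small batch of reward vectors. It is not the authors' implementation: the function names (`pareto_split`, `sinkhorn_plan`, `frontier_targets`) are hypothetical, the squared Euclidean cost in reward space is an assumption, and a generic entropic (Sinkhorn) solver with uniform marginals stands in for the paper's distribution-aware optimal transport. Rewards are assumed to be normalized to roughly [0, 1], with higher values being better.

```python
import math


def dominates(a, b):
    """True if reward vector a Pareto-dominates b (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_split(rewards):
    """Return (frontier_indices, dominated_indices) for a list of reward vectors."""
    frontier, dominated = [], []
    for i, r in enumerate(rewards):
        if any(dominates(o, r) for j, o in enumerate(rewards) if j != i):
            dominated.append(i)
        else:
            frontier.append(i)
    return frontier, dominated


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals (assumed, generic solver)."""
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternate scaling so the plan's rows and columns match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def frontier_targets(rewards, eps=0.1):
    """Map each dominated sample to a barycentric target on the frontier."""
    frontier, dominated = pareto_split(rewards)
    if not dominated:
        return {}
    # Assumed cost: squared Euclidean distance in reward space.
    cost = [
        [sum((x - y) ** 2 for x, y in zip(rewards[d], rewards[f])) for f in frontier]
        for d in dominated
    ]
    plan = sinkhorn_plan(cost, eps)
    dim = len(rewards[0])
    targets = {}
    for row, d in zip(plan, dominated):
        mass = sum(row)
        # Plan-weighted average of frontier reward vectors for this sample.
        targets[d] = [
            sum(p * rewards[f][k] for p, f in zip(row, frontier)) / mass
            for k in range(dim)
        ]
    return targets


# Toy batch with two rewards per sample (values are illustrative only).
rewards = [[0.9, 0.2], [0.2, 0.9], [0.6, 0.6], [0.3, 0.3], [0.1, 0.1]]
frontier, dominated = pareto_split(rewards)  # frontier: [0, 1, 2], dominated: [3, 4]
targets = frontier_targets(rewards)
```

In this sketch the targets live in reward space only; how such targets would be turned into a training signal for the text-to-image model is not specified here and would depend on the full method.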
Submission Number: 56