BalancedDPO: Adaptive Multi-Metric Alignment

TMLR Paper 6117 Authors

06 Oct 2025 (modified: 03 Feb 2026) · Under review for TMLR · CC BY 4.0
Abstract: Diffusion models have achieved remarkable progress in text-to-image generation, yet aligning them with human preferences remains challenging due to the presence of multiple, sometimes conflicting, evaluation metrics (e.g., semantic consistency, aesthetics, and human preference scores). Existing alignment methods typically optimize for a single metric or rely on scalarized reward aggregation, which can bias the model toward specific evaluation criteria. To address this challenge, we propose BalancedDPO, a framework that achieves multi-metric preference alignment within the Direct Preference Optimization (DPO) paradigm. Unlike prior DPO variants that rely on a single metric, BalancedDPO introduces a majority-vote consensus over multiple preference scorers and integrates it directly into the DPO training loop with dynamic reference model updates. This consensus-based formulation avoids reward-scale conflicts and ensures more stable gradient directions across heterogeneous metrics. Experiments on the Pick-a-Pic, PartiPrompt, and HPD datasets demonstrate that BalancedDPO consistently improves preference win rates over the baselines across Stable Diffusion 1.5, Stable Diffusion 2.1, and SDXL backbones. Comprehensive ablations further validate the benefits of majority-vote aggregation and dynamic reference updating, highlighting the method's robustness and generalizability across diverse alignment settings.
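The majority-vote consensus described above can be illustrated with a minimal sketch: each preference scorer casts one vote for whichever of two candidate images it ranks higher, and the majority decides the (winner, loser) pair for DPO training. The function name, score values, and metric choices below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a majority-vote consensus over heterogeneous scorers,
# as described in the abstract. Each scorer votes on which of two candidate
# images (A or B) it prefers; only the per-metric comparison matters, so
# differing reward scales never need to be reconciled.

def majority_vote_pair(scores_a, scores_b):
    """Given per-metric scores for candidates A and B (parallel lists),
    return the majority-vote winner ('A' or 'B') and the vote margin."""
    votes_a = sum(1 for sa, sb in zip(scores_a, scores_b) if sa > sb)
    votes_b = len(scores_a) - votes_a
    winner = "A" if votes_a >= votes_b else "B"
    return winner, abs(votes_a - votes_b)

# Example with three hypothetical metrics (e.g., semantic consistency,
# aesthetic score, human preference score) on very different scales:
winner, margin = majority_vote_pair([0.31, 5.8, 0.22], [0.29, 6.1, 0.20])
# A wins on metrics 1 and 3, B on metric 2, so A wins by a 1-vote margin.
```

Because only pairwise comparisons are counted, a metric with a large numeric range cannot dominate the consensus the way it would under scalarized reward aggregation.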
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: For reviewers 1 and 2, we have performed the following revisions: (1) **Section 4.2 (Effectiveness of different score methods)** now includes a dedicated paragraph formally defining the *Random Score*, *Vanilla Aggregation*, and *Normalized Aggregation* baselines to ensure reproducibility; (2) **Section 4.2 (Effectiveness of Balanced Multimetric Alignment and Reference Model update)** has been expanded with a theoretical discussion of the "moving anchor" effect of reference-model (π_ref) updates within a multi-metric consensus manifold; and (3) **Section 4.2 (Qualitative Analysis of Metric Bias)** has been added to explicitly interpret the metric-specific biases visible in **Figure 2**, clarifying how BalancedDPO navigates the Pareto front between aesthetic flair and semantic adherence.
For reviewer 3, we have implemented the following revisions to address the reviewer's concerns: (1) We added **Section C.3 to the Appendix**, providing new empirical validation on transformer-based architectures (**SD3-Medium**), proving that BalancedDPO is architecture-agnostic; (2) **Section 3.3** now includes a discussion on the **Flexibility of the Voting Schema**, clarifying that our framework inherently supports weighted voting for prioritized reward models; (3) We expanded **Section 3.4** with a formal mathematical analysis of **constructive vs. destructive gradient aggregation**, providing the requested evidence for how majority voting prevents gradient cancellation; and (4) The **Figure 1 caption** was updated to explicitly designate the majority-vote winner, ensuring the framework's mechanics are immediately clear. Together, these edits reinforce the theoretical depth and modern relevance of our proposed method.
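The weighted-voting flexibility mentioned for Section 3.3 can be sketched as a small generalization of the majority vote: each scorer carries a weight, and the consensus compares summed weights rather than ballot counts. The function, scores, and weights below are illustrative assumptions only.

```python
# Hedged sketch of a weighted voting schema: a prioritized reward model
# receives a larger weight, so its vote counts for more in the consensus.
# With equal weights this reduces to plain majority voting.

def weighted_vote_pair(scores_a, scores_b, weights):
    """Return the weighted-vote winner ('A' or 'B') for two candidates,
    given parallel per-metric scores and per-scorer weights."""
    wa = sum(w for sa, sb, w in zip(scores_a, scores_b, weights) if sa > sb)
    wb = sum(w for sa, sb, w in zip(scores_a, scores_b, weights) if sb > sa)
    return "A" if wa >= wb else "B"

# Upweighting the second scorer (a hypothetical prioritized reward model)
# can flip the outcome relative to an unweighted vote:
print(weighted_vote_pair([0.31, 5.8], [0.29, 6.1], [1.0, 1.0]))  # -> A
print(weighted_vote_pair([0.31, 5.8], [0.29, 6.1], [1.0, 3.0]))  # -> B
```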
Assigned Action Editor: ~Huaxiu_Yao1
Submission Number: 6117