A Tale of Two Problems: Multi-Objective Bilevel Learning Meets Equality Constrained Multi-Objective Optimization
Keywords: Multi-objective optimization, Bilevel optimization, Preference.
Abstract: In recent years, bilevel optimization (BLO) has attracted significant attention for its broad applications in machine learning.
However, most existing works on BLO remain confined to the single-objective setting and rely on the lower-level strong convexity assumption, which significantly restricts their applicability to increasingly complex modern machine learning problems.
In this paper, we make the first attempt to extend BLO to the multi-objective setting under a relaxed lower-level general convexity (LLGC) assumption.
To this end, we reformulate the multi-objective bilevel learning (MOBL) problem with LLGC into an equality constrained multi-objective optimization (ECMO) problem.
This transformation yields a single-level formulation that is more amenable to algorithm design while preserving the optimal solutions of the original MOBL problem.
However, ECMO is itself a new problem class that has not yet been studied in the literature: there are no existing results on its algorithmic design or theoretical analysis, nor even a formally established convergence metric.
To address this gap, we first establish a new Karush–Kuhn–Tucker (KKT)-based Pareto stationarity as the convergence criterion for ECMO algorithm design.
Building on this foundation, we propose a weighted Chebyshev (WC)-penalty algorithm that achieves a finite-time convergence rate of $\mathcal{O}(ST^{-\frac{1}{2}})$ to KKT-based Pareto stationarity in both deterministic and stochastic settings, where $S$ denotes the number of objectives and $T$ is the total number of iterations.
Moreover, by varying the preference vector over the $S$-dimensional simplex, our WC-penalty method systematically explores the Pareto front.
Finally, solutions to the ECMO problem translate directly into solutions for the original MOBL problem, thereby closing the loop between these two foundational optimization frameworks.
We verify the efficacy of our approach through experiments on multi-objective data weighting in reinforcement learning from human feedback (RLHF) reward model training and large language model (LLM) alignment.
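For concreteness, here is a minimal worked sketch of the weighted Chebyshev scalarization and penalty idea referenced in the abstract, assuming their standard textbook forms; the notation ($f_s$ for the objectives, $h$ for the equality constraint, $z^{\star}$ for a reference point, $\rho$ for the penalty weight) is illustrative, and the paper's actual WC-penalty construction may differ. Given the ECMO problem
$$\min_{x} \; \big(f_1(x), \dots, f_S(x)\big) \quad \text{s.t.} \quad h(x) = 0,$$
and a preference vector $\lambda$ in the $S$-dimensional simplex, the standard weighted Chebyshev scalarization is
$$F_{\lambda}(x) = \max_{1 \le s \le S} \lambda_s \big(f_s(x) - z_s^{\star}\big),$$
and a generic quadratic-penalty surrogate folds the equality constraint into a single-objective problem,
$$\min_{x} \; F_{\lambda}(x) + \tfrac{\rho}{2}\,\|h(x)\|^2, \qquad \rho > 0.$$
Sweeping $\lambda$ over the simplex then traces out different trade-off points on the Pareto front.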
Supplementary Material: zip
Primary Area: optimization
Submission Number: 13972