Projection Optimization: A General Framework for Multi-Objective and Multi-Group RLHF

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We provide a reward-free algorithmic framework for multi-objective and multi-group RLHF with general non-linear aggregation.
Abstract: Reinforcement Learning with Human Feedback (RLHF) is a widely used fine-tuning approach for aligning machine learning models, particularly Language Models (LMs), with human preferences. Since multiple objectives typically drive a preference, humans find it easier to express per-objective comparisons than a single global preference between two choices, e.g., comparing two papers on their novelty, clarity, correctness, etc. Multi-Objective RLHF aims to use per-objective preference feedback to achieve a Pareto-optimal trade-off among these objectives by aggregating them into a single unified objective for optimization. However, nearly all prior works rely on linear aggregation, which rules out policies that favor specific objectives, such as the worst one. The only existing approach using non-linear aggregation is computationally expensive due to its reward-based nature and the need for retraining whenever the aggregation parameters change. In this work, we address this limitation by transforming the non-linear aggregation maximization problem into a series of sub-problems, each involving only linear aggregation and therefore computationally efficient to solve. We further extend our framework to handle multi-group scenarios, where each group has distinct weights for the objectives. Our method enables achieving consensus or maximizing the aggregated objective across all groups. Theoretically, we demonstrate that our algorithmic framework achieves sublinear regret and can easily be adapted to a reward-free algorithm. Empirically, leveraging our theoretical insights, we propose a nearly training-free algorithm once the optimal policies for individual objectives are obtained.
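For concreteness, here is a hedged sketch of the two aggregation styles the abstract contrasts; the notation (per-objective values J_i(π), weights w_i, exponent p) is illustrative and may differ from the paper's exact parameterization.

```latex
% Linear aggregation (weighted sum), as in most prior multi-objective RLHF work:
\max_{\pi} \; \sum_{i=1}^{m} w_i \, J_i(\pi)

% Non-linear aggregation, e.g. a weighted p-norm of the per-objective values;
% for p < 1 it emphasizes the worse-off objectives, and p \to -\infty recovers
% the worst-case (max-min) objective that linear aggregation cannot express:
\max_{\pi} \; \Bigl( \sum_{i=1}^{m} w_i \, J_i(\pi)^{p} \Bigr)^{1/p}
```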
Lay Summary: This paper explores how to better align large language models with human preferences. People often find it easier to evaluate individual aspects, such as fluency, safety, and helpfulness, than to provide an overall preference. Effectively aggregating human feedback across these objectives is therefore key to building more aligned models. In this work, we consider weighted p-norm aggregation as a flexible way to combine per-objective reward models, and we introduce a projection-based optimization algorithm to maximize this aggregation. We further extend our approach to multi-group settings, where each group may have different preferences and aggregation methods. Theoretically, we demonstrate that our algorithmic framework achieves sublinear regret. Empirically, leveraging our theoretical insights, we propose a nearly training-free algorithm once the optimal policies for individual objectives are obtained.
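The reduction of a non-linear aggregation to a sequence of linear sub-problems can be illustrated with a small, self-contained toy that is not the paper's algorithm: a Frank-Wolfe-style loop that repeatedly linearizes a concave weighted p-norm aggregation and solves the resulting linearly weighted sub-problem over a finite set of hypothetical policies.

```python
# Toy sketch (not the paper's algorithm): maximize a concave, non-linear
# aggregation of per-objective values by solving a sequence of *linearly*
# weighted sub-problems (a Frank-Wolfe-style reduction).
#
# Assumptions (hypothetical, for illustration only):
#   - a finite "policy" set, where policy i achieves the per-objective
#     value vector V[i] over m objectives;
#   - the learner plays a mixture x over policies, so the achieved value
#     vector is V.T @ x;
#   - the aggregation is a weighted p-norm with p < 1, which is concave on
#     positive values and favours the worse-off objectives.

import numpy as np

rng = np.random.default_rng(0)
n_policies, n_objectives = 5, 3
V = rng.uniform(0.1, 1.0, size=(n_policies, n_objectives))  # per-objective values

w = np.array([0.5, 0.3, 0.2])   # objective weights (hypothetical)
p = 0.5                          # p < 1 => concave aggregation


def aggregate(v):
    """Weighted p-norm-style aggregation: (sum_i w_i v_i^p)^(1/p)."""
    return np.sum(w * v ** p) ** (1.0 / p)


def grad_aggregate(v):
    """Gradient of the aggregation w.r.t. the value vector v (requires v > 0)."""
    s = np.sum(w * v ** p)
    return s ** (1.0 / p - 1.0) * w * v ** (p - 1.0)


# Frank-Wolfe over policy mixtures: each iteration only needs a linear sub-problem.
x = np.full(n_policies, 1.0 / n_policies)    # start from the uniform mixture
for t in range(1, 201):
    v = V.T @ x                              # current per-objective values
    lin_w = grad_aggregate(v)                # linearization weights at v
    best = np.argmax(V @ lin_w)              # linear sub-problem: best weighted sum
    step = 2.0 / (t + 2.0)                   # standard Frank-Wolfe step size
    x = (1 - step) * x
    x[best] += step

print("mixture over policies:", np.round(x, 3))
print("aggregated value:", aggregate(V.T @ x))
```

The point of the sketch is only structural: each inner step maximizes a plain weighted sum, so changing the aggregation parameters (the weights or p) changes only the linearization, not the per-objective components that were already obtained.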
Primary Area: Theory->Learning Theory
Keywords: multi-objective learning, RLHF theory
Submission Number: 14279