Barriers to Pareto Steerability in Preference-Conditioned LLM Alignment

Published: 02 Mar 2026, Last Modified: 09 Mar 2026 · ICLR 2026 Workshop ICBINB · CC BY 4.0
Keywords: LLM Alignment, Multi-Objective Optimization, Pareto Steerability
Abstract: Post-training alignment of Large Language Models (LLMs) is inherently a multi-objective problem, yet standard paradigms often collapse conflicting goals into a single "one-size-fits-all" reward scalar. While preference-conditioned alignment aims to give users dynamic control over trade-offs, achieving broad and continuous steerability across the Pareto frontier remains challenging in practice. In this paper, we investigate the limitations of current state-of-the-art methods on the Helpfulness vs. Harmlessness (HH) task and identify two recurring failure modes: an Optimization Gap, where conflicting gradients lead to fragmented behavior across preferences, and a Geometric Gap, where linear scalarization cannot recover non-convex regions of the trade-off space. Through a sequence of systematic experiments, we show how these two gaps manifest and demonstrate that addressing only one of them is insufficient in the setting studied here. Finally, we evaluate a framework that combines meta-learning with geometry-aware scalarization, aiming to address both gaps in a unified way.
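The Geometric Gap admits a standard one-line illustration; the notation below ($R_i$ for per-objective rewards, $w$ for the user preference, $z^*$ for an ideal point) is ours for exposition, not the paper's. Linear scalarization

\[
\max_{\theta} \; w_1 R_1(\theta) + w_2 R_2(\theta), \qquad w \in \Delta^{1},
\]

only attains Pareto points on the convex hull of the achievable reward set, so solutions lying in concave regions of the Helpfulness vs. Harmlessness frontier are unreachable for every choice of $w$. A geometry-aware alternative such as the weighted Chebyshev scalarization

\[
\min_{\theta} \; \max_{i \in \{1,2\}} \, w_i \bigl( z_i^{*} - R_i(\theta) \bigr),
\]

with $z^*$ an ideal (utopia) reward vector, can recover any Pareto-optimal point for a suitable $w$, including those in non-convex regions; whether the paper's framework uses this exact form is not stated in the abstract.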
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 70