Keywords: Preference-conditioned alignment, Evolving preference goals, Multi-objective reinforcement learning
Abstract: Deployed AI agents increasingly face evolving preference goals: user intent shifts, contexts change acceptable risk, and constraints update over time, so a single deployed LLM policy should re-target its behavior on the fly, without further weight updates. Standard Reinforcement Learning from Human Feedback (RLHF) collapses multiple objectives into a single scalar reward, yielding brittle trade-offs. Meanwhile, common preference-conditioned LLM alignment pipelines often sample one preference per update and rely on linear scalarization, which can (i) weaken sensitivity to the preference signal through interference across conflicting updates and (ii) under-cover non-convex trade-off regions. We propose MERIDIAN (Meta-Learning for Preference-Conditioned Alignment), a bi-level framework that treats each preference as a separate alignment task: an inner loop performs preference-specific adaptation in isolation, and a first-order Reptile-style outer update consolidates the adapted parameters to preserve steerability across the preference simplex. We pair this with a smoothed Tchebycheff scalarization to improve coverage of non-convex trade-off regions. Empirically, MERIDIAN achieves denser Pareto coverage, better access to extreme goal modes, and improved performance on unseen preferences, supporting inference-time goal re-targeting. We also provide a generalization result showing how optimizing an empirical objective over sampled preferences can transfer to unseen preferences.
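The sketch below illustrates the bi-level structure described in the abstract: each sampled preference vector defines an inner-loop adaptation under a smoothed Tchebycheff scalarization, and a first-order Reptile-style step consolidates the adapted parameters. This is not the authors' implementation; the policy class, the log-sum-exp smoothing, the omission of a Tchebycheff reference point, and all names (`smooth_tchebycheff`, `reptile_outer_step`, `objective_losses`) are illustrative assumptions for a generic differentiable policy.

```python
# Minimal sketch of a Reptile-style bi-level update with a smoothed
# Tchebycheff scalarization, assuming a generic differentiable policy.
# Not the paper's code; names and the exact smoothing are assumptions.
import copy
import torch
import torch.nn as nn


def smooth_tchebycheff(losses: torch.Tensor, prefs: torch.Tensor, mu: float = 0.1) -> torch.Tensor:
    """Log-sum-exp softening of max_i prefs_i * losses_i.
    (A reference point, often used in Tchebycheff scalarization, is omitted here.)"""
    return mu * torch.logsumexp(prefs * losses / mu, dim=-1)


def reptile_outer_step(policy: nn.Module,
                       sample_preference,      # () -> preference weights on the simplex
                       objective_losses,       # (policy) -> tensor of per-objective losses
                       inner_steps: int = 5,
                       inner_lr: float = 1e-2,
                       outer_lr: float = 0.1,
                       num_tasks: int = 4) -> None:
    """One outer update: adapt a copy of the policy to each sampled preference
    in isolation, then move the shared parameters toward the mean of the
    adapted parameters (first-order; no second-order derivatives)."""
    base_params = [p.detach().clone() for p in policy.parameters()]
    adapted_sets = []

    for _ in range(num_tasks):
        task_policy = copy.deepcopy(policy)              # inner loop uses its own copy
        opt = torch.optim.SGD(task_policy.parameters(), lr=inner_lr)
        prefs = sample_preference()                      # one preference vector = one task
        for _ in range(inner_steps):
            losses = objective_losses(task_policy)       # per-objective losses
            loss = smooth_tchebycheff(losses, prefs)
            opt.zero_grad()
            loss.backward()
            opt.step()
        adapted_sets.append([p.detach() for p in task_policy.parameters()])

    # Consolidate: theta <- theta + outer_lr * mean_task(theta_task - theta)
    with torch.no_grad():
        for i, p in enumerate(policy.parameters()):
            mean_adapted = torch.stack([a[i] for a in adapted_sets]).mean(dim=0)
            p.copy_(base_params[i] + outer_lr * (mean_adapted - base_params[i]))
```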
Submission Number: 223