Keywords: multi-objective alignment; process reward model
Abstract: Aligning large language models to human preferences is inherently multidimensional, yet most alignment pipelines reduce these diverse preferences to a single objective. We address simultaneous alignment across domains with verifiable rewards, non-verifiable subjective preferences, and complex interactions between them. Our framework unifies process reward model (PRM) training across these settings, applies $\textbf{M}$ulti-$\textbf{A}$ction-$\textbf{H}$ead $\textbf{DPO}$ (MAH-DPO) with a vectorized reward to capture multiple objectives, and enables fine-grained user control at inference time. Our experiments show improved multi-objective performance with reduced trade-offs and enhanced controllability.
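The abstract does not include implementation details, so the sketch below is only a rough illustration of how a DPO-style loss with per-objective action heads and a vectorized reward margin might be assembled in PyTorch. The function name, tensor shapes, and the linear head-weighting scheme are assumptions for illustration, not the paper's actual MAH-DPO formulation.

```python
import torch
import torch.nn.functional as F

def multi_head_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps,
                        head_weights, beta=0.1):
    """Hypothetical multi-head DPO loss sketch.

    Each log-prob tensor has shape (batch, num_heads): the sequence
    log-probability of the chosen / rejected response under each
    objective-specific action head. `head_weights` (num_heads,) trades
    off the objectives.
    """
    # Per-head implicit rewards: log-ratio of policy vs. reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Vectorized reward margin, one entry per objective head.
    margins = chosen_rewards - rejected_rewards          # (batch, num_heads)

    # Standard DPO objective applied head-wise, then combined with weights.
    per_head_loss = -F.logsigmoid(margins)               # (batch, num_heads)
    loss = (per_head_loss * head_weights).sum(dim=-1).mean()
    return loss, margins.detach()
```

In a setup like this, the same per-head weight vector could in principle be exposed at inference time to steer generation toward particular objectives, which is one way the fine-grained user control mentioned in the abstract could be realized.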
Submission Number: 137