Keywords: Post-training, RLHF, GRPO, GDPO, Multi-objective Optimization, Nash Welfare, LLM Alignment.
TL;DR: We study covariance sphering and welfare-based aggregation for converting reward vectors into scalars in policy alignment, mitigating signal redundancy and reward hacking.
Abstract: Alignment of large language models is increasingly formulated as optimization over multiple rubric signals. These signals typically exhibit strong statistical dependencies, ranging from redundancy to anti-correlation (e.g., conciseness versus correctness), raising the question of how to robustly convert vector-valued rewards into scalar advantages.
While recent state-of-the-art methods like GDPO address scale discrepancies via per-dimension normalization, they ignore reward geometry by treating coordinates as orthogonal.
This mishandles correlations: redundant objectives are double-counted, anti-correlated rewards are dominated by high-variance trade-off directions, and models are left free to exploit easy objectives at the expense of hard constraints.
We introduce $\textbf{GEOMA}$ (Geometric and Econometric Objectives for Multi-reward Alignment), a framework that decomposes reward aggregation into geometric preconditioning via covariance sphering of reward vectors, and econometric aggregation such as Nash Welfare and SoftMin.
We formally characterize these objectives, providing theoretical guarantees for their robustness to reward hacking and signal redundancy.
Empirically, we demonstrate that GEOMA outperforms GDPO on math reasoning and tool calling. On mathematical reasoning, it improves overall accuracy by 1.5\% on average while achieving $1.5\times$ greater token efficiency than GDPO.
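Below is a minimal sketch, in Python with NumPy, of the two-stage scalarization the abstract describes: covariance sphering of per-sample reward vectors followed by welfare-based aggregation (Nash Welfare or SoftMin). This is an illustration under our own assumptions, not the authors' implementation; the function names (`sphere_rewards`, `nash_welfare`, `softmin`), the positivity shift, and the temperature `tau` are hypothetical choices.

```python
# Illustrative sketch of sphering + welfare aggregation (hypothetical names).
import numpy as np

def sphere_rewards(R, eps=1e-6):
    """Whiten an (n_samples, n_objectives) reward matrix.

    Centers the rewards and applies the inverse matrix square root of the
    empirical covariance, so the transformed coordinates are decorrelated
    with unit variance (redundant objectives are no longer double-counted).
    """
    Rc = R - R.mean(axis=0, keepdims=True)
    cov = np.cov(Rc, rowvar=False) + eps * np.eye(R.shape[1])
    vals, vecs = np.linalg.eigh(cov)          # eigendecomposition of covariance
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return Rc @ inv_sqrt

def nash_welfare(R, shift=None):
    """Nash (log-product) welfare over objectives.

    Requires positive rewards, so the matrix is shifted to be positive
    before taking logs; returns one scalar per sample.
    """
    if shift is None:
        shift = 1.0 - R.min()
    return np.log(R + shift).sum(axis=1)

def softmin(R, tau=1.0):
    """Smooth minimum over objectives.

    Emphasizes the worst-performing objective, discouraging exploitation of
    easy objectives at the expense of hard constraints.
    """
    return -tau * np.logaddexp.reduce(-R / tau, axis=1)

# Example: 4 prompts scored on 3 correlated rubric objectives.
rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 3))
sphered = sphere_rewards(raw)
scalar_adv = softmin(sphered)   # or nash_welfare(sphered)
print(scalar_adv.shape)         # (4,)
```

The scalar output per sample would then play the role of the advantage signal in a group-relative policy update; which welfare function is preferable is an empirical question the paper studies.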
Submission Number: 114