Fairness-Aware Reward Optimization

ICLR 2026 Conference Submission 3445 Authors

09 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: language models, fairness, algorithmic fairness, preferences, RLHF
TL;DR: We reduce LLM bias and toxicity by constraining the reward model to be independent of sensitive attributes, conditional on unrestricted features.
Abstract: LLMs are typically aligned with human feedback via reward models, but demographic skews and group-dependent disagreements in annotations can propagate systematic unfairness. We introduce Fairness-Aware Reward Optimization (FARO), a principled framework for training reward models under demographic parity (DP), equalized odds (EO), or counterfactual fairness (CF) constraints. Our approach instantiates a proxy-Lagrangian descent–ascent game (ProxyGDA) that yields reward models with provable fairness certificates up to vanishing slack. We provide the first theoretical analysis of reward-level fairness in alignment, establishing: (i) guarantees that FARO-trained rewards satisfy DP/EO/CF; (ii) a formal accuracy–fairness trade-off induced by KL-regularized RL fine-tuning; and (iii) the existence of Pareto-optimal solutions along this trade-off. Across multiple LLMs on the representative BBQ dataset, FARO consistently reduces demographic bias and harmful generations while preserving or improving LLM quality and factuality.
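The abstract describes FARO only at a high level, so the following is a minimal sketch of the generic constrained descent-ascent idea it alludes to, not the paper's ProxyGDA algorithm or released code. It trains a toy reward head with a Bradley-Terry preference loss while ascending a Lagrange multiplier on a demographic-parity-style gap; every name, hyperparameter, and the synthetic data here are illustrative assumptions.

```python
# Hypothetical sketch of Lagrangian descent-ascent for a fairness-constrained
# reward model. Nothing here is taken from the submission's implementation.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy scalar reward head over fixed prompt/response embeddings."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Standard Bradley-Terry pairwise loss used in RLHF reward modeling.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

def dp_gap(rewards: torch.Tensor, groups: torch.Tensor) -> torch.Tensor:
    # Demographic-parity-style slack: absolute difference in mean reward
    # between two sensitive groups (binary attribute for simplicity).
    return (rewards[groups == 1].mean() - rewards[groups == 0].mean()).abs()

# Synthetic stand-in data: embeddings of chosen/rejected responses plus a
# binary sensitive attribute per example.
torch.manual_seed(0)
n, dim = 512, 128
chosen, rejected = torch.randn(n, dim), torch.randn(n, dim)
group = torch.randint(0, 2, (n,))

model = RewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
lam = torch.zeros(1)        # Lagrange multiplier for the fairness constraint
eps, lam_lr = 0.05, 0.1     # allowed slack and multiplier step size

for step in range(200):
    r_c, r_r = model(chosen), model(rejected)
    gap = dp_gap(torch.cat([r_c, r_r]), torch.cat([group, group]))

    # Descent on model parameters: preference accuracy plus penalized constraint.
    loss = preference_loss(r_c, r_r) + lam.item() * (gap - eps)
    opt.zero_grad()
    loss.backward()
    opt.step()

    # Ascent on the multiplier: it grows while the constraint is violated
    # and is clipped at zero once the gap falls within the slack eps.
    with torch.no_grad():
        lam = torch.clamp(lam + lam_lr * (gap.detach() - eps), min=0.0)
```

The descent step treats the multiplier as a constant while minimizing loss plus penalty; the ascent step raises the penalty only when the fairness gap exceeds the slack, which is the basic mechanism that a proxy-Lagrangian game refines with proxy constraints and certificates.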
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 3445