Fine-Grained Annotation and Multi-objective Optimization Based RLHF

Published: 2025, Last Modified: 21 Jan 2026, Venue: ICIC (13) 2025, License: CC BY-SA 4.0
Abstract: Reinforcement Learning from Human Feedback (RLHF) relies on a reward model to align large language models with human preferences. However, existing reward model training methods primarily depend on coarse-grained preference data and overlook fine-grained feedback signals, which makes the reward model susceptible to noise and bias. Although some methods incorporate human-annotated fine-grained feedback, existing preference datasets remain limited in both scale and diversity, restricting the application of RLHF to open-source models and hindering further exploration of more refined alignment techniques. In this work, we propose a Fine-Grained Annotation and multi-objective Optimization based RLHF (FGAORLHF) framework to improve alignment with human preferences. Our method introduces an automated preference annotation mechanism that evaluates model responses across multiple dimensions, including relevance, coherence, correctness, complexity, and helpfulness. Using these structured annotations, we optimize the reward model by incorporating multi-objective rewards, improving its ability to capture subtle human preferences. Additionally, we integrate supervised fine-tuning (SFT) with reinforcement learning using Proximal Policy Optimization (PPO) to train a model that better aligns with human preferences. Experiments on multiple datasets, evaluated by GPT-4o and human raters, show that our method significantly outperforms SFT and standard RLHF in alignment quality and preference consistency. These results highlight the potential of fine-grained annotation in advancing reward modeling for RLHF.
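The abstract does not say how the per-dimension annotations are combined, so the following is only a minimal sketch of one plausible reading: each response carries a score for the five named dimensions, a weighted mean collapses them into a scalar reward, and the higher-scoring response in a pair becomes the "chosen" example for reward model training. The 0-4 score scale, the weighted mean, the uniform weights, and all names (`Annotation`, `scalar_reward`, `preference_pair`) are assumptions for illustration, not the paper's method.

```python
from __future__ import annotations

from dataclasses import dataclass

# The five dimensions named in the abstract.
DIMENSIONS = ("relevance", "coherence", "correctness", "complexity", "helpfulness")


@dataclass
class Annotation:
    """Per-dimension scores for one response (hypothetical 0-4 scale)."""
    scores: dict[str, float]


def scalar_reward(ann: Annotation, weights: dict[str, float]) -> float:
    """Collapse per-dimension scores into one reward via a weighted mean.

    The weighted mean is an assumption; the abstract only says that
    multi-objective rewards are incorporated.
    """
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * ann.scores[d] for d in DIMENSIONS) / total_weight


def preference_pair(a: Annotation, b: Annotation, weights: dict[str, float]):
    """Return (chosen, rejected) by scalar reward, or None on a tie."""
    ra, rb = scalar_reward(a, weights), scalar_reward(b, weights)
    if ra == rb:
        return None  # no preference signal from this pair
    return (a, b) if ra > rb else (b, a)


# Hypothetical example: uniform weights and two annotated responses.
uniform = {d: 1.0 for d in DIMENSIONS}
resp_a = Annotation(dict(zip(DIMENSIONS, (4.0, 3.0, 4.0, 2.0, 3.0))))
resp_b = Annotation(dict(zip(DIMENSIONS, (2.0, 3.0, 1.0, 2.0, 2.0))))
pair = preference_pair(resp_a, resp_b, uniform)
```

Under these assumptions `pair` orders `resp_a` ahead of `resp_b`. Keeping the per-dimension scores alongside the scalar, rather than storing only the final preference label, is what would let a reward model be trained against several objectives at once.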