ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization
Abstract: Recent advances in reinforcement learning from human feedback (RLHF) and autoregressive transformers have driven the evolution of large language models such as GPT-4.0, DeepSeek R1, and Llama 3.3, enabling richer and more personalized responses. However, prevailing RLHF paradigms, from Proximal Policy Optimization (PPO) to Direct Preference Optimization (DPO), still rely on binary preference labels, which demand extensive human annotation, capture only coarse, group-level tastes, and adapt poorly to individual users. To address these limitations, we introduce Adaptive Reward-Following (ARF), a self-assessment framework that converts free-form user feedback into continuous preference signals via a satisfaction scorer reaching 70\% accuracy on GoEmotions, Sentiment140, and DailyDialog. We further refine and debias these signals through data augmentations (synonym replacement, trace truncation, and score-bias annotation) and use a Dynamic Adapter Preference Tracker to model evolving user tendencies in real time. Building on these components, our Trace Bias (TB) fine-tuning algorithm optimizes continuous reward trajectories rather than binary labels. Experiments on Qwen-2/2.5, Gemma-2, and Llama-3.2 across four preference domains show that ARF outperforms PPO by 3.3\% and DPO by 7.6\% while remaining aligned with RLHF objectives. ARF thus offers a scalable, personalized, and cost-effective paradigm for next-generation RLHF in large language models.
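To make the core idea concrete, below is a minimal, hedged sketch (not the authors' implementation) of what the abstract describes: mapping free-form user feedback to a continuous satisfaction score and using it to weight the policy's log-likelihood on its own response trace, in place of a binary preference label. All names here (score_satisfaction, trace_bias_loss, the cue-word lexicon, the 0-1 score range, the centering to [-1, 1]) are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of a continuous-feedback fine-tuning signal.
# Assumption: PyTorch; the scorer and loss shown are simplified stand-ins.

import torch
import torch.nn.functional as F

POSITIVE = {"thanks", "great", "perfect", "helpful"}
NEGATIVE = {"wrong", "useless", "bad", "confusing"}

def score_satisfaction(feedback: str) -> float:
    """Hypothetical stand-in for the satisfaction scorer: maps free-form
    feedback text to a continuous score in [0, 1]."""
    tokens = feedback.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos + neg == 0:
        return 0.5  # neutral when no cue words are present
    return pos / (pos + neg)

def trace_bias_loss(logits: torch.Tensor, response_ids: torch.Tensor,
                    satisfaction: float) -> torch.Tensor:
    """Weight the per-token log-likelihood of the generated trace by a
    continuous reward derived from the satisfaction score."""
    log_probs = F.log_softmax(logits, dim=-1)                          # (T, V)
    token_logp = log_probs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    reward = 2.0 * satisfaction - 1.0                                  # center to [-1, 1]
    return -(reward * token_logp).mean()                               # reinforce when reward > 0

# Toy usage: a 5-token response over a 10-token vocabulary.
logits = torch.randn(5, 10, requires_grad=True)
response = torch.randint(0, 10, (5,))
loss = trace_bias_loss(logits, response, score_satisfaction("thanks, that was helpful"))
loss.backward()
```

The design point this illustrates is the one the abstract emphasizes: the training signal is a continuous, per-interaction reward inferred from the user's own words, rather than a pairwise binary label collected from annotators.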
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: reinforcement learning, low-resource learning, efficient optimization, self-supervised learning
Contribution Types: Approaches to low-resource settings, Approaches for low compute settings-efficiency, Theory
Languages Studied: English
Submission Number: 219