Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model

TMLR Paper 5186 Authors

23 Jun 2025 (modified: 02 Jul 2025) · Under review for TMLR · CC BY 4.0
Abstract: Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preferences. Previous RLHF works typically take a bandit formulation, which, though intuitive, ignores the sequential nature of LM generation and can suffer from the sparse reward issue. While recent works propose dense token-level RLHF, treating each token as an action may be too fine-grained for proper reward assignment. In this paper, we seek to get the best of both by training and utilizing a segment-level reward model, which assigns a reward to each semantically complete text segment spanning a short sequence of tokens. For reward learning, our method allows dynamic text segmentation and remains compatible with standard sequence-preference datasets. For effective RL-based LM training against the segment reward, we generalize the classical scalar bandit reward normalizers into location-aware normalizer functions and interpolate the segment rewards for further densification. Our method performs competitively on three popular RLHF benchmarks for LM policy: AlpacaEval 2.0, Arena-Hard, and MT-Bench. Ablation studies are conducted to further analyze our method.
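To make the abstract's pipeline concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of the two ideas it names: assigning a reward per text segment and densifying that signal to tokens. The segmentation rule, the `segment_reward` scorer, and the `normalizer` function are all placeholder assumptions standing in for the paper's dynamic segmentation, trained segment-level reward model, and location-aware normalizer.

```python
# Hypothetical sketch: segment-level rewards densified to per-token signals.
# All components here are toy stand-ins, not the paper's actual method.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    tokens: List[str]   # tokens forming one semantically complete segment
    start: int          # index of the segment's first token in the response


def segment_response(tokens: List[str]) -> List[Segment]:
    """Toy segmentation: cut at punctuation, a stand-in for the paper's
    dynamic text segmentation."""
    segments, buf, start = [], [], 0
    for i, tok in enumerate(tokens):
        buf.append(tok)
        if tok in {".", ",", ";", "!", "?"} or i == len(tokens) - 1:
            segments.append(Segment(tokens=buf, start=start))
            buf, start = [], i + 1
    return segments


def densify(
    tokens: List[str],
    segment_reward: Callable[[List[str]], float],
    normalizer: Callable[[int, int], float],
) -> List[float]:
    """Score each segment, subtract a location-aware baseline (a function of
    the segment's position rather than a single scalar), then spread each
    segment's reward uniformly over its tokens as a simple densification."""
    per_token = [0.0] * len(tokens)
    for seg in segment_response(tokens):
        r = segment_reward(seg.tokens) - normalizer(seg.start, len(tokens))
        share = r / len(seg.tokens)  # uniform interpolation within the segment
        for j in range(seg.start, seg.start + len(seg.tokens)):
            per_token[j] = share
    return per_token


if __name__ == "__main__":
    toks = "The capital of France is Paris .".split()
    fake_rm = lambda seg: 0.1 * len(seg)                    # placeholder reward model
    fake_norm = lambda start, total: 0.05 * start / max(total, 1)  # placeholder normalizer
    print(densify(toks, fake_rm, fake_norm))
```

The per-token values produced this way could then serve as dense rewards for an RL-based LM update, which is the role the abstract assigns to the interpolated segment rewards.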
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=CQuuBIAK1n
Changes Since Last Submission: To improve clarity and address presentation concerns raised in our previous TMLR submission:
- We have streamlined the reference section to under 5 pages.
- We added Appendix B, offering a theoretical motivation for finer-grained rewards over traditional bandit formulations.
Assigned Action Editor: ~Alec_Koppel1
Submission Number: 5186