RLTHF: Targeted Human Feedback for LLM Alignment

Yifei Xu; Tusher Chakraborty; Emre Kiciman; Bibek Aryal; Srinagesh Sharma; Songwu Lu; Ranveer Chandra

Published: 01 May 2025, Last Modified: 18 Jun 2025. Venue: ICML 2025 poster. License: CC BY 4.0

TL;DR: RLTHF is a human-AI hybrid framework that iteratively refines reward model alignment by leveraging LLM-labeled data and strategic human annotations, achieving oracle-level performance with minimal human effort.

Abstract: Fine-tuning large language models (LLMs) to align with user preferences is challenging due to the high cost of quality human annotations in Reinforcement Learning from Human Feedback (RLHF) and the generalizability limitations of AI Feedback. To address these challenges, we propose RLTHF, a human-AI hybrid framework that combines LLM-based initial alignment with selective human annotations to reach the alignment quality of full human annotation with minimal human effort. RLTHF uses a reward model's reward distribution to identify hard-to-annotate samples that the LLM mislabeled, then iteratively improves alignment by integrating strategic human corrections while retaining the samples the LLM labeled correctly. Evaluations on the HH-RLHF and TL;DR datasets show that RLTHF reaches full-human annotation-level alignment with only 6-7% of the human annotation effort. Furthermore, models trained on RLTHF's curated datasets for downstream tasks outperform those trained on fully human-annotated datasets, underscoring the effectiveness of RLTHF.
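The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the exact selection rule, so the sketch assumes that pairs whose LLM-assigned preference receives the lowest reward margin from the current reward model are the most likely to be mislabeled, and sends those to human annotators first. The names `PreferencePair`, `reward_margin`, `select_for_human_review`, `apply_human_labels`, and the toy length-based reward are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    response_a: str
    response_b: str
    preferred: str                 # "a" or "b"; starts as the LLM's label
    human_verified: bool = False   # set once a human has checked the label


RewardFn = Callable[[str, str], float]  # (prompt, response) -> scalar reward


def reward_margin(pair: PreferencePair, reward_fn: RewardFn) -> float:
    """Reward of the labeled-preferred response minus reward of the other."""
    r_a = reward_fn(pair.prompt, pair.response_a)
    r_b = reward_fn(pair.prompt, pair.response_b)
    return r_a - r_b if pair.preferred == "a" else r_b - r_a


def select_for_human_review(pairs: List[PreferencePair],
                            reward_fn: RewardFn,
                            budget_fraction: float) -> List[int]:
    """Indices of unverified pairs with the lowest margins (assumed rule).

    The budget is a fraction of the whole dataset, spent per iteration.
    """
    scored = sorted((reward_margin(p, reward_fn), i)
                    for i, p in enumerate(pairs) if not p.human_verified)
    k = int(len(pairs) * budget_fraction)
    return [i for _, i in scored[:k]]


def apply_human_labels(pairs: List[PreferencePair],
                       indices: List[int],
                       human_label_fn: Callable[[PreferencePair], str]) -> None:
    """Overwrite the selected labels with human judgments, in place."""
    for i in indices:
        pairs[i].preferred = human_label_fn(pairs[i])
        pairs[i].human_verified = True


# Toy usage: response length stands in for a trained reward model.
toy_reward = lambda prompt, response: float(len(response))
pairs = [
    PreferencePair("q1", "short", "a much longer answer", "b"),
    PreferencePair("q2", "a detailed reply", "ok", "b"),
    PreferencePair("q3", "fine", "fine!", "a"),
    PreferencePair("q4", "long long long", "x", "a"),
]
to_review = select_for_human_review(pairs, toy_reward, budget_fraction=0.5)
```

In a full loop, one would retrain the reward model on the corrected dataset after each call to `apply_human_labels` and repeat the selection, which is the iterative refinement the abstract refers to. The retraining step is omitted here because the abstract gives no details about it.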

Lay Summary: Large language models (LLMs) have shown remarkable capabilities, but making them behave exactly how we want—especially in specific real-world tasks—often requires lots of high-quality human feedback. This process is expensive and slow. Our work introduces RLTHF (Reinforcement Learning from Targeted Human Feedback), a smarter, more efficient way to align LLMs with human preferences. Instead of asking people to label every data point, RLTHF first uses AI to label data and then pinpoints the small portion of examples where the AI is likely wrong or uncertain. Human effort is then focused only on those tricky cases. By combining AI-generated feedback with strategic human corrections, RLTHF achieves results as good as full human annotation—but with only 6–7% of the human effort. Surprisingly, models trained using RLTHF’s curated data even outperform those trained on datasets fully labeled by humans. This approach not only makes LLM alignment cheaper and faster, but also helps organizations customize models without exposing all of their private data to external annotators.

Primary Area: Deep Learning->Large Language Models

Keywords: RLHF, Reward Modeling

Submission Number: 2940
