Abstract: As a key component of reinforcement learning from human feedback (RLHF), reward learning directly influences the final learned policy.
Unfortunately, existing theoretical estimation error bounds for reward learning depend on the complexity of the reward function class, on unattainable optimal parameters, or on non-zero constants that are independent of the sample size. The resulting bounds cannot be computed and are therefore uninformative for reward function classes whose complexity is unknown.
To address this issue, this paper analyzes parameter estimation for reward learning in RLHF under general function approximation, without restricting the complexity of the reward function class.
A tighter bound is derived that contains no non-zero terms independent of the sample size.
The dependence on the unattainable optimal parameters is removed by applying a linear approximation around the learned parameters (see the sketch below).
Additionally, the relationship between the preference dataset and the learned parameters is examined to show how data can be collected efficiently based on the currently learned parameters.
Inspired by these theoretical results, a novel offline RLHF algorithm with parameter constraints is proposed, which restricts the parameters to the valid space defined by the dataset.
Furthermore, an online RLHF algorithm is proposed that iteratively optimizes parameter learning and improves data-collection efficiency.
This work provides a tighter bound than previous studies and offers theoretical guidance for online data collection under general function approximation.
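A minimal sketch of the linear-approximation step mentioned above, in generic notation (the symbols $r_\theta$, $\hat{\theta}$, and $\theta^{*}$ are illustrative placeholders, not necessarily the paper's notation): expanding the reward model to first order around the learned parameters gives
\[
r_{\theta^{*}}(x) \;\approx\; r_{\hat{\theta}}(x) + \nabla_{\theta} r_{\hat{\theta}}(x)^{\top}\,(\theta^{*} - \hat{\theta}),
\]
so quantities involving the unattainable optimal parameters $\theta^{*}$ can be rewritten in terms of the parameter difference $\theta^{*} - \hat{\theta}$ evaluated around the learned parameters $\hat{\theta}$.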
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Shaofeng_Zou1
Submission Number: 7005