Keywords: Iterated RLHF, Reward Model Overoptimisation, Alignment
Abstract: Reinforcement learning from human feedback (RLHF) aligns large language models with human preferences but often suffers from reward model overoptimisation, where models exploit quirks of the reward function rather than generalising. A common mitigation is iterated RLHF, which repeatedly retrains reward models with new feedback and re-optimises policies. Despite its growing use, the dynamics of overoptimisation in this setting remain unclear. We present the first systematic study of iterated RLHF, analysing how data transfer, reward choice, and policy initialisation affect outcomes. Using the AlpacaFarm benchmark, we find that overoptimisation decreases across iterations as reward models better approximate preferences, but performance gains plateau. Reinitialising from the base policy is robust yet constrains optimisation, while alternative strategies struggle to recover from early overoptimisation. These results provide practical guidance for more stable and generalisable RLHF pipelines.
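To make the procedure the abstract refers to concrete, the following is a minimal sketch of a generic iterated RLHF loop, not the authors' implementation. All helper names (collect_preferences, train_reward_model, optimise_policy) and the toy scoring logic are hypothetical stand-ins; the two flags mirror the design choices the paper studies (transferring preference data across iterations and reinitialising the policy from the base model).

```python
# Illustrative sketch of iterated RLHF (hypothetical helpers, toy logic only).
import random
from typing import Callable, List, Tuple

Policy = Callable[[str], str]
RewardModel = Callable[[str, str], float]


def base_policy(prompt: str) -> str:
    """Toy stand-in for a pretrained base policy."""
    return prompt[::-1]  # placeholder behaviour


def collect_preferences(policy: Policy, prompts: List[str]) -> List[Tuple[str, str, str]]:
    """Sample two responses per prompt and record (prompt, preferred, rejected).
    Preferences are random here; in practice they come from human annotators."""
    data = []
    for p in prompts:
        a, b = policy(p), policy(p + "!")
        preferred, rejected = (a, b) if random.random() < 0.5 else (b, a)
        data.append((p, preferred, rejected))
    return data


def train_reward_model(dataset: List[Tuple[str, str, str]]) -> RewardModel:
    """Toy reward model: scores a response by character overlap with previously
    preferred responses. A real pipeline would fit a preference model
    (e.g. Bradley-Terry) on the comparison data."""
    preferred_texts = [pref for _, pref, _ in dataset]

    def reward(prompt: str, response: str) -> float:
        if not preferred_texts:
            return 0.0
        return max(len(set(response) & set(t)) for t in preferred_texts)

    return reward


def optimise_policy(init_policy: Policy, reward: RewardModel, prompts: List[str]) -> Policy:
    """Stand-in for RL optimisation against the reward model: per prompt,
    return the higher-reward of two candidate responses from the initial policy."""
    def policy(prompt: str) -> str:
        candidates = [init_policy(prompt), init_policy(prompt + "!")]
        return max(candidates, key=lambda r: reward(prompt, r))
    return policy


def iterated_rlhf(prompts: List[str], iterations: int = 3,
                  reinitialise_from_base: bool = True,
                  accumulate_data: bool = True) -> Policy:
    """Iterated loop: collect feedback, retrain the reward model, re-optimise
    the policy. Flags control data transfer and policy initialisation."""
    policy: Policy = base_policy
    all_data: List[Tuple[str, str, str]] = []
    for _ in range(iterations):
        new_data = collect_preferences(policy, prompts)
        all_data = all_data + new_data if accumulate_data else new_data
        reward = train_reward_model(all_data)
        init = base_policy if reinitialise_from_base else policy
        policy = optimise_policy(init, reward, prompts)
    return policy


if __name__ == "__main__":
    final_policy = iterated_rlhf(["hello", "align me"], iterations=3)
    print(final_policy("hello"))
```

The `reinitialise_from_base=True` setting corresponds to the restart-from-base strategy the abstract describes as robust but optimisation-constraining; setting it to `False` continues from the previous iteration's policy, which is one of the alternative initialisation strategies the paper compares.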
Submission Number: 80