Abstract: A key challenge in training Large Language Models (LLMs) is properly aligning them with human preferences. Reinforcement Learning from Human Feedback (RLHF) uses pairwise comparisons from human annotators to train reward functions and has emerged as a popular alignment method. However, input datasets in RLHF can be unbalanced due to adversarial manipulation or inadvertent repetition. Therefore, we want RLHF algorithms to perform well even when the set of alternatives is not uniformly distributed. Drawing on insights from social choice theory, we introduce robustness to approximate clones, a desirable property of RLHF algorithms which requires that adding near-duplicate alternatives does not significantly change the learned reward function. We first demonstrate that the standard RLHF algorithm based on regularized maximum likelihood estimation (MLE) fails to satisfy this property. We then propose the weighted MLE, a new RLHF algorithm that modifies the standard regularized MLE by weighting alternatives based on their similarity to other alternatives. This new algorithm guarantees robustness to approximate clones while preserving desirable theoretical properties.
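To make the abstract's contrast concrete, the following is a minimal sketch of regularized Bradley-Terry MLE for reward learning from pairwise comparisons, with a per-alternative weighting hook of the kind the abstract describes. The specific weighting scheme (a product of the two alternatives' weights, and weights derived from a Gaussian-kernel similarity sum) is an illustrative assumption, not the paper's actual construction; function names such as `weighted_bt_mle` and `similarity_weights` are hypothetical.

```python
import numpy as np

def weighted_bt_mle(n_alts, comparisons, weights, reg=0.1, lr=0.1, steps=2000):
    """Fit per-alternative rewards under a Bradley-Terry model by
    (weighted) regularized maximum likelihood via gradient ascent.

    comparisons: list of (winner, loser) index pairs from annotators.
    weights:     per-alternative weights; uniform weights recover the
                 standard regularized MLE.
    """
    r = np.zeros(n_alts)
    for _ in range(steps):
        grad = -2.0 * reg * r          # gradient of the -reg * ||r||^2 term
        for w_idx, l_idx in comparisons:
            # P(winner beats loser) under the Bradley-Terry model
            p = 1.0 / (1.0 + np.exp(-(r[w_idx] - r[l_idx])))
            # hypothetical choice: scale each comparison by the product
            # of the two alternatives' weights
            c = weights[w_idx] * weights[l_idx]
            grad[w_idx] += c * (1.0 - p)
            grad[l_idx] -= c * (1.0 - p)
        r += lr * grad
    return r - r.mean()                # center rewards for identifiability

def similarity_weights(features, bandwidth=1.0):
    """Illustrative down-weighting of near-duplicates: an alternative's
    weight shrinks with its total kernel similarity to all alternatives."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    sim = np.exp(-(d ** 2) / (2 * bandwidth ** 2))
    return 1.0 / sim.sum(axis=1)
```

Under this toy weighting, duplicating an alternative roughly halves its weight, so its total influence on the fitted rewards stays about the same, which is the intuition behind robustness to approximate clones.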
Lay Summary: A key challenge in training Large Language Models (LLMs) is aligning them with human preferences, which helps an LLM give answers similar to those a human would give. Current methods perform this alignment by generating pairs of answers and asking humans which answer they prefer; the stated preferences are then used to help train the LLM. However, this approach may lead to different LLM behavior depending on which answers are shown to the humans. We study what happens when the set of answers shown to humans is biased, and we introduce a new method that gives the same result even when humans are shown a biased set of answers.
Primary Area: Social Aspects->Alignment
Keywords: RLHF, AI alignment, social choice
Submission Number: 1905