Keywords: alignment, model safety, rlhf, reinforcement learning from human feedback, red teaming
TL;DR: This study found that Reinforcement Learning from Human Feedback techniques applied to LLMs often worsened or failed to improve both covert and overt biases against African Americans, suggesting that RLHF may not align to the desired objective.
Abstract: Reinforcement Learning from Human Feedback (RLHF) is increasingly used to align large language models (LLMs) with human preferences. However, the effectiveness of RLHF in addressing underlying biases remains unclear. This study investigates the relationship between RLHF and both covert and overt biases in LLMs, particularly focusing on biases against African Americans. We applied various RLHF techniques (DPO, ORPO, RLOO) to Llama 3 8B and evaluated the resulting models using matched-guise probing and explicit bias testing.Our findings suggest that RLHF may not effectively align LLMs as intended. In most cases, RLHF either worsened both covert and overt biases or left them relatively unchanged compared to the base model. These results indicate that current RLHF techniques fail to address underlying biases introduced during pretraining, particularly for ambiguous objectives like harmlessness. Our study highlights the need for the development of improved techniques to ensure genuine alignment of LLMs with abstract alignment goals.
Submission Number: 6
Loading