trlX: A Framework for Large Scale Reinforcement Learning from Human Feedback

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, EMNLP 2023 Main
Submission Type: Regular Long Paper
Submission Track: Language Modeling and Analysis of Language Models
Submission Track 2: Theme Track: Large Language Models and the Future of NLP
Keywords: RLHF, LLM, Framework
TL;DR: An open-source framework for RLHF fine-tuning at scale
Abstract: Reinforcement learning from human feedback (RLHF) uses human feedback to better align large language models with human preferences via online optimization against a learned reward model. Current RLHF paradigms rely on Proximal Policy Optimization (PPO), which quickly becomes challenging to implement and scale to large architectures. To address this difficulty, we present the AutoRLHF library, a feature-complete open-source framework for RLHF fine-tuning of models up to and exceeding 70 billion parameters. To this end, we implement support for multiple types of distributed training, including distributed data parallelism, model sharding, and tensor, sequential, and pipeline parallelism. Additionally, we implement compute- and memory-saving features, giving AutoRLHF the flexibility to support users with a wide range of compute resources. These include offline RL methods such as Implicit Language Q-Learning (ILQL), a compute-efficient alternative to PPO. We find that offline fine-tuning offers competitive performance relative to online algorithms while being easier to implement, train, and scale. To evaluate our framework, we train RLHF models on two separate well-known tasks using publicly available human preference data. Models trained with AutoRLHF achieve preference win rates over baselines comparable to those reported in the original works.
Submission Number: 2125