trlX: A Framework for Large Scale Reinforcement Learning from Human Feedback

Alexander Havrilla; Maksym Zhuravinskyi; Duy Van Phung; Aman Tiwari; Jonathan Tow; Stella Biderman; Quentin Gregory Anthony; Louis Castricato

Published: 07 Oct 2023, Last Modified: 01 Dec 2023

Venue: EMNLP 2023 Main

Submission Type: Regular Long Paper

Submission Track: Language Modeling and Analysis of Language Models

Submission Track 2: Theme Track: Large Language Models and the Future of NLP

Keywords: RLHF, LLM, Framework

TL;DR: An open-source framework for RLHF fine-tuning at scale

Abstract: Reinforcement learning from human feedback (RLHF) uses human feedback to better align large language models with human preferences via online optimization against a learned reward model. Current RLHF paradigms rely on Proximal Policy Optimization (PPO), which quickly becomes challenging to implement and scale up to large architectures. To address this difficulty, we present the AutoRLHF library, a feature-complete open-source framework for RLHF fine-tuning of models up to and exceeding 70 billion parameters. To this end, we implement support for multiple types of distributed training, including distributed data parallelism and model sharding, as well as tensor, sequential, and pipeline parallelism. Additionally, we implement compute- and memory-saving features, giving AutoRLHF the flexibility to support users with a wide range of compute resources. These include offline RL methods such as Implicit Language Q Learning (ILQL) as a compute-efficient alternative to PPO. We find that offline fine-tuning offers competitive performance relative to online algorithms while being easier to implement, train, and scale. To evaluate our framework, we train RLHF models on two separate well-known tasks using publicly available human preference data. Models trained with AutoRLHF achieve preference win rates over baselines comparable to those reported in the original works.

Submission Number: 2125
