Research Area: Alignment, Engineering for large LMs, Learning algorithms for LMs
Keywords: Reinforcement Learning from Human Feedback, RLHF
TL;DR: Enumerated 20+ implementation details of RLHF and reproduced the RLHF scaling behaviors reported in prior closed-source work
Abstract: This work is the first to openly reproduce the Reinforcement Learning from Human Feedback (RLHF) scaling behaviors reported in OpenAI's seminal TL;DR summarization work. We create an RLHF pipeline from scratch, enumerate over 20 key implementation details, and share key insights gained during the reproduction. Our RLHF-trained Pythia models demonstrate significant gains in response quality that scale with model size, with our 2.8B and 6.9B models outperforming OpenAI's released 1.3B checkpoint. Our results highlight best practices in data, training, and evaluation for RLHF.
We publicly release the trained model checkpoints and code to facilitate further research and accelerate progress in the field at https://github.com/vwxyzjn/summarize_from_feedback_details
Supplementary Material: zip
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 754