Starling-7B: Improving Helpfulness and Harmlessness with RLAIF

Published: 10 Jul 2024, Last Modified: 26 Aug 2024
Venue: COLM 2024
License: CC BY 4.0
Research Area: Alignment, Data, Evaluation, Safety, Learning algorithms for LMs
Keywords: High quality dataset, Alignment, RLHF
TL;DR: We present Nectar, a high-quality preference dataset, and open-source the reward model Starling-RM-7B and Starling-LM-7B, the current best-performing 7B language model.
Abstract: This paper presents Starling-7B, the current best-performing 7B chat model on Chatbot Arena, along with its training dataset Nectar, a high-quality preference dataset collected by prompting GPT-4 to rank responses. We propose an internal pairwise rating technique, in which the model considers all pairings before producing a single ranking decision, leveraging the proven pairwise rating capability of LLMs without the cost of individual pairwise calls. The resulting Nectar dataset comprises 182,954 chat prompts, each with seven responses from various models ranked by GPT-4, equating to 3.8 million high-quality pairwise comparisons. We introduce Starling-RM-7B and Starling-RM-34B, a suite of reward models trained with a K-wise preference loss on Nectar that outperform their pairwise-trained counterparts. We benchmark reward-model training pipelines across metrics such as human preference, truthfulness, and safety. Using Nectar and our new training pipeline, we fine-tune Openchat-3.5 to create Starling-LM-7B, achieving significant performance gains on MT-Bench, AlpacaEval, and human evaluation metrics. To facilitate research into and understanding of RLHF mechanisms, we open-source the Nectar dataset, the reward models, and the language models.
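For readers curious how a K-wise preference loss differs from the standard pairwise (Bradley-Terry) objective, below is a minimal PyTorch sketch of a Plackett-Luce style K-wise ranking loss of the kind the abstract describes. The function name `k_wise_preference_loss` and the tensor layout are illustrative assumptions, not the authors' released training code.

```python
import torch

def k_wise_preference_loss(rewards: torch.Tensor) -> torch.Tensor:
    """Plackett-Luce K-wise ranking loss (illustrative sketch).

    rewards: tensor of shape (batch, K) holding scalar reward-model
    scores for K responses to the same prompt, ordered best-to-worst
    according to the GPT-4 ranking.
    """
    batch_size, k = rewards.shape
    loss = rewards.new_zeros(())
    # At each rank position, the top remaining response should win a
    # softmax over itself and all lower-ranked responses.
    for i in range(k - 1):
        suffix = rewards[:, i:]                          # (batch, K - i)
        log_p_win = suffix[:, 0] - torch.logsumexp(suffix, dim=-1)
        loss = loss - log_p_win.mean()
    return loss / (k - 1)

# Toy usage: 2 prompts, K = 7 ranked responses each (as in Nectar).
scores = torch.randn(2, 7, requires_grad=True)
k_wise_preference_loss(scores).backward()
```

Note that for K = 2 this objective reduces exactly to the pairwise Bradley-Terry loss, which is why a single K-wise ranking can stand in for all K(K-1)/2 pairwise comparisons it implies.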
Supplementary Material: zip
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Flagged For Ethics Review: true
Ethics Comments: The paper mentions distilling from GPT-* output. I am not sure if that is allowed under the Terms of Service?
Submission Number: 1303