Starling-7B: Improving Helpfulness and Harmlessness with RLAIF

Published: 10 Jul 2024, Last Modified: 26 Aug 2024
Venue: COLM 2024
License: CC BY 4.0
Research Area: Alignment, Data, Evaluation, Safety, Learning algorithms for LMs
Keywords: High quality dataset, Alignment, RLHF
TL;DR: We present Nectar, a high-quality preference dataset, and open-source the reward model Starling-RM-7B and Starling-LM-7B, the current best-performing 7B language model.
Abstract: This paper presents Starling-7B, the current best-performing 7B chat model on Chatbot Arena, along with its training dataset Nectar, a high-quality preference dataset collected by prompting GPT-4 to rank responses. We propose an internal pairwise rating technique, in which the model considers all pairings before producing a single ranking decision, leveraging the proven pairwise rating capability of LLMs without the cost of individual pairwise calls. The resulting Nectar dataset comprises 182,954 chat prompts, each with seven responses from various models ranked by GPT-4, equating to 3.8 million high-quality pairwise comparisons. We introduce Starling-RM-7B and Starling-RM-34B, a suite of reward models trained with a K-wise preference loss on Nectar that outperform their pairwise-trained counterparts. We benchmark reward-model training pipelines across metrics such as human preference, truthfulness, and safety. Using Nectar and our new training pipeline, we fine-tune Openchat-3.5 to create Starling-LM-7B, achieving significant performance gains on MT-Bench, AlpacaEval, and human evaluation metrics. To facilitate research into and understanding of RLHF mechanisms, we open-source the Nectar dataset, the reward models, and the language models.
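For readers curious how a K-wise preference loss differs from the standard pairwise (Bradley-Terry) objective, below is a minimal PyTorch sketch of a Plackett-Luce style K-wise ranking loss of the kind the abstract describes. The function name `k_wise_preference_loss` and the tensor layout are illustrative assumptions, not the authors' released training code.

```python
import torch

def k_wise_preference_loss(rewards: torch.Tensor) -> torch.Tensor:
    """Plackett-Luce K-wise ranking loss (illustrative sketch).

    rewards: tensor of shape (batch, K) holding scalar reward-model
    scores for K responses to the same prompt, ordered best-to-worst
    according to the GPT-4 ranking.
    """
    batch_size, k = rewards.shape
    loss = rewards.new_zeros(())
    # At each rank position, the top remaining response should win a
    # softmax over itself and all lower-ranked responses.
    for i in range(k - 1):
        suffix = rewards[:, i:]                          # (batch, K - i)
        log_p_win = suffix[:, 0] - torch.logsumexp(suffix, dim=-1)
        loss = loss - log_p_win.mean()
    return loss / (k - 1)

# Toy usage: 2 prompts, K = 7 ranked responses each (as in Nectar).
scores = torch.randn(2, 7, requires_grad=True)
k_wise_preference_loss(scores).backward()
```

Note that for K = 2 this objective reduces exactly to the pairwise Bradley-Terry loss, which is why a single K-wise ranking can stand in for all K(K-1)/2 pairwise comparisons it implies.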
Supplementary Material: zip
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Flagged For Ethics Review: true
Ethics Comments: The paper mentions distilling from GPT-* output. I am not sure if that is allowed under the Terms of Service?
Submission Number: 1303