Keywords: post-training, reinforcement learning, preference learning
TL;DR: A new multi-stage post-training recipe that scales preference learning and reinforcement learning with verifiable rewards.
Abstract: Language model post-training is applied to refine behaviors and unlock new skills across a wide range of language models, but open recipes for applying these techniques lag behind proprietary ones. The underlying training data and recipes for post-training are simultaneously the most important pieces of the puzzle and the portion with the least transparency. To bridge this gap, we introduce TÜLU 3, a family of fully open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. TÜLU 3, which builds on Llama 3.1 base models at 8B, 70B, and 405B parameters, achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, and Mistral at comparable model sizes. The 405B TÜLU 3 performs competitively against closed models such as GPT-4o-mini and Claude 3.5 Haiku, as well as large open models like DeepSeek V3. The training algorithms for our models include supervised finetuning (SFT), Direct Preference Optimization (DPO), and a novel method we call Reinforcement Learning with Verifiable Rewards (RLVR). We detail how varying the objective and model initialization affects the generalization and over-optimization of this new RL finetuning method. With TÜLU 3, we build a multi-task evaluation scheme for post-training with development and unseen evaluations, standard benchmark implementations, and substantial decontamination of existing open datasets on said benchmarks. The TÜLU 3 release includes model weights, a demo, and the complete recipe: datasets for diverse core skills, a robust toolkit for data curation and evaluation, the training code and infrastructure, and, most importantly, a detailed recipe for reproducing and further adapting the TÜLU 3 approach to more domains.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Award Nomination: true
Submission Number: 508