Keywords: post-training, reinforcement learning, preference learning
TL;DR: A new multi-stage post-training recipe that scales preference learning and reinforcement learning with verifiable rewards.
Abstract: Language model post-training is applied to refine behaviors and unlock new skills across a wide range of language models, but open recipes for applying these techniques lag behind proprietary ones. The underlying training data and recipes for post-training are simultaneously the most important pieces of the puzzle and the portion with the least transparency. To bridge this gap, we introduce TÜLU 3, a family of fully open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. TÜLU 3, which builds on Llama 3.1 base models at 8B, 70B, and 405B parameters, achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, and Mistral at comparable model sizes. The 405B TÜLU 3 performs competitively against closed models such as GPT-4o-mini and Claude 3.5 Haiku, as well as large open models like DeepSeek V3. The training algorithms for our models include supervised finetuning (SFT), Direct Preference Optimization (DPO), and a novel method we call Reinforcement Learning with Verifiable Rewards (RLVR). We detail how varying the objective and model initialization affects the generalization and over-optimization of this new RL finetuning method. With TÜLU 3, we build a multi-task evaluation scheme for post-training with development and unseen evaluations, standard benchmark implementations, and substantial decontamination of existing open datasets on said benchmarks. The TÜLU 3 release includes model weights, a demo, and the complete recipe: datasets for diverse core skills, a robust toolkit for data curation and evaluation, the training code and infrastructure, and, most importantly, a detailed recipe for reproducing and further adapting the TÜLU 3 approach to more domains.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Award Nomination: true
Submission Number: 508