WildChat-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We introduce WildChat-50M, the largest public chat dataset to date, and use it to create Re-Wild, an SFT mix that outperforms even strong baselines such as the recent Tulu-3 SFT mixture from Allen AI.
Abstract: Large language model (LLM) post-training can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data-generating models and LLM judges. To close this gap, we introduce WildChat-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating Re-Wild, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples.
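To make the data-collection step concrete, here is a minimal sketch of re-answering WildChat prompts with an open-weight data-generating model (DGM). The vLLM-based setup, the `allenai/WildChat-1M` field names, and the Qwen model choice are illustrative assumptions, not the authors' exact pipeline.

```python
# Hypothetical sketch: re-answer prompts from the original WildChat dataset with an
# open-weight DGM, as described in the abstract. Field names and generation settings
# are assumptions for illustration only.
from datasets import load_dataset
from vllm import LLM, SamplingParams

# Public prompt source; each row's "conversation" holds a list of role/content turns.
wildchat = load_dataset("allenai/WildChat-1M", split="train")

# Any open-weight chat model (the paper spans 0.5B to 104B parameters) can act as the DGM.
dgm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=1024)

def first_user_turn(conv):
    """Extract the opening user message from a WildChat conversation."""
    return next(turn["content"] for turn in conv if turn["role"] == "user")

prompts = [
    [{"role": "user", "content": first_user_turn(row["conversation"])}]
    for row in wildchat.select(range(8))  # small demo slice
]

# vLLM's chat interface applies the model's chat template before generating.
outputs = dgm.chat(prompts, params)
for out in outputs:
    print(out.outputs[0].text[:200])
```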
Lay Summary: Lots of people these days are training AI on synthetic data. But what distinguishes useful synthetic data from ... the other kind? To answer this question, we generated a lot of synthetic data using a lot of different LLMs (which we called DGMs, for Data Generating Models) and compared the performance of new LLMs trained on that synthetic data, both against each other and against models trained on state-of-the-art open-source datasets. We used what we learned to curate a new dataset, Re-Wild, which was better, by some reasonable measures, than existing public SFT datasets. We made all of our data and all of our models public, so that anyone can try training their own.
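For readers who want to try training their own model on the released data, below is a minimal SFT sketch using TRL. The dataset ID is a placeholder (see the linked repository for the actual released artifacts), and the trainer configuration is an assumption about tooling rather than the authors' training recipe.

```python
# Minimal SFT sketch under assumed tooling (TRL). The dataset ID and hyperparameters
# are placeholders, not the authors' published configuration.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset ID -- substitute the Re-Wild mix published by the authors.
train_ds = load_dataset("your-org/re-wild-sft-mix", split="train")

config = SFTConfig(
    output_dir="rewild-sft",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
)

# TRL's SFTTrainer accepts conversational datasets with a "messages" column directly.
trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",  # assumed 8B-scale base model for illustration
    train_dataset=train_ds,
    args=config,
)
trainer.train()
```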
Link To Code: https://github.com/penfever/wildchat-50m
Primary Area: Deep Learning->Foundation Models
Keywords: deep learning, machine learning, foundation models, llms, large language models, datasets, sft, post-training
Submission Number: 4631