How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data

Joel Niklaus, Atsuki Yamaguchi, Michal Stefánik, Guilherme Penedo, Hynek Kydlícek, Elie Bakouch, Lewis Tunstall, Edward Emanuel Beeching, Thibaud Frere, Colin Raffel, Leandro von Werra, Thomas Wolf

Published: 2026, Last Modified: 13 May 2026, CoRR 2026, CC BY-SA 4.0