Keywords: Synthetic Data, Small Language Models
TL;DR: A dataset of millions of diverse synthetic stories, leading to better small language models.
Abstract: We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million samples each in English and Japanese. By parameterizing prompts at multiple levels of abstraction, we achieve control over story characteristics at scale, inducing syntactic and semantic diversity. Ablations on a newly trained suite of tiny models show improved sample efficiency and model interpretability compared with the TinyStories dataset. We open-source all constituent parts of model creation, hoping to enable novel ways to study the end-to-end training process. As a byproduct, we advance the frontier on the fewest-parameter language model that outputs grammatical English.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/SimpleStories/SimpleStories
Code URL: https://github.com/simple-stories/simple_stories_train
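The dataset can be pulled directly from the Hugging Face Hub at the URL above. The sketch below is a minimal loading example using the `datasets` library; the split name ("train") and the text column ("story") are illustrative assumptions, not fields documented in the submission.

```python
# Minimal sketch: load SimpleStories from the Hugging Face Hub.
# Assumptions (not confirmed by the submission): a "train" split
# and a "story" text column exist.
from datasets import load_dataset

dataset = load_dataset("SimpleStories/SimpleStories", split="train")
print(dataset[0]["story"])  # inspect one synthetic story
```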
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 1915