Dynaword: From One-shot to Continuously Developed Datasets

Kenneth Enevoldsen, Kristian Nørgaard Jensen, Jan Kostkan, Balázs Szabó, Márton Kardos, Kirsten Vad, Johan Heinsen, Andrea Blasi Núñez, Gianluca Barmina, Jacob Nielsen, Rasmus Larsen, Peter Bjerregaard Vahlstrup, Per Møldrup-Dalum, Desmond Elliott, Lukas Galke, Peter Schneider-Kamp, Kristoffer Nielbo

Published: 23 Sept 2025, Last Modified: 07 Feb 2026, To be submitted (ArXiv), License: CC BY-SA 4.0

Abstract: Large-scale datasets are foundational for research and development in natural language processing. However, current approaches face three key challenges: (1) reliance on ambiguously licensed sources restricting use, sharing, and derivative works; (2) static dataset releases that prevent community contributions and diminish longevity; and (3) quality assurance processes restricted to publishing teams rather than leveraging community expertise. To address these limitations, we introduce two contributions: the Dynaword approach and Danish Dynaword. The Dynaword approach is a framework for creating large-scale, open datasets that can be continuously updated through community collaboration. Danish Dynaword is a concrete implementation that validates this approach and demonstrates its potential. Danish Dynaword contains over four times as many tokens as comparable releases, is exclusively openly licensed, and has received multiple contributions across industry and research. The repository includes lightweight tests to ensure data formatting, quality, and documentation, establishing a sustainable framework for ongoing community contributions and dataset evolution.