Dynaword: From One-shot to Continuously Developed Datasets

Published: 23 Sept 2025, Last Modified: 23 Sept 2025 | To be submitted (ArXiv) | Everyone | CC BY-SA 4.0
Abstract: Large-scale datasets are foundational for research and development in natural language processing. However, current approaches face three key challenges: (1) reliance on ambiguously licensed sources restricting use, sharing, and derivative works; (2) static dataset releases that prevent community contributions and diminish longevity; and (3) quality assurance processes restricted to publishing teams rather than leveraging community expertise. To address these limitations, we introduce two contributions: the Dynaword approach and Danish Dynaword. The Dynaword approach is a framework for creating large-scale, open datasets that can be continuously updated through community collaboration. Danish Dynaword is a concrete implementation that validates this approach and demonstrates its potential. Danish Dynaword contains over four times as many tokens as comparable releases, is exclusively openly licensed, and has received multiple contributions across industry and research. The repository includes lightweight tests to ensure data formatting, quality, and documentation, establishing a sustainable framework for ongoing community contributions and dataset evolution.