Abstract: Large-scale datasets are foundational for research and
development in natural language processing. However, current
approaches face three key challenges: (1) reliance on ambiguously
licensed sources restricting use, sharing, and derivative works;
(2) static dataset releases that prevent community contributions and
diminish longevity; and (3) quality assurance processes restricted to
publishing teams rather than leveraging community expertise.
To address these limitations, we introduce two contributions: the
Dynaword approach and Danish Dynaword. The Dynaword approach is a
framework for creating large-scale, open datasets that can be
continuously updated through community collaboration. Danish Dynaword
is a concrete implementation that validates this approach and
demonstrates its potential. Danish Dynaword contains over four times
as many tokens as comparable releases, is exclusively openly licensed,
and has received multiple contributions from both industry and
research. The repository includes lightweight tests that check data
formatting, quality, and documentation, establishing a sustainable
framework for ongoing community contributions and dataset evolution.