Renku: a platform for sustainable data science

NeurIPS 2023 Track Datasets and Benchmarks, Submission 519, Authors

Published: 26 Sept 2023, Last Modified: 02 Feb 2024, NeurIPS 2023 Datasets and Benchmarks Spotlight
Keywords: reproducibility, reusability, platforms, sustainability, community, dataset development
TL;DR: Renku is a platform that enables and encourages sustainable data science and machine learning practices, from dataset creation to dissemination.
Abstract: The interplay of data and code is fundamental to machine learning (ML), but the context around datasets and the interactions between datasets and code are generally captured only in rudimentary form. Context such as how a dataset was prepared and created, what source data were used, what code was used in processing, how the dataset evolved, and where it has been used and reused can provide much insight, but this information is often poorly documented. This is unfortunate because it turns datasets into black boxes with potentially hidden characteristics that have downstream consequences. We argue that making dataset preparation more accessible and dataset usage easier to record and document would have significant benefits for the ML community: it would allow for greater diversity in datasets by inviting modification of published sources, simplify the use of alternative datasets and, in doing so, make results more transparent and robust, while allowing all contributions to be adequately credited. We present a platform, Renku, designed to support and encourage such sustainable development and use of data, datasets, and code, and we demonstrate its benefits through a few illustrative projects that span the spectrum from dataset creation to dataset consumption and showcasing.
Supplementary Material: pdf
Submission Number: 519