A domain-specific language for describing machine learning datasets

Published: 01 Jan 2023, Last Modified: 18 May 2025
Venue: J. Comput. Lang. 2023
License: CC BY-SA 4.0
Abstract: Highlights
• Data issues in ML raise the community’s interest in building data best practices.
• This work proposes a structured language to describe machine learning datasets.
• The language allows describing composition, provenance, and social concerns of data.
• A structured format eases the dataset comparison and the replication of ML results.
• The language is supported by DescribeML, a VSCode tool to aid in its usage.