Keywords: provenance, reproducibility, model cards, machine learning library
TL;DR: How integrated provenance collection influenced the design of a production ML library.
Abstract: Data provenance is a formal record documenting how a digital artifact came to be in its present state. In the context of a machine learning model, provenance includes the data sources, data transformations, and algorithmic hyperparameters used to create the model. We present the design of Tribuo (Website: https://tribuo.org, GitHub: https://github.com/oracle/tribuo), an open-source, production ML library with integrated data provenance. Tribuo collects provenance automatically, requiring no user action or intervention. Using the provenance data, we developed systems for reproducing ML models and generating model cards. Like a type system, integrated provenance collection constrains design choices and provides utility in other parts of the system. Our integrated provenance approach has allowed us to automatically fix bugs in old models, detect non-obvious platform dependencies, and deeply understand and debug models built by other groups. Integrating provenance collection into the library influences the design and evolution of the system, which requires making trade-offs among provenance fidelity, provenance size, and developer flexibility.
Submission Number: 14