DataBright: Towards a Global Exchange for Decentralized Data Ownership and Trusted Computation

David Dao, Dan Alistarh, Claudiu Musat, Ce Zhang

Feb 12, 2018 (modified: Jun 04, 2018) ICLR 2018 Workshop Submission readers: everyone
  • Abstract: It is safe to assume that, for the foreseeable future, machine learning, especially deep learning, will remain both data- and computation-hungry. In this paper, we ask: Can we build a global exchange where everyone can contribute computation and data to train the next generation of machine learning applications? We present an early but running prototype of DataBright, a system that turns the creation of training examples and the sharing of computation into an investment mechanism. Unlike most crowdsourcing platforms, where contributors are paid when they submit their data, DataBright pays dividends whenever a contributor's data or hardware is used by someone to train a machine learning model. The contributor becomes a shareholder in the dataset they created. To enable the measurement of usage, a computation platform that contributors can trust is also necessary. DataBright thus merges a data market and a trusted computation market. We illustrate that trusted computation can enable the creation of an AI market, where each data point has an exact value that should be paid to its creator. DataBright allows data creators to retain ownership of their contribution and attaches to it a measurable value. The value of the data is given by its utility in subsequent distributed computation done on the DataBright computation market. The computation market allocates tasks and subsequent payments to pooled hardware. This leads to the creation of a decentralized AI cloud. Our experiments show that trusted hardware such as Intel SGX can be added to the usual ML pipeline with no additional costs. We use this setting to orchestrate distributed computation that enables the creation of a computation market. DataBright is available for download at
  • Keywords: trusted computation, data market, model parallelism, data parallelism, distributed training
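
The dividend idea in the abstract, where contributors are paid each time their data is used for training, can be illustrated with a minimal sketch. The abstract does not state how a payment is divided, so the pro-rata rule below (each contributor receives a share of the training fee proportional to the number of their data points used) is an assumption made only for illustration. The function `split_dividends`, the `owner_of` mapping, and the example names are hypothetical and are not part of DataBright.

```python
from collections import Counter
from fractions import Fraction


def split_dividends(fee, used_points, owner_of):
    """Split `fee` among contributors in proportion to how many of their
    data points a training job used.

    The pro-rata rule is an assumption for illustration; it is not
    taken from the DataBright paper.
    """
    # Count how many of the used data points belong to each contributor.
    counts = Counter(owner_of[point] for point in used_points)
    total = sum(counts.values())
    if total == 0:
        return {}  # no data was used, so nothing is paid out
    # Exact rational arithmetic, so the shares always sum to the fee.
    return {owner: Fraction(fee) * n / total for owner, n in counts.items()}


# Hypothetical example: three data points owned by two contributors.
owner_of = {"x1": "alice", "x2": "alice", "x3": "bob"}
payout = split_dividends(90, ["x1", "x2", "x3"], owner_of)
```

A utility-based valuation, as the abstract suggests, would replace the simple count with a per-point measure of how much each data point contributed to the trained model.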