DABS 2.0: Improved Datasets and Algorithms for Universal Self-Supervision

Alex Tamkin; Gaurab Banerjee; Mohamed Owda; Vincent Liu; Shashank Rammoorthy; Noah Goodman

Published: 17 Sept 2022, Last Modified: 23 May 2023, NeurIPS 2022 Datasets and Benchmarks, Readers: Everyone

Keywords: self-supervised learning, domain agnostic

TL;DR: We extend the DABS benchmark, presenting improved datasets and algorithms for universal self-supervision

Abstract: Universal self-supervised learning (SSL) algorithms hold enormous promise for making machine learning accessible to high-impact domains such as protein biology, manufacturing, and genomics. We present DABS 2.0: a set of improved datasets and algorithms for advancing research on universal SSL. We extend the recently introduced DABS benchmark with five real-world science and engineering domains: protein biology, bacterial genomics, multispectral satellite imagery, semiconductor wafers, and particle physics, bringing the total number of domains in the benchmark to twelve. We also propose a new universal SSL algorithm, Capri, and a generalized version of masked autoencoding, and apply both to all twelve domains, the most wide-ranging exploration of SSL yet. We find that multiple algorithms show gains across domains, outperforming previous baselines. In addition, we demonstrate the usefulness of DABS for the scientific study of SSL by investigating the optimal corruption rate for each algorithm, showing that the best setting varies by domain. Code will be released at http://github.com/alextamkin/dabs

Author Statement: Yes

URL: dabs.stanford.edu

Supplementary Material: pdf

Contribution Process Agreement: Yes

In Person Attendance: Yes

License: MIT License
