ProteinShake: Building datasets and benchmarks for deep learning on protein structures

Published: 26 Sept 2023, Last Modified: 16 Jan 2024
Venue: NeurIPS 2023 Datasets and Benchmarks Poster
Keywords: protein structure, benchmarking, datasets
TL;DR: Datasets and evaluation tasks for protein 3D structure data.
Abstract: We present ProteinShake, a Python software package that simplifies dataset creation and model evaluation for deep learning on protein structures. Users can create custom datasets or load an extensive set of pre-processed datasets from biological data repositories such as the Protein Data Bank (PDB) and AlphaFoldDB. Each dataset is associated with prediction tasks and evaluation functions covering a broad array of biological challenges. A benchmark on these tasks shows that pre-training almost always improves performance, that the optimal data modality (graphs, voxel grids, or point clouds) is task-dependent, and that models struggle to generalize to new structures. ProteinShake makes protein structure data easily accessible and comparison among models straightforward, providing challenging benchmark settings with real-world implications. ProteinShake is available at:
Supplementary Material: pdf
Submission Number: 816
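
The abstract names three data modalities (graphs, voxel grids, and point clouds). As a rough illustration of what each representation holds, the sketch below converts a toy list of residue coordinates into all three using only the Python standard library. It is not the ProteinShake API: the function names `to_point_cloud`, `to_graph`, and `to_voxel_grid`, the `eps` and `voxel_size` parameters, and the coordinates are all hypothetical and chosen for this example only.

```python
from itertools import combinations
from math import dist, floor


def to_point_cloud(coords):
    """Point cloud: simply the list of residue coordinates as tuples."""
    return [tuple(c) for c in coords]


def to_graph(coords, eps):
    """Epsilon-neighborhood graph: an edge (i, j), i < j, for every pair
    of residues closer than eps."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(coords), 2)
        if dist(a, b) < eps
    ]


def to_voxel_grid(coords, voxel_size):
    """Sparse voxel grid: maps each occupied voxel index (ix, iy, iz)
    to the indices of the residues that fall inside it."""
    grid = {}
    for i, (x, y, z) in enumerate(coords):
        key = (floor(x / voxel_size), floor(y / voxel_size), floor(z / voxel_size))
        grid.setdefault(key, []).append(i)
    return grid


# Made-up coordinates (in angstroms) for four residues; not real protein data.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]

points = to_point_cloud(coords)
edges = to_graph(coords, eps=8.0)
voxels = to_voxel_grid(coords, voxel_size=10.0)

print(edges)   # [(0, 1), (0, 2), (1, 2)]
print(voxels)  # {(0, 0, 0): [0, 1, 2], (2, 0, 0): [3]}
```

The same coordinates yield three different views: the point cloud keeps raw positions, the graph keeps only pairwise proximity, and the voxel grid keeps coarse spatial occupancy. Which of these trade-offs suits a model best is the kind of task-dependent choice the benchmark reports on.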