Evaluating Representation Learning on the Protein Structure Universe

Published: 16 Jan 2024, Last Modified: 11 Feb 2024ICLR 2024 posterEveryoneRevisionsBibTeX
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Protein, Representation, Learning, Protein Structure
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: Benchmark for evaluating pre-training and representation learning on protein structures.
Abstract: Protein structure representation learning is the foundation for promising applications in drug discovery, protein design, and protein function prediction. However, there remains a need for a robust, standardised benchmark to track the progress of new and established methods with greater granularity and relevance to downstream applications. In this work, we introduce a comprehensive and open benchmark suite for evaluating protein structure representation learning methods. We provide several pre-training methods, downstream tasks and pre-training corpora comprised of both experimental and predicted structures, offering a balanced challenge to representation learning algorithms. These tasks enable the systematic evaluation of the quality of the learned embeddings, the structural and functional relationships captured, and their usefulness in downstream tasks. We benchmark state-of-the-art protein-specific and generic geometric Graph Neural Networks and the extent to which they benefit from different types of pre-training. We find that pre-training consistently improves the performance of both rotation-invariant and equivariant models, and that equivariant models seem to benefit even more from pre-training compared to invariant models. We aim to establish a common ground for the machine learning and computational biology communities to collaborate, compare, and advance protein structure representation learning. By providing a standardised and rigorous evaluation platform, we expect to accelerate the development of novel methodologies and improve our understanding of protein structures and their functions. The codebase incorporates several engineering contributions which considerably reduces the barrier to entry for pre-training and working with large structure-based datasets. Our benchmark is available at: https://anonymous.4open.science/r/ProteinWorkshop-B8F5/
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6130