Contrasting Sequence with Structure: Pre-training Graph Representations with PLMs

Published: 28 Oct 2023, Last Modified: 04 Dec 2023NeurIPS2023-AI4Science PosterEveryoneRevisionsBibTeX
Keywords: protein structure; multimodal; graph neural network; contrastive learning
TL;DR: We propose a contrastive learning framework to pre-train graph neural networks by leveraging protein language models, and empirically demonstrate the effectiveness of the obtained structural embeddings for downstream tasks.
Abstract: Understanding protein function is vital for drug discovery, disease diagnosis, and protein engineering. While Protein Language Models (PLMs) pre-trained on vast protein sequence datasets have achieved remarkable success, equivalent Protein Structure Models (PSMs) remain underrepresented. We attribute this to the relative lack of high-confidence structural data and suitable pre-training objectives. In this context, we introduce BioCLIP, a contrastive learning framework that pre-trains PSMs by leveraging PLMs, generating meaningful per-residue and per-chain structural representations. When evaluated on tasks such as protein-protein interaction, Gene Ontology annotation, and Enzyme Commission number prediction, BioCLIP-trained PSMs consistently outperform models trained from scratch and further enhance performance when merged with sequence embeddings. Notably, BioCLIP approaches, or exceeds, specialized methods across all benchmarks using its singular pre-trained design. Our work addresses the challenges of obtaining quality structural data and designing self-supervised objectives, setting the stage for more comprehensive models of protein function. Source code is publicly available.
Submission Track: Original Research
Submission Number: 81