Illuminating Protein Function Prediction through Inter-Protein Similarity Modeling

Zuobai Zhang; Jiarui Lu; Vijil Chenthamarakshan; Aurelie Lozano; Payel Das; Jian Tang

21 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Protein function prediction, retrieval-based methods, transductive learning

Abstract: Proteins, central to biological systems, are complex because of the interplay between sequence, structure, and function, all shaped by physics and evolution; this complexity makes accurate function prediction challenging. Recent advances in deep learning show substantial potential for precise function prediction by learning representations from large collections of protein sequences and structures. Nevertheless, practical function annotation still relies heavily on modeling protein similarity with sequence or structure retrieval tools, owing to their accuracy and interpretability. To study the effect of inter-protein similarity modeling, we comprehensively benchmark retriever-based methods against predictors on protein function tasks and demonstrate the potency of retriever-based approaches. Inspired by these findings, we introduce ProtIR, a variational pseudo-likelihood framework designed to improve function prediction through iterative refinement between predictors and retrievers. ProtIR combines the strengths of both predictors and retrievers, improving on vanilla predictor-based methods by around 10%. Furthermore, it achieves performance comparable to state-of-the-art protein language model-based methods with significantly less training time, highlighting the efficacy of our approach.

Submission Number: 3949
