Split and Merge Proxy: pre-training protein inter-chain contact prediction by mining rich information from monomer data

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: pdf
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Protein Bioinformatics, Protein Inter-chain Contact Prediction, Pre-training
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: This paper introduces the Split and Merge Proxy (SMP), a simple yet effective pre-training framework for protein inter-chain contact prediction that addresses the limited number of multimer training structures by exploiting rich monomer information.
Abstract: Protein inter-chain contact prediction is a key computational technique for analyzing the function of protein multimers, but its accuracy remains low. A major obstacle is that the amount of available training data falls short of what deep-learning-based methods require, because obtaining structural information for multimers is expensive. In this paper, we address this data bottleneck cheaply by borrowing rich information from monomer data. To exploit monomer (single-chain) data for this multimer (multi-chain) problem, we propose a simple but effective pre-training method called the Split and Merge Proxy (SMP), which uses monomer data to construct a proxy task for model pre-training. The proxy task cuts a monomer into two sub-parts, called a pseudo multimer, and pre-trains the model to merge them back together by predicting their pseudo inter-chain contacts. The pre-trained model is then used to initialize our target task, protein inter-chain contact prediction. Because the proxy task is consistent with the final target, the method yields a stronger pre-trained model for subsequent fine-tuning and brings significant performance gains. Extensive experiments validate the effectiveness of our method: the model outperforms the state-of-the-art (SOTA) method by 11.40% and 2.97% on the P@$L/10$ metric for the bound benchmarks DIPS-Plus and CASP-CAPRI, respectively. It also achieves almost 1.5 times the performance of the SOTA approach on the harder unbound benchmark DB5. Finally, we apply SMP to docking and interaction site prediction to verify that it generalizes to other multimer-related tasks. The code, model, and pre-training data will be released after the paper is accepted.
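To make the proxy task concrete, below is a minimal sketch of how a pseudo multimer and its pseudo inter-chain contact labels could be constructed from a single monomer structure. The function name `split_monomer`, the fixed split point, and the 8 Å contact cutoff are illustrative assumptions, not the authors' exact recipe.

```python
# Sketch of pseudo-multimer construction for the SMP proxy task (assumed details).
import numpy as np

CONTACT_THRESHOLD = 8.0  # Å; a common residue-residue contact cutoff (assumed)


def split_monomer(coords: np.ndarray, split_idx: int):
    """Split one chain's residue coordinates (L x 3) into two pseudo chains
    and derive the pseudo inter-chain contact map from the monomer's own
    residue-residue distances."""
    chain_a, chain_b = coords[:split_idx], coords[split_idx:]
    # Pairwise distances between residues of the two pseudo chains.
    dists = np.linalg.norm(chain_a[:, None, :] - chain_b[None, :, :], axis=-1)
    # Binary contact labels used as the pre-training target.
    contacts = (dists < CONTACT_THRESHOLD).astype(np.float32)
    return chain_a, chain_b, contacts


# Usage example: split a 300-residue monomer roughly in half.
coords = np.random.rand(300, 3) * 50.0            # stand-in for real residue coordinates
a, b, pseudo_contacts = split_monomer(coords, split_idx=150)
print(a.shape, b.shape, pseudo_contacts.shape)    # (150, 3) (150, 3) (150, 150)
```

Because the merged prediction target has the same form as a real inter-chain contact map, the pre-trained model can initialize the downstream multimer predictor directly.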
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4581