($\texttt{PEEP}$) $\textbf{P}$redicting $\textbf{E}$nzym$\textbf{e}$ $\textbf{P}$romiscuity with its Molecule Mate – an Attentive Metric Learning Solution

22 Sept 2023 (modified: 11 Feb 2024)Submitted to ICLR 2024EveryoneRevisionsBibTeX
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Protein Engineering; Metric Learning;
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a metric learning framework that incorporates biological priors to improve the quality of functional annotations of proteins.
Abstract: Annotating the functions of proteins (e.g., enzymes) is a fundamental challenge, due to their diverse functionalities and rapidly increased number of protein sequences in databases. Traditional approaches have limited capability and suffer from false positive predictions. Recent machine learning (ML) methods reach satisfactory prediction accuracy but still fail to generalize, especially for less-studied proteins and those with previously uncharacterized functions or promiscuity. To address these pain points, we propose a novel ML algorithm, PEEP, to predict enzyme promiscuity, which integrates biology priors of protein functionality to regularize the model learning. To be specific, at the input level, PEEP fuses the corresponding molecule into protein embeddings to gain their reaction information; at the model level, a tailored self-attention is leveraged to capture importance residues which we found are aligned with the active site in protein pocket structure; at the objective level, we embed functionality label hierarchy into metric learning objectives by imposing larger distance margin between proteins that have less functionality in common. PEEP is extensively validated on three public benchmarks, achieving up to 4.6%,3.1%,3.7% improvements on F-1 scores compared to existing methods. Moreover, it demonstrates impressive generalization to unseen protein sequences with unseen functionalities. Codes are included in the supplement.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6470
Loading