$\texttt{LLaPA}$: Harnessing Language Models for Protein Enzyme Function

Jie Peng; Zijie Liu; Sukwon Yun; Yanyong Zhang; Tianlong Chen

27 Sept 2024 (modified: 16 Dec 2024) | ICLR 2025 Conference Withdrawn Submission | CC BY 4.0

Keywords: Protein Enzyme Function; Large Language Model; Retrieval Augmented Generation

TL;DR: We introduce "LLaPA," a language model that boosts enzyme prediction accuracy via EC number reformulation and a two-tiered retrieval system.

Abstract: Identifying protein enzyme functions, crucial for numerous applications, is challenging due to the rapid growth in protein sequences. Current methods either struggle with false positives or fail to generalize to lesser-known proteins and those with uncharacterized functions. To tackle these challenges, we propose $\texttt{LLaPA}$: a Protein-centric $\underline{L}$arge $\underline{L}$anguage and $\underline{P}$rotein $\underline{A}$ssistant for Enzyme Commission (EC) number prediction. $\texttt{LLaPA}$ uses a large multi-modal model to predict EC numbers accurately by reformulating the EC number format within the LLM's autoregressive framework. We introduce a dual-level protein-centric retrieval: the $\textit{protein-level}$ retrieval finds protein sequences with similar regions, and the $\textit{chemical-level}$ retrieval finds the corresponding molecules with relevant reaction information. By feeding the original protein together with the retrieved protein and molecule into the LLM, $\texttt{LLaPA}$ achieves improved prediction accuracy and better generalizability to lesser-known proteins. Evaluations on three public benchmarks show accuracy improvements of $\textbf{17.03\%}$, $\textbf{9.32\%}$, and $\textbf{38.64\%}$. These results highlight $\texttt{LLaPA}$'s ability to generalize to novel protein sequences and functionalities. Code is provided in the supplement.
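
The dual-level retrieval described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the k-mer Jaccard similarity, the function names (`retrieve_protein`, `build_prompt`), the example sequences, the molecule lookup table, and the prompt wording are all assumptions made for illustration only.

```python
from typing import Dict, Set, Tuple


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_protein(query: str, database: Dict[str, str], k: int = 3) -> Tuple[str, float]:
    """Protein-level retrieval (assumed stand-in): pick the database
    entry sharing the most k-mers with the query sequence."""
    q = kmers(query, k)
    scores = {pid: jaccard(q, kmers(seq, k)) for pid, seq in database.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


def build_prompt(query: str, retrieved_seq: str, molecule: str) -> str:
    """Combine the query protein, retrieved protein, and retrieved
    molecule into a single LLM input (hypothetical prompt format)."""
    return (
        f"Query protein: {query}\n"
        f"Similar protein: {retrieved_seq}\n"
        f"Related molecule: {molecule}\n"
        "Predict the EC number:"
    )


# Hypothetical toy data, not taken from the paper.
proteins = {"P1": "MKTAYIAKQL", "P2": "GGGGSGGGGS"}
molecules = {"P1": "CCO", "P2": "O=C=O"}  # chemical-level lookup (SMILES strings)

query = "MKTAYIAKQR"
pid, score = retrieve_protein(query, proteins)
prompt = build_prompt(query, proteins[pid], molecules[pid])
print(pid, round(score, 3))
print(prompt)
```

In this sketch the chemical-level step is reduced to a dictionary lookup keyed by the retrieved protein; a real system would presumably retrieve molecules from reaction annotations.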

Supplementary Material: zip

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Submission Number: 9408