Abstract: Identifying the enzymatic functions of proteins is crucial for numerous applications, yet it is challenging because the number of known protein sequences is growing rapidly. Current methods either struggle with false positives or fail to generalize to lesser-known proteins and those with uncharacterized functions. To tackle these challenges, we propose $\texttt{LLaPA}$: a protein-centric $\underline{L}$arge $\underline{L}$anguage and $\underline{P}$rotein $\underline{A}$ssistant for Enzyme Commission (EC) number prediction. $\texttt{LLaPA}$ uses a large multi-modal model to predict EC numbers accurately by reformulating the EC number format within the LLM's autoregressive framework. We introduce a dual-level protein-centric retrieval: the $\textit{protein-level}$ retrieval finds protein sequences with similar regions, and the $\textit{chemical-level}$ retrieval finds corresponding molecules with relevant reaction information. By feeding the original protein together with the retrieved protein and molecule into the LLM, $\texttt{LLaPA}$ achieves improved prediction accuracy and better generalization to lesser-known proteins. Evaluations on three public benchmarks show accuracy improvements of $\textbf{17.03\%}$, $\textbf{9.32\%}$, and $\textbf{38.64\%}$, respectively. These results highlight $\texttt{LLaPA}$'s ability to generalize to novel protein sequences and functionalities. Code is provided in the supplement.
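For readers unfamiliar with the label format: an EC number is a four-level hierarchical code (for example, 3.4.21.4), which is what makes it amenable to level-by-level generation. The abstract does not specify how $\texttt{LLaPA}$ reformulates EC numbers, so the following is only a minimal sketch of the hierarchy itself, not the authors' method; the function names `split_ec_number` and `ec_prefixes` are hypothetical.

```python
from typing import List


def split_ec_number(ec: str) -> List[str]:
    """Split an EC number such as 'EC 3.4.21.4' into its four levels.

    Hypothetical helper for illustration only; it is not taken from the paper.
    """
    text = ec.strip()
    # Drop an optional leading "EC" label.
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated levels, got {ec!r}")
    return parts


def ec_prefixes(ec: str) -> List[str]:
    """Return the cumulative prefixes of an EC number, coarsest level first."""
    parts = split_ec_number(ec)
    return [".".join(parts[:i]) for i in range(1, 5)]


if __name__ == "__main__":
    print(split_ec_number("EC 3.4.21.4"))
    print(ec_prefixes("3.4.21.4"))
```

Each prefix names a progressively narrower enzyme class, so a model that emits the levels in order commits to the coarse class before the fine one.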
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Protein, Large Language Model
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 1405