Keywords: biology, large language model
Abstract: The prediction of catalytic interactions between enzymes and their substrates underpins numerous biological and medical processes, but must contend with the inherent multi-modality of biological information. Most link prediction methods are unimodal and remain inadequate for complex biochemical reasoning and the integration of heterogeneous biological modalities. Meanwhile, although current protein language models perform well at embedding amino acid sequences, they fall short in reasoning. To tackle these issues, we propose \textbf{CataRAG}, the first retrieval-augmented biological reasoning framework for catalytic interaction prediction, which builds on multi-modal optimal transport theory to systematically integrate three key modalities, i.e., amino acid sequences, molecular structures of enzymatic substrates, and biological knowledge graphs, and to connect heterogeneous biological data representations.
A weighted average of the final retrieval results from these heterogeneous sources then augments the complex biological reasoning capabilities of LLMs without retraining.
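The weighted-average fusion of retrieval results can be sketched as follows. This is a minimal illustrative example, not CataRAG's actual implementation; the modality names, weights, and scores are all hypothetical.

```python
# Hypothetical sketch: weighted-average fusion of per-candidate retrieval
# scores from three heterogeneous sources (sequence, structure, knowledge
# graph). All names and numbers below are illustrative assumptions.

def fuse_retrieval_scores(scores_by_modality, weights):
    """Combine retrieval scores from heterogeneous sources by a weighted
    average and return candidates ranked best-first."""
    fused = {}
    for modality, scores in scores_by_modality.items():
        w = weights[modality]
        for candidate, s in scores.items():
            fused[candidate] = fused.get(candidate, 0.0) + w * s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Toy scores for two candidate substrates from three modalities.
scores = {
    "sequence":  {"A": 0.9, "B": 0.4},
    "structure": {"A": 0.2, "B": 0.8},
    "kg":        {"A": 0.5, "B": 0.5},
}
weights = {"sequence": 0.5, "structure": 0.3, "kg": 0.2}
ranking = fuse_retrieval_scores(scores, weights)
```

The top-ranked candidates would then be passed as retrieved context to the LLM, which requires no retraining since the fusion happens entirely outside the model.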
Extensive experiments on catalytic reaction prediction tasks demonstrate that CataRAG significantly outperforms existing state-of-the-art LLMs, RAG baselines, and graph neural network models, with ablation studies validating the crucial role of all three modalities.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 9138