Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Protein design, Large language model
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Large Language Models (LLMs), such as ChatGPT, excel in cross-modal tasks thanks to their strong abilities in natural language comprehension, generalization, and reasoning. Meanwhile, the wealth of human-curated protein knowledge in text form presents a unique opportunity for LLMs to contribute to advanced protein design. In this work, we propose a new LLM-based framework, NL2ProGPT, for macromolecular protein sequence generation that bridges the domain gap between natural and protein languages. Specifically, we first combine protein functions and properties to create specific text guidelines for designing the protein, ensuring that generation follows precise controls. Second, to form a more informative and generalizable protein description, we explicitly inject protein structural information by clustering the embeddings from pre-trained protein language models. Third, we train a reward model to align the protein language model with the Rosetta energy function, in an RLAIF (reinforcement learning from AI feedback) fashion. We empirically verify the effectiveness of NL2ProGPT from three aspects: (1) it outperforms existing protein sequence design methods in different evaluations; (2) it exhibits more than 90% consistency in text-to-protein generation; (3) it has effective exploration potential in disordered regions.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1229