Protein Captioning: Bridging the Gap between Protein Sequences and Natural Languages

22 Sept 2023 (modified: 25 Mar 2024), ICLR 2024 Conference Withdrawn Submission
Keywords: natural language processing, protein representation learning, protein captioning
Abstract: We introduce the task of Protein Captioning, an easy-to-understand and flexible approach to protein analysis. Compared to specific protein recognition or classification tasks, such as enzyme reaction classification and gene ontology term prediction, protein captioning provides comprehensive textual descriptions of proteins, thus playing a key role in bridging the gap between protein sequences and natural languages. To address the problem, we propose a simple yet effective method, Protein-to-Text Generative Pre-trained Transformer (P2T-GPT), which translates the chain of amino acid residues in a protein into a sequence of natural language words, i.e., text. To evaluate protein captioning, we collect the ProteinCap dataset, which contains 94,454 protein-text pairs. Experiments on ProteinCap demonstrate the effectiveness of the proposed P2T-GPT on protein captioning. As minor contributions, first, P2T-GPT provides a way to connect protein science with Large Language Models (LLMs): by interfacing with ChatGPT, our method can answer questions about a given protein in a conversational manner. Second, we show that protein captioning can serve as a pre-training task that benefits a range of downstream tasks, to a certain extent. The code has been submitted in the supplementary material and will be made publicly available.
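The abstract describes P2T-GPT as translating a chain of amino acid residues into word tokens. The input side of any such sequence-to-text model requires mapping a one-letter residue string to integer ids before it reaches a transformer. The sketch below illustrates that step only; the vocabulary layout, special-token ids, and function names are illustrative assumptions, not details from the paper's actual implementation.

```python
# Hypothetical residue tokenizer for a protein-to-text model such as
# P2T-GPT: maps the 20 standard amino acids (one-letter codes) to
# integer ids, with padding to a fixed length.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

# Reserved ids are an illustrative choice, not the paper's scheme.
PAD_ID, UNK_ID = 0, 1
RESIDUE_TO_ID = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}

def tokenize_protein(sequence: str, max_len: int = 32) -> list[int]:
    """Convert a one-letter residue string to a padded list of ids.

    Non-standard residue codes (e.g. 'X') map to UNK_ID; sequences
    longer than max_len are truncated.
    """
    ids = [RESIDUE_TO_ID.get(aa, UNK_ID) for aa in sequence.upper()[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))
```

In a full pipeline, these ids would feed a transformer encoder whose output conditions an autoregressive text decoder that emits the caption word by word.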
Supplementary Material: zip
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5002