Evaluating Prompt Tuning for Conditional Protein Sequence Generation

Published: 06 Mar 2023, Last Modified: 05 May 2023. ICLR 2023 MLDD Workshop Poster.
Keywords: protein language models, protein design, prompt tuning, conditional sequence generation
TL;DR: We introduce a prompt tuning pipeline for protein sequence generation with language models and describe discrepancies between text-based and biological evaluation observed in our use case.
Abstract: Text generation models originally developed for natural language processing have proven to be successful in generating protein sequences. These models are often finetuned for improved performance on more specific tasks, such as generation of proteins from families unseen in training. Considering the high computational cost of finetuning separate models for each downstream task, prompt tuning has been proposed as an alternative. However, no openly available implementation of this approach compatible with protein language models has been previously published. Thus, we adapt an open-source codebase designed for NLP models to build a pipeline for prompt tuning on protein sequence data, supporting the protein language models ProtGPT2 and RITA. We evaluate our implementation by learning prompts for conditional sampling of sequences belonging to a specific protein family. This results in improved performance compared to the base model. However, in the presented use case, we observe discrepancies between text-based evaluation and predicted biological properties of the generated sequences, identifying open problems for principled assessment of protein sequence generation quality.
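The abstract describes learning a soft prompt that conditions a frozen protein language model on a target protein family. As a rough illustration of the general technique (not the paper's actual pipeline, which uses ProtGPT2/RITA), the following self-contained PyTorch sketch trains only a small block of prompt embeddings prepended to the token embeddings of a frozen toy causal model; all model sizes and hyperparameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy sketch of soft prompt tuning (assumption: simplified stand-in for a
# real protein LM). A frozen embedding + causal self-attention + LM head
# emulates the base model; only the soft prompt is optimized.
torch.manual_seed(0)
vocab_size, d_model, prompt_len, seq_len = 25, 16, 4, 10  # ~amino-acid alphabet

embed = nn.Embedding(vocab_size, d_model)
attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)
lm_head = nn.Linear(d_model, vocab_size)
for module in (embed, attn, lm_head):
    for p in module.parameters():
        p.requires_grad = False  # the base model stays frozen

# The soft prompt is the only trainable parameter, far fewer weights
# than finetuning the whole model.
soft_prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)
optimizer = torch.optim.Adam([soft_prompt], lr=0.1)

tokens = torch.randint(0, vocab_size, (8, seq_len))  # toy "protein" batch

def forward(tokens):
    tok_emb = embed(tokens)                                   # (B, T, D)
    prompt = soft_prompt.unsqueeze(0).expand(tokens.size(0), -1, -1)
    h = torch.cat([prompt, tok_emb], dim=1)                   # prepend prompt
    L = h.size(1)
    causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
    out, _ = attn(h, h, h, attn_mask=causal)                  # causal attention
    return lm_head(out)                                       # (B, P+T, V)

loss_fn = nn.CrossEntropyLoss()
losses = []
for _ in range(50):
    # positions prompt_len-1 .. P+T-2 predict tokens 0 .. T-1
    logits = forward(tokens)[:, prompt_len - 1 : -1]
    loss = loss_fn(logits.reshape(-1, vocab_size), tokens.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], "->", losses[-1])  # loss decreases as the prompt adapts
```

In practice the same idea is applied to a pretrained causal LM: the prompt embeddings are concatenated before the input embeddings at every forward pass, the base model's weights are never updated, and a separate prompt can be learned per downstream task (here, per protein family).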