Abstract: The core challenge of de novo protein design lies in creating proteins
with specific functions or properties, guided by certain conditions.
Current models generate proteins using structural and
evolutionary guidance, which provides only indirect conditions on functions and properties. However, textual annotations of
proteins, especially the annotations for protein domains, which
directly describe the protein’s high-level functionalities, properties,
and their correlation with target amino acid sequences, remain
unexplored in the context of protein design tasks. In this paper,
we propose Protein-Annotation Alignment Generation (PAAG), a
multi-modality protein design framework that integrates the textual
annotations extracted from protein databases for controllable generation in sequence space. Specifically, using a multi-level alignment
module, PAAG can explicitly generate proteins containing specific
domains conditioned on the corresponding domain annotations,
and can even design novel proteins with flexible combinations of
different kinds of annotations. Our experimental results demonstrate
the superiority of the aligned protein representations from PAAG
across 7 prediction tasks. Furthermore, PAAG achieves a significantly higher generation success rate (24.7% vs 4.7% for the zinc
finger domain, and 54.3% vs 22.0% for the immunoglobulin domain) in comparison to the existing model. We anticipate that PAAG will broaden the horizons of protein design by leveraging the knowledge
shared between textual annotations and proteins.