Abstract: A major goal in biotechnology is to generate libraries of functional proteins that display useful phenotypes. Toward this goal, previous approaches have leveraged probabilistic models of evolutionary sequences to design proteins that reflect the constraints governing natural evolution. Other approaches have incorporated labeled experimental data on a desired phenotype, either alone or alongside models of evolutionary sequences, to design proteins with a useful functional property. To minimize experimental effort and accelerate design cycles, we seek to quantify the minimal amounts and types of evolutionary and experimental data required to design novel sequences with useful properties, and to identify the best models for using all available data. Using a published model dataset of AAV gene therapy vector designs developed to achieve a desired tissue tropism, we evaluate models that use evolutionary and experimental data, independently and in concert, for their ability to predict capsid liver targeting. We find that natural sequence data become more important, and that a combination of both data types outperforms other supervised and unsupervised benchmarks, particularly when capsid formation data are used to predict the related phenotype of liver tropism and when the evaluated sequences are farther from the wild type. We introduce a semi-supervised Bayesian approach, trained on a combination of evolutionary sequences and capsid viability data, that best predicts AAV2 liver tropism for sequences more than three mutations away from the wild type. These results have beneficial implications for the design of diverse and functional AAV2 libraries, as well as for the broader objective of protein design.