Protein language models are biased by unequal sequence sampling across the tree of life

Published: 04 Mar 2024, Last Modified: 29 Apr 2024, GEM Poster, CC BY 4.0
Track: Machine learning: computational method and/or computational results
Cell: I do not want my work to be considered for Cell Systems
Keywords: protein language model, protein design, protein fitness, evolution, bias
TL;DR: Protein language models have a species bias caused by training data imbalance, and this bias can be detrimental for protein design.
Abstract: Protein language models (pLMs) trained on large protein sequence databases have been used to understand disease and design novel proteins. In design tasks, the likelihood of a protein sequence under a pLM is often used as a proxy for protein fitness, so it is critical to understand what signals likelihoods capture. In this work we find that pLM likelihoods unintentionally encode a species bias: likelihoods of protein sequences from certain species are systematically higher, independent of the protein in question. We quantify this bias and show that it arises in large part because of unequal species representation in popular protein sequence databases. We further show that the bias can be detrimental for some protein design applications, such as enhancing thermostability. These results highlight the importance of understanding and curating pLM training data to mitigate biases and improve protein design capabilities in under-explored parts of sequence space.
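One way to picture a likelihood offset that is "independent of the protein in question" is the minimal sketch below. It is an illustrative assumption, not the authors' actual quantification procedure: it takes per-sequence log-likelihoods that some pLM has already produced, subtracts each protein's mean score across species, and averages what remains per species. The function name `species_offsets`, the tuple layout, and the toy numbers are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def species_offsets(scores):
    """Estimate a per-species likelihood offset.

    `scores` is a list of (protein, species, log_likelihood) tuples.
    This is a hypothetical illustration, not the paper's method.
    """
    # Mean log-likelihood of each protein across all species.
    by_protein = defaultdict(list)
    for protein, _species, ll in scores:
        by_protein[protein].append(ll)
    protein_mean = {p: mean(v) for p, v in by_protein.items()}

    # Residual after removing the protein-level mean, grouped by species.
    residuals = defaultdict(list)
    for protein, species, ll in scores:
        residuals[species].append(ll - protein_mean[protein])

    # A consistently positive mean residual suggests the species is favoured.
    return {s: mean(r) for s, r in residuals.items()}


# Made-up numbers: species1 scores 1.0 higher than species2 for both proteins.
toy_scores = [
    ("proteinA", "species1", -1.0),
    ("proteinA", "species2", -2.0),
    ("proteinB", "species1", -3.0),
    ("proteinB", "species2", -4.0),
]
offsets = species_offsets(toy_scores)
```

In this toy data the two proteins have different baseline scores, yet species1 sits above the protein mean by the same amount in both, which is the kind of protein-independent pattern the abstract describes.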
Submission Number: 79