Keywords: Continual Learning, Protein Language Models
TL;DR: We introduce a continual pretraining benchmark for protein language models.
Abstract: In recent years, protein language models (pLMs) have gained significant attention
for their ability to capture the structure and function of proteins, accelerating
the discovery of new therapeutics. These models are typically trained on large,
evolving corpora of proteins that are continuously updated by the biology community.
The dynamic nature of these datasets motivates the need for continual learning, not
only to keep pace with ever-growing dataset sizes, but also to take advantage of the
temporal meta-information created as the data evolves.
To this end, we introduce the Continual Pretraining of Protein Language Models
(CoPeP) benchmark for evaluating continual learning approaches on
pLMs. Specifically, we curate a sequence of protein datasets from the UniProt
database spanning 8 years and define metrics to
assess the performance of pLMs on diverse protein understanding tasks. We evaluate
several methods from the continual learning literature, including replay,
unlearning, and plasticity-based methods, some of which have never been applied to
models and data of this scale. Our findings reveal that incorporating temporal
meta-information improves perplexity by up to $20\%$ relative to training only on
the latest snapshot of the database, and that several continual
learning-based methods outperform naive continual pretraining. The CoPeP benchmark
presents an exciting opportunity for studying these methods at scale on an
impactful, real-world application.
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 21674