Keywords: protein language models, data leakage, generalization, domain annotation, ESM-2
TL;DR: We evaluate the out-of-distribution generalization of protein language models by introducing controlled data leakage into a downstream domain annotation task built on strict train, validation, and test splits.
Abstract: Protein language models (pLMs) are increasingly used for protein function prediction tasks such as detecting and annotating homologous domains in sequences. However, because they are pretrained on such a broad sample of known protein sequence space, it becomes difficult to construct downstream train/test splits that are truly independent of pretraining data and therefore to assess whether downstream performance reflects genuine generalization. We study this question in a controlled setting by constructing large-scale train, validation, and test splits with no detectable cross-split homology and using them for both pLM pretraining and downstream evaluation. Holding architecture, compute, and downstream training fixed, we vary only the amount of overlap between pretraining and test data and measure its effect on domain annotation sensitivity. Pretraining overlap substantially increases test-set sensitivity even when training loss, validation perplexity, and validation-set downstream performance remain nearly unchanged, showing that overlap with pretraining data can inflate apparent performance without improving broader generalization. We also find that domain-relevant signal emerges early during masked language model pretraining.
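The overlap manipulation described in the abstract can be illustrated with a minimal sketch. The abstract does not say how overlap is introduced, so the sampling scheme here is an assumption: a random fraction of test sequences is added to the pretraining corpus. The names `build_pretraining_set`, `overlap_fraction`, and `sensitivity` are hypothetical and do not come from the authors' code. Sensitivity is computed with its standard definition, TP / (TP + FN).

```python
import random


def build_pretraining_set(train_seqs, test_seqs, overlap_fraction, seed=0):
    """Return (pretraining corpus, leaked test sequences).

    Assumption: leakage is modeled by adding a uniformly sampled fraction
    of the test sequences to the training sequences. The actual procedure
    used in the paper may differ.
    """
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be in [0, 1]")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    n_leaked = round(overlap_fraction * len(test_seqs))
    leaked = rng.sample(list(test_seqs), n_leaked)
    return list(train_seqs) + leaked, set(leaked)


def sensitivity(true_positives, false_negatives):
    """Standard sensitivity (recall): TP / (TP + FN)."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


if __name__ == "__main__":
    # Toy placeholder sequences, not real data.
    train = ["MKTAYIAK", "MSLLTEVE", "MGSSHHHH"]
    test = ["MADEEKLP", "MVLSPADK", "MNIFEMLR", "MKVLAAGI"]
    for fraction in (0.0, 0.5, 1.0):
        corpus, leaked = build_pretraining_set(train, test, fraction)
        print(fraction, len(corpus), len(leaked))
```

In a setup like this, test-set sensitivity would be compared across values of `overlap_fraction` while architecture, compute, and downstream training stay fixed.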
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 42