Beyond Nativeness: Viral Proteins in Protein Language Models
Keywords: protein language models, viral proteins, embedding geometry, representation learning, biological sequence modeling, viral classification, scaling laws, underrepresented data, sequence embeddings, bioinformatics, machine learning for biology
TL;DR: Protein language models place viral proteins along a nativeness axis between cellular and random sequences, while embeddings still preserve viral-specific signal beyond perplexity.
Abstract: Protein language models (pLMs) are trained on highly imbalanced datasets, raising the question of how they represent underrepresented biological sequences. Using viral proteins as a case study across ESM model families, we identify a dominant nativeness axis in embedding space, aligned with masked-reconstruction perplexity, that orders sequences from well-modeled cellular proteins through viral proteins to shuffled and random sequences. Scaling contracts this axis unevenly across viral families. Even so, pLM embeddings retain viral-specific signal: viral proteins remain linearly separable from other sequences even after accounting for zero-shot perplexity and shallow sequence features. Together, these results suggest that pLM representations are structured by a general notion of nativeness while preserving information specific to distinct biological groups.
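Illustration: The abstract does not spell out how masked-reconstruction perplexity is computed, so the following is only a minimal sketch under a common assumption: each position is masked in turn, the model's log-probability for the true residue is recorded, and the perplexity is the exponential of the negative mean of those values. The function names `pseudo_perplexity` and `shuffled_control` are hypothetical, and the log-probabilities are stand-in numbers rather than the output of any ESM model.

```python
import math
import random


def pseudo_perplexity(true_token_log_probs):
    """Return exp(-mean log-prob) over masked positions.

    `true_token_log_probs` holds, for each position, the natural-log
    probability a model assigned to the true residue when that position
    was masked. These values are assumed to come from elsewhere.
    """
    if not true_token_log_probs:
        raise ValueError("need at least one position")
    mean_log_prob = sum(true_token_log_probs) / len(true_token_log_probs)
    return math.exp(-mean_log_prob)


def shuffled_control(sequence, seed=0):
    """Return a residue-shuffled copy of `sequence` (same composition)."""
    rng = random.Random(seed)
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


# Stand-in values: a model that guesses uniformly over 20 amino acids
# gives a perplexity of about 20; a confident model scores lower.
uniform_guess = [math.log(1 / 20)] * 50
confident = [math.log(0.8)] * 50
print(pseudo_perplexity(uniform_guess))
print(pseudo_perplexity(confident))
print(shuffled_control("MKTAYIAKQR"))
```

Under this definition, lower values correspond to better-modeled sequences, which matches the ordering the abstract describes from cellular proteins to shuffled and random sequences.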
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 33