Disentangling Protein Family Signals in Protein Language Models: Composition or Motifs?

Published: 06 Oct 2025, Last Modified: 06 Oct 2025 · NeurIPS 2025 2nd Workshop FM4LS Poster · CC BY 4.0
Keywords: protein language models, amino acid composition, protein families, embeddings, representation learning
TL;DR: Protein family clustering in LM embeddings is largely explained by global amino-acid composition rather than order-dependent sequence patterns.
Abstract: We test whether family-level separation in protein language model (pLM) embeddings persists after controlling for amino-acid composition. For six Pfam families and four models (ESM-2, ProtBERT, ProtXLNet, ProteinBERT), we compute layer-wise within-family (In) and between-family (Out) cosine similarities for true sequences and for composition-preserving shuffles. We report a ratio-based fidelity (In/Out) and a difference metric ∆ = In − Out, and visualize the geometry with t-SNE against a pooled negative bank. Across families and models, shuffled curves closely track true curves in both fidelity and ∆, and frequently match or exceed them. ProtXLNet's fidelity rises with depth, but its shuffled curve is typically comparable or higher; ProtBERT's mid-layer spike is mirrored by shuffles; ESM-2 and ProteinBERT are weak overall. t-SNE strips show compact clusters for both true and shuffled sequences, with negatives well separated. These results indicate that amino-acid composition accounts for much of the apparent family fidelity in current embeddings, motivating composition-controlled baselines and the reporting of both ratio- and difference-based metrics.
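The following is a minimal sketch of the two controls described in the abstract: a composition-preserving shuffle and the In/Out cosine-similarity summaries (fidelity = In/Out, ∆ = In − Out). It assumes per-sequence embeddings have already been pooled to one vector each; the function names and structure are illustrative and are not the authors' code.

```python
import random
import numpy as np


def composition_preserving_shuffle(seq: str, rng: random.Random) -> str:
    """Permute residues so amino-acid composition is preserved but order is destroyed."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def _normalize_rows(x: np.ndarray) -> np.ndarray:
    """L2-normalize each row so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def within_family_cosine(family_emb: np.ndarray) -> float:
    """Mean pairwise cosine similarity within a family, excluding self-pairs."""
    x = _normalize_rows(family_emb)
    sims = x @ x.T
    n = x.shape[0]
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))


def between_family_cosine(family_emb: np.ndarray, negative_emb: np.ndarray) -> float:
    """Mean cosine similarity between family embeddings and a pooled negative bank."""
    a = _normalize_rows(family_emb)
    b = _normalize_rows(negative_emb)
    return float((a @ b.T).mean())


def fidelity_and_delta(family_emb: np.ndarray, negative_emb: np.ndarray) -> tuple[float, float]:
    """Return (In/Out, In - Out) for one family at one layer."""
    in_sim = within_family_cosine(family_emb)
    out_sim = between_family_cosine(family_emb, negative_emb)
    return in_sim / out_sim, in_sim - out_sim
```

In the study's design, these summaries would be computed layer by layer for both the true sequences and their shuffled counterparts; if the shuffled curves track the true curves, the apparent family signal is attributable to composition rather than residue order.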
Submission Number: 83