Keywords: Protein Language Models, Isotropy
TL;DR: Most protein language models are highly anisotropic; multi-modal training improves isotropy.
Abstract: Large pretrained language models have transformed natural language processing, and their adaptation to protein sequences---viewed as strings of amino acid characters---has advanced protein analysis. However, the distinct properties of proteins, such as variable sequence lengths and the lack of word-sentence analogs, necessitate a deeper understanding of protein language models (LMs). We investigate the isotropy of protein LM embedding spaces using average pairwise cosine similarity and the IsoScore method, revealing that models like ProtBERT and ProtXLNet are highly anisotropic, utilizing only 2--14 dimensions for global and local representations. In contrast, multi-modal training in ProteinBERT, which integrates sequence and gene ontology data, enhances isotropy, suggesting that diverse biological inputs improve representational efficiency. We also find that embedding distances correlate only weakly with alignment-based similarity scores, particularly for sequence pairs with low similarity.
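The following is a minimal sketch (not the authors' code) of one of the two isotropy measures named in the abstract, average pairwise cosine similarity over pooled protein LM embeddings. The embedding matrix, pooling choice, and example dimensions here are illustrative assumptions; the IsoScore measure is not reimplemented.

```python
import numpy as np


def avg_pairwise_cosine_similarity(embeddings: np.ndarray) -> float:
    """Mean cosine similarity over all distinct pairs of embedding vectors.

    Values near 0 suggest an isotropic (direction-uniform) embedding space;
    values near 1 indicate strong anisotropy (vectors crowd into a narrow cone).
    """
    # Normalize each embedding to unit length.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    # Cosine similarities are pairwise dot products of unit vectors.
    sims = unit @ unit.T
    # Average over the strict upper triangle (distinct pairs only).
    iu = np.triu_indices(embeddings.shape[0], k=1)
    return float(sims[iu].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Isotropic baseline: random Gaussian vectors have near-zero mean similarity.
    isotropic = rng.normal(size=(500, 1024))
    # Anisotropic toy case: a large shared offset gives all vectors a common direction.
    anisotropic = isotropic + 10.0
    print(f"isotropic   ~ {avg_pairwise_cosine_similarity(isotropic):.3f}")
    print(f"anisotropic ~ {avg_pairwise_cosine_similarity(anisotropic):.3f}")
```

In practice, `embeddings` would be a matrix of per-sequence representations (e.g. mean-pooled ProtBERT or ProtXLNet hidden states), and a high average similarity is one symptom of the anisotropy the abstract reports.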
Submission Number: 70