Probing the Embedding Space of Protein Foundation Models through Intrinsic Dimension Analysis

Published: 13 Oct 2024, Last Modified: 01 Dec 2024
Venue: AIDrugX Poster
License: CC BY 4.0
Keywords: Foundation model, intrinsic dimension, ESM, MPNN, finetuning, pretraining
Abstract: Protein foundation models produce embeddings that are valuable for various downstream tasks, yet the structure and information content of these embeddings remain poorly understood, particularly in relation to diverse pre-training tasks and input modalities. We apply intrinsic dimension ($I_d$) analysis to quantify the complexity of protein embeddings from several widely used models, including ESM-2, ESM-IF, ProstT5, and ProteinMPNN. We also employ $I_d$ correlation ($I_d$Cor) to measure the shared information between different embeddings. Our results reveal a universality in protein embeddings, with similar $I_d$ scales across models and strong correlations between protein and residue embeddings. We observe significant redundancy, with $I_d$ values much smaller than the original embedding dimensions. We also show that models capture both spatial and sequential long-range correlations, with the rate of correlation decay differing by input modality and pre-training task. Lastly, we analyze mutant embeddings, revealing that mutations cluster effectively by site, and that fine-tuning further reduces the $I_d$ to capture task-specific representations.
Submission Number: 135
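To make the abstract's central quantity concrete, below is a minimal sketch of a TwoNN-style intrinsic dimension ($I_d$) estimate applied to an embedding matrix. The paper does not specify its estimator or preprocessing; the maximum-likelihood variant, the 10% ratio trimming, and the synthetic stand-in for ESM-2 embeddings are all illustrative assumptions.

```python
# Sketch of a TwoNN intrinsic-dimension estimate for protein embeddings.
# Assumptions (not from the paper): MLE form of the TwoNN estimator,
# 10% trimming of the largest neighbour ratios, and synthetic data in
# place of real ESM-2 embeddings (1280 is the ESM-2 650M embedding width).
import numpy as np
from sklearn.neighbors import NearestNeighbors


def two_nn_intrinsic_dimension(X: np.ndarray, discard_fraction: float = 0.1) -> float:
    """Estimate the intrinsic dimension of X (n_samples, n_features) via TwoNN."""
    # Distances to the two nearest neighbours (index 0 is the point itself).
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    r1, r2 = dists[:, 1], dists[:, 2]
    mu = np.sort(r2 / r1)                               # neighbour-distance ratios, mu >= 1
    mu = mu[: int(len(mu) * (1.0 - discard_fraction))]  # drop the largest ratios (outliers)
    # Under the TwoNN model, mu follows a Pareto law with shape I_d;
    # the maximum-likelihood estimate is n / sum(log mu).
    return len(mu) / np.sum(np.log(mu))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Points on a 10-d linear subspace embedded in 1280 ambient dimensions.
    latent = rng.normal(size=(2000, 10))
    X = latent @ rng.normal(size=(10, 1280))
    print(f"Estimated I_d: {two_nn_intrinsic_dimension(X):.1f}")  # ~10, far below 1280
```

On real model embeddings one would replace the synthetic `X` with, e.g., mean-pooled per-protein ESM-2 representations; an $I_d$ far below the 1280-dimensional ambient width is the kind of redundancy the abstract describes.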