Keywords: protein embeddings, neural compression, multimodal generation
Abstract: ESMFold learns a joint latent space of sequence and structure while requiring only sequence as input. However, we find that this latent space is disorganized and exhibits pathologies, similar to those observed in large language models, that render it unusable for multimodal representation learning. Meanwhile, latent diffusion in both continuous and discrete spaces has improved efficiency and performance in image and multimodal generation, but it builds on an abundance of knowledge about autoencoders for images.
To obtain a protein encoder that captures structural and functional information for generative modeling in the latent space, we introduce CHEAP (Compressed Hourglass Embedding Adaptations of Proteins) representations. We find that the channel dimension of the ESMFold latent space can be compressed by up to $256\times$ while retaining rich structural, sequence, and functional information, as demonstrated by protein understanding benchmarks and reconstruction performance.
Submission Number: 25