Discriminative protein sequence modelling with Latent Space Diffusion

ICLR 2025 Workshop LMRL Submission 73 Authors

12 Feb 2025 (modified: 18 Apr 2025). Submitted to ICLR 2025 Workshop LMRL. License: CC BY 4.0

Track: Full Paper Track

Keywords: Representation learning, protein, diffusion, sequence, autoencoder

Abstract:

We introduce a framework for protein sequence representation learning that decomposes the task into manifold learning and distributional modelling. Specifically, we present a Latent Space Diffusion architecture that combines a protein sequence autoencoder with a denoising diffusion model operating on its latent space. In addition to the autoencoder’s latent representation, we obtain a one-parameter family of learned representations from the diffusion model. To address the challenge of identifying an appropriate latent space for diffusion, we propose and evaluate two autoencoder architectures: a homogeneous model, which forces amino acids of the same type to be identically distributed in the latent space, and an inhomogeneous model, which employs a noise-based variant of masking.
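
As a rough illustration of the two-stage idea, the following minimal Python sketch encodes a sequence with a per-residue-type lookup table and then noises the latents at several diffusion times. It is not the authors' implementation. The lookup table is a hypothetical stand-in for a trained homogeneous autoencoder, the cosine schedule is an assumed choice, and indexing the one-parameter family by diffusion time t is our reading of the abstract. No denoiser is trained here; the sketch only produces the inputs such a model would see.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
LATENT_DIM = 4  # illustrative size, not taken from the paper


def make_embedding_table(seed=0):
    """Hypothetical stand-in for a trained homogeneous encoder:
    one fixed latent vector per amino acid type."""
    rng = random.Random(seed)
    return {aa: [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
            for aa in AMINO_ACIDS}


def encode(sequence, table):
    """Map each residue to its type's latent vector, so residues of the
    same type share the same latent regardless of position."""
    return [list(table[aa]) for aa in sequence]


def alpha_bar(t):
    """Assumed cosine signal level: 1 at t=0 (clean), about 0 at t=1 (noise)."""
    return math.cos(0.5 * math.pi * t) ** 2


def noise_latents(latents, t, rng):
    """Forward diffusion: z_t = sqrt(alpha_bar) * z_0 + sqrt(1 - alpha_bar) * eps."""
    a = alpha_bar(t)
    signal, sigma = math.sqrt(a), math.sqrt(1.0 - a)
    return [[signal * x + sigma * rng.gauss(0.0, 1.0) for x in vec]
            for vec in latents]


table = make_embedding_table()
z0 = encode("MKTAYIAK", table)  # arbitrary example sequence
rng = random.Random(1)

# One noised view of the latents per diffusion time t.
family = {t: noise_latents(z0, t, rng) for t in (0.0, 0.5, 1.0)}
```

In this toy setup, `family[0.0]` coincides with the clean latents, while larger t progressively replaces signal with Gaussian noise. A trained denoiser conditioned on t would then supply one representation per noise level.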

Submission Number: 73
