ProtDiff: Function-Conditioned Masked Diffusion Models for Robust Directed Protein Generation

Published: 12 Oct 2024, Last Modified: 11 Nov 2024
Venue: GenAI4Health Poster
License: CC BY 4.0
Keywords: Diffusion, Masked Diffusion, Absorbing State Diffusion, Protein Modeling, Function-Conditional Protein Generation, Classifier-Free Guidance
TL;DR: Masked diffusion language models based on the SUBS parameterization and conditioned on function tokens show state-of-the-art performance over autoregressive models in conditional de novo protein sequence generation.
Abstract: The development of function-specific protein sequences is a critical challenge in the field of drug discovery and protein design. Current methods, including structure-based and protein language models, face limitations due to biases in structural predictions, the requirement for extensive fine-tuning, and difficulty in zero-shot generalization to unseen functional tasks. To address these challenges, we propose ProtDiff, a novel protein sequence diffusion model conditioned on function tokens. Unlike traditional protein generative models that rely on multimodal inputs or extensive fine-tuning, ProtDiff utilizes a masked diffusion language modeling approach with classifier-free guidance to generate protein sequences in a zero-shot manner solely based on a predefined function token vocabulary. The diffusion process leverages an absorbing state forward process where protein sequences transition to a masked state, allowing a small transformer-based backbone model to iteratively reconstruct sequences. By training on InterPro datasets and employing a classifier-free diffusion guidance mechanism, ProtDiff demonstrates state-of-the-art performance in functional adaptation and de novo sequence generation tasks compared to existing models. Evaluation on both function-matching and de novo generation benchmarks shows that ProtDiff effectively generates novel, stable protein sequences that conform to specified functional constraints, showing comparable results to autoregressive model equivalents with higher parameter counts. Our results indicate that ProtDiff not only advances the state-of-the-art in protein design but also opens new avenues for explainable and targeted protein generation for drug discovery applications.
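The two mechanisms named in the abstract, an absorbing-state forward process and classifier-free guidance, can be illustrated with a minimal sketch. This is not the authors' implementation: the mask symbol `#`, the linear masking schedule (mask probability equal to `t`), the guidance formula, and the function names `forward_mask` and `cfg_logits` are all assumptions chosen for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"  # hypothetical symbol standing in for the absorbing (masked) state


def forward_mask(seq, t, rng):
    """Absorbing-state forward process (assumed linear schedule).

    Each residue is independently replaced by MASK with probability t,
    so t = 0 leaves the sequence intact and t = 1 masks every position.
    """
    return "".join(MASK if rng.random() < t else aa for aa in seq)


def cfg_logits(cond, uncond, w):
    """Classifier-free guidance over per-token logits (one common form).

    Moves from the unconditional logits toward the function-conditioned
    logits; w = 1 recovers the conditional logits, w > 1 extrapolates.
    """
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    rng = random.Random(0)
    seq = "MKTAYIAKQR"
    # Partially masked sequence at an intermediate noise level.
    print(forward_mask(seq, 0.5, rng))
    # Guided logits for a toy two-token vocabulary.
    print(cfg_logits([2.0, 0.0], [1.0, 0.0], 2.0))
```

In a full model, a denoising network would predict the masked residues given the partially masked sequence and the function token, and the guided logits would be used to sample replacements at each reverse step.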
Submission Number: 11