Empowering Protein Language Model for Sequence-Structure Co-Generation with Continuous Structure Tokens

09 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | Readers: Everyone | License: CC BY 4.0
Keywords: ai for science, protein language model, protein sequence-structure co-generation
Abstract: Proteins inherently possess a consistent sequence-structure duality. The abundance of protein sequence data, which can be readily represented as discrete tokens, has enabled fruitful developments in protein language models (pLMs). A key remaining challenge, however, is how to effectively integrate continuous structural knowledge into pLMs. Current methods often discretize protein structure to fit the language modeling framework, which inevitably loses fine-grained information and limits the performance potential of multimodal pLMs. In this paper, we argue that this loss can be avoided: a sequence-based pLM can be extended to incorporate the structure modality through continuous tokens, i.e., high-fidelity protein structure latents that avoid vector quantization. Specifically, we propose a hybrid diffusion protein language model, HD-Prot, which places a continuous-valued diffusion generation head atop a discrete pLM, so that a single multimodal generative pLM operates on both discrete and continuous tokens for joint sequence-structure modeling. The proposed model captures inter-token dependencies across modalities through a unified absorbing diffusion process, and estimates per-token distributions via categorical prediction for sequences and continuous diffusion for structures. Extensive empirical results show that our models achieve competitive performance in unconditional sequence-structure co-generation, motif-scaffolding, protein structure prediction, and inverse folding tasks, performing on par with state-of-the-art multimodal pLMs despite being developed under limited computational resources. These results underscore the viability of jointly modeling discrete categorical distributions and arbitrary continuous distributions using shared parameters within a pLM, pointing to an alternative and promising direction for multimodal pLMs.
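Illustrative sketch (not the authors' implementation): the abstract's two ingredients, an absorbing (masking) process over tokens of both modalities and separate per-token objectives for discrete and continuous tokens, might look roughly as follows in NumPy. Every name here (`MASK_ID`, `absorb`, `categorical_loss`, `diffusion_loss`, `noise_latent`) is hypothetical, as are the choices to mask the two modalities independently, to stand in a zero vector for a masked structure latent, and to use a noise-prediction objective for the continuous head; the abstract does not specify these details.

```python
import numpy as np

MASK_ID = 20  # hypothetical absorbing-state id; amino acids are assumed to use ids 0..19


def absorb(seq, struct, t, rng):
    """Absorbing forward process: mask each token independently with probability t.

    seq:    (L,) integer sequence tokens
    struct: (L, d) continuous structure latents
    Assumption: the two modalities draw independent masks.
    """
    seq_mask = rng.random(seq.shape[0]) < t
    struct_mask = rng.random(struct.shape[0]) < t
    noisy_seq = np.where(seq_mask, MASK_ID, seq)
    # A zero vector stands in for whatever mask embedding a real model would use.
    noisy_struct = np.where(struct_mask[:, None], 0.0, struct)
    return noisy_seq, noisy_struct, seq_mask, struct_mask


def categorical_loss(logits, targets, mask):
    """Cross-entropy averaged over masked sequence positions (discrete head)."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return float((nll * mask).sum() / max(mask.sum(), 1))


def noise_latent(z0, alpha_bar, rng):
    """Gaussian noising of a clean latent z0 at cumulative signal level alpha_bar."""
    eps = rng.standard_normal(z0.shape)
    z_noisy = np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps
    return z_noisy, eps


def diffusion_loss(eps_pred, eps, mask):
    """Noise-prediction MSE averaged over masked structure positions (continuous head)."""
    per_token = ((eps_pred - eps) ** 2).mean(axis=-1)
    return float((per_token * mask).sum() / max(mask.sum(), 1))
```

In a full model, a shared backbone would consume `noisy_seq` and `noisy_struct` and emit per-position features; the discrete head would turn them into `logits`, and the continuous head would be conditioned on them to produce `eps_pred`. Training would then sum the two losses over the masked positions.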
Supplementary Material: zip
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 3427