Elucidating the Design Space of Multimodal Protein Language Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Spotlight Poster · CC BY 4.0
TL;DR: We elucidate the design choices of multimodal protein language models to address their limitations in structural modeling.
Abstract: Multimodal protein language models (PLMs) integrate sequence and token-based structural information, serving as a powerful foundation for protein modeling, generation, and design. However, the reliance on tokenizing 3D structures into discrete tokens causes a substantial loss of fidelity in fine-grained structural details and correlations. In this paper, we systematically elucidate the design space of multimodal PLMs to overcome these limitations. We identify tokenization loss and inaccurate structure token predictions by the PLMs as major bottlenecks. To address these, our proposed design space covers improved generative modeling, structure-aware architectures and representation learning, and data exploration. Our advancements move toward finer-grained supervision, demonstrating that token-based multimodal PLMs can achieve robust structural modeling. These design methods dramatically improve structure generation diversity and, notably, the folding ability of our 650M model, reducing RMSD from 5.52 to 2.36 on the PDB test set, outperforming 3B baselines and performing on par with specialized folding models. Project page and code: https://bytedance.github.io/dplm/dplm-2.1.
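For reference, the RMSD figure quoted above measures the average per-atom coordinate deviation between a predicted structure and the ground truth after superposition. Below is a minimal illustrative sketch of the metric, assuming pre-aligned Cα coordinate arrays; the function name is hypothetical and is not taken from the paper's codebase:

```python
import numpy as np

def rmsd(pred: np.ndarray, ref: np.ndarray) -> float:
    """Root-mean-square deviation between two aligned (N, 3) coordinate arrays."""
    assert pred.shape == ref.shape
    return float(np.sqrt(np.mean(np.sum((pred - ref) ** 2, axis=-1))))

# Example: two toy 3-residue backbones (coordinates in angstroms)
pred = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0]])
ref  = np.array([[0.1, 0.0, 0.0], [3.7, 0.2, 0.0], [7.5, 0.0, 0.1]])
print(rmsd(pred, ref))  # small value -> close structural agreement
```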
Lay Summary: Proteins are essential molecules of life, and understanding both their 3D structures and amino acid sequences is crucial for applications such as drug discovery and protein design. Recent computer models, known as multimodal protein language models (PLMs), learn to generate both by observing protein sequences and protein 3D structures that have been converted into small symbolic units called "tokens", a process known as "tokenization". However, this process causes a substantial loss of structural detail, limiting the models' ability to accurately predict protein structures. In this paper, we address this challenge by elucidating effective design methods for multimodal PLMs. We propose several methods, including training strategies that help the model capture structural patterns more effectively, architectural designs tailored to proteins, and the exploration of protein data with multiple chains, which carry rich structural arrangements and interaction scenarios. Our evaluations show that these designs effectively improve the accuracy of multimodal PLMs at predicting structures, with fewer model parameters and hence less computational overhead than prior baselines.
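The "tokenization" step described above is commonly realized as a vector-quantization lookup: each residue's continuous structural features are snapped to the nearest entry of a learned codebook, and only that entry's index (the token) is kept, which is exactly where fine-grained detail is lost. A minimal sketch of this idea follows; the names and shapes are hypothetical assumptions for illustration, not the DPLM codebase:

```python
import numpy as np

def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each per-residue feature vector to the index of its nearest codebook entry."""
    # (N, 1, D) - (1, K, D) -> (N, K) squared distances to every codebook entry
    d2 = np.sum((features[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    return np.argmin(d2, axis=-1)  # (N,) discrete structure tokens

rng = np.random.default_rng(0)
codebook = rng.normal(size=(4096, 64))   # e.g., a 4096-entry learned codebook
features = rng.normal(size=(128, 64))    # per-residue structure embeddings
tokens = quantize(features, codebook)
# Decoding via codebook[tokens] recovers only the nearest codebook vectors,
# not the original features -- this gap is the tokenization loss the paper targets.
```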
Primary Area: Applications->Chemistry, Physics, and Earth Sciences
Keywords: Multimodal protein language model, structure tokenization, improved structure generative modeling, bitwise modeling, hybrid data-space modeling, structure-aware architectures, representation learning, multimer
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://bytedance.github.io/dplm/dplm-2.1/
Submission Number: 7470