Bridging Protein Sequences and Microscopy Images with Unified Diffusion Models

Dihan Zheng; Bo Huang

Published: 01 May 2025, Last Modified: 23 Jul 2025, Venue: ICML 2025 poster, License: CC BY 4.0

Abstract: Fluorescence microscopy is ubiquitously used in cell biology research to characterize the cellular role of a protein. To help elucidate the relationship between the amino acid sequence of a protein and its cellular function, we introduce CELL-Diff, a unified diffusion model facilitating bidirectional transformations between protein sequences and their corresponding microscopy images. Utilizing reference cell morphology images and a protein sequence, CELL-Diff efficiently generates corresponding protein images. Conversely, given a protein image, the model outputs protein sequences. CELL-Diff integrates continuous and discrete diffusion models within a unified framework and is implemented using a transformer-based network. We train CELL-Diff on the Human Protein Atlas (HPA) dataset and fine-tune it on the OpenCell dataset. Experimental results demonstrate that CELL-Diff outperforms existing methods in generating high-fidelity protein images, making it a practical tool for investigating subcellular protein localization and interactions.
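
To make the pairing of a continuous modality (images) with a discrete one (sequences) concrete, the following is a minimal illustrative sketch of the two kinds of forward noising a unified diffusion model could combine. It is not the CELL-Diff implementation. The Gaussian corruption for pixels and the mask-based corruption for residues are common choices assumed here for illustration, and the names `noise_image`, `noise_sequence`, `MASK`, `alpha_bar`, and `mask_prob` are hypothetical.

```python
import math
import random

MASK = "#"  # hypothetical mask token; not taken from the paper


def noise_image(pixels, alpha_bar, rng):
    """Continuous forward step (assumed Gaussian form):
    x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, eps ~ N(0, 1)."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in pixels]


def noise_sequence(seq, mask_prob, rng):
    """Discrete forward step (assumed absorbing/mask form):
    each residue is independently replaced by MASK with probability mask_prob."""
    return "".join(MASK if rng.random() < mask_prob else aa for aa in seq)


# Toy inputs: a flattened 4-pixel "image" and a short amino acid sequence.
rng = random.Random(0)
image = [0.0, 0.5, 1.0, 0.25]
sequence = "MKTAYIAKQR"

noisy_image = noise_image(image, alpha_bar=0.5, rng=rng)
noisy_sequence = noise_sequence(sequence, mask_prob=0.5, rng=rng)
```

In a sketch like this, a single network would be trained to undo both corruptions: predicting the clean image from `noisy_image` conditioned on the sequence, and predicting the masked residues of `noisy_sequence` conditioned on the image. That is one way to read the bidirectional behavior described above, stated here as an assumption rather than as the authors' training procedure.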

Lay Summary: Biologists often use specialized microscopes to take pictures of proteins inside cells, helping us understand what those proteins do. But creating these images is time-consuming, expensive, and usually limited to proteins we already know a lot about. We wondered if we could build a model to imagine what a protein looks like in a cell, just from its genetic sequence. We built CELL-Diff, an AI model that learns how protein sequences relate to their appearance in microscope images. Once trained, it can generate realistic images of a protein in the cell or suggest what the sequence might be if you show it a protein image. We trained CELL-Diff on large public datasets of protein images and found that it outperforms previous methods. This means scientists could use it to explore unfamiliar proteins, design new ones, or better understand how proteins interact in the cell.

Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.

Primary Area: Applications->Everything Else

Keywords: AI for biology, Image generation, Diffusion model, Multimodal learning, Cell biology

Submission Number: 8034
