Understanding DNA Discrete Diffusion for Engineering Regulatory DNA Sequences

Published: 05 Mar 2025, Last Modified: 16 Apr 2025
Venue: ICLR 2025 AI4NA Poster
License: CC BY 4.0
Track: long paper (up to 6 pages)
Keywords: DNA, DNN, Discrete Diffusion, Genomics
TL;DR: This study investigates the behavior of the DNA discrete diffusion model in low-data regimes and conditional generation, along with its sequence evolution patterns, revealing opportunities for optimization in regulatory DNA sequence design.
Abstract: Engineering regulatory DNA sequences with precise activity levels remains a major challenge in medicine and biotechnology due to the vast combinatorial space of possible sequences and the complex regulatory grammars governing gene expression. DNA discrete diffusion (D3) has emerged as a promising approach for learning these distributions and generating biologically relevant sequences, yet several key aspects of its capabilities remain unexplored. Here we systematically investigate D3’s performance in biologically relevant, understudied scenarios. First, we demonstrate that D3 maintains robust performance even with limited training data, highlighting its practical utility in real-world applications where data is scarce. Second, we extend D3’s conditional generation capabilities for categorical data, employing classifier-free guidance to improve the quality and specificity of generated sequences. Third, we analyze sequence trajectories during the diffusion process, providing insights into how discrete diffusion navigates the sequence-function landscape. Together, these findings expand our understanding of D3’s strengths and limitations, while introducing new methodological advances for engineering functional regulatory DNA sequences.
Supplementary Material: pdf
Submission Number: 34