Keywords: long-range DNA benchmark, long-range DNA modeling, long-range DNA foundation models
TL;DR: Proposing a benchmark dataset for long-range DNA prediction tasks
Abstract: Modeling long-range DNA dependencies is crucial for understanding genome structure and function across a wide range of biological contexts in health and disease. However, effectively capturing these dependencies, which can span millions of base pairs in tasks such as three-dimensional (3D) chromatin folding, remains a significant challenge. Moreover, a comprehensive benchmark suite for evaluating tasks that rely on long-range dependencies is still lacking. To address this gap, we introduce DNALONGBENCH, a benchmark dataset spanning five important genomics tasks that involve long-range dependencies of up to 1 million base pairs: enhancer-target gene interaction, expression quantitative trait loci, 3D genome organization, regulatory sequence activity, and transcription initiation signal. To assess DNALONGBENCH comprehensively, we evaluate five baseline methods: a task-specific expert model, a convolutional neural network (CNN)-based model, and three fine-tuned DNA foundation models (HyenaDNA, Caduceus-Ph, and Caduceus-PS). We envision DNALONGBENCH becoming a standardized resource that facilitates comprehensive comparisons and rigorous evaluations of emerging DNA sequence-based deep learning models that consider long-range dependencies.
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5418