The Genomics Long-Range Benchmark: Advancing DNA Language Models

29 May 2024 (modified: 13 Nov 2024) · Submitted to NeurIPS 2024 Datasets and Benchmarks Track · CC BY-NC-SA 4.0
Keywords: DNA, Language Models, Genomics, Benchmark
Abstract: The advent of language models (LMs) in genomics necessitates benchmarks that can assess models’ capabilities and limitations. In contrast to protein models, DNA LMs can be used to study non-coding regions of the genome and must account for unique challenges, especially interactions across long sequence lengths. However, existing benchmarks for DNA LMs are defined over short-sequence datasets and often involve tasks that are not considered biologically meaningful. Here, we present the Genomics Long-Range Benchmark (LRB), which focuses on biologically meaningful tasks and supports long-range contexts. We complement our benchmark with fine-tuning recipes that meaningfully improve performance and affect model evaluation. We evaluate DNA LMs across nine compiled tasks and observe that DNA LMs achieve competitive performance relative to supervised baselines on several tasks (e.g., genome annotation), but a significant gap remains in domains such as variant effect and gene expression prediction. Additionally, we introduce a visualization tool to examine model performance split by various genomic properties. Lastly, we present methods for context-length extrapolation of transformer-based models that enable studying the effect of context length on DNA LM performance. The Genomics LRB is publicly available on Hugging Face: https://hf.co/datasets/InstaDeepAI/genomics-long-range-benchmark.
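As a minimal usage sketch, the benchmark can be accessed through the Hugging Face datasets library, with each task exposed as a dataset configuration. The specific config name used below ("variant_effect_causal_eqtl") and the need for trust_remote_code are assumptions for illustration; consult the dataset card at the URL above for the exact task names and loading arguments.

    from datasets import get_dataset_config_names, load_dataset

    repo = "InstaDeepAI/genomics-long-range-benchmark"

    # Enumerate the available task configurations in the benchmark repo.
    print(get_dataset_config_names(repo))

    # Load a single task. The config name here is an assumed example;
    # trust_remote_code=True may be required if the dataset ships a
    # custom loading script.
    ds = load_dataset(repo, name="variant_effect_causal_eqtl",
                      trust_remote_code=True)
    print(ds)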
Supplementary Material: pdf
Submission Number: 2081