Keywords: Deep Learning, gLMs, BEND, benchmark, WebDataset, data shuffling
TL;DR: Seemingly minor implementation details can significantly compromise benchmark validity.
Abstract: Seemingly minor implementation details can significantly compromise benchmark validity. We demonstrate this through BEND (Benchmarking DNA Language Models), where hardware-dependent hyperparameters -- the number of data-loading workers and buffer sizes -- create spurious performance variations of up to 4% for identical models.
The problem stems from inadequate data shuffling interacting with domain-specific data characteristics.
Experiments with three DNA language models (HyenaDNA, DNABERT-2, ResNet-LM) show that these artifacts affect both absolute performance and relative model rankings.
We propose a simple solution: pre-shuffling data before storage eliminates hardware dependencies while maintaining efficiency. This work highlights how standard ML practices can interact unexpectedly with domain-specific data characteristics, with broader implications for benchmark design in specialized domains.
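As an illustration of the proposed fix, the sketch below shuffles an entire dataset once in memory and only then writes it to WebDataset shards, so downstream loaders see a random order regardless of worker count or buffer size. This is a minimal sketch, not the authors' pipeline: the function name, sample schema, and shard layout are hypothetical assumptions.

```python
import random
import webdataset as wds  # pip install webdataset

def write_preshuffled_shards(samples, pattern="shards/data-%06d.tar", seed=0):
    """Globally shuffle `samples` once, then write them to WebDataset shards.

    `samples` is assumed to be a list of (key, sequence, label) tuples;
    the field names below are illustrative, not BEND's actual schema.
    """
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)  # one global shuffle before storage

    with wds.ShardWriter(pattern, maxcount=10_000) as sink:
        for i in order:
            key, seq, label = samples[i]
            sink.write({
                "__key__": str(key),      # unique sample key
                "seq.txt": seq,           # DNA sequence stored as text
                "label.cls": int(label),  # integer class label
            })
```

Because the on-disk order is already random, a reader can then use a small (or no) shuffle buffer and any number of workers without reintroducing order-dependent bias.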
Primary Area: datasets and benchmarks
Submission Number: 19792