Keywords: Deep Learning, gLMs, BEND, benchmark, WebDataset, data shuffling
TL;DR: Seemingly minor implementation details can significantly compromise benchmark validity.
Abstract: Seemingly minor implementation details can significantly compromise benchmark validity. We demonstrate this through BEND (Benchmarking DNA Language Models), where hardware-dependent hyperparameters -- the number of data-loading workers and buffer sizes -- create spurious performance variations of up to 4% for identical models.
The problem stems from inadequate data shuffling interacting with domain-specific data characteristics.
Experiments with three DNA language models (HyenaDNA, DNABERT-2, ResNet-LM) show that these artifacts affect both absolute performance and relative model rankings.
We propose a simple solution: pre-shuffling data before storage eliminates hardware dependencies while maintaining efficiency. This work highlights how standard ML practices can interact unexpectedly with domain-specific data characteristics, with broader implications for benchmark design in specialized domains.
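As an illustration of the proposed fix, the sketch below shuffles an entire dataset once in memory and only then writes it to WebDataset shards, so downstream loaders see a random order regardless of worker count or buffer size. This is a minimal sketch, not the authors' pipeline: the function name, sample schema, and shard layout are hypothetical assumptions.

```python
import random
import webdataset as wds  # pip install webdataset

def write_preshuffled_shards(samples, pattern="shards/data-%06d.tar", seed=0):
    """Globally shuffle `samples` once, then write them to WebDataset shards.

    `samples` is assumed to be a list of (key, sequence, label) tuples;
    the field names below are illustrative, not BEND's actual schema.
    """
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)  # one global shuffle before storage

    with wds.ShardWriter(pattern, maxcount=10_000) as sink:
        for i in order:
            key, seq, label = samples[i]
            sink.write({
                "__key__": str(key),      # unique sample key
                "seq.txt": seq,           # DNA sequence stored as text
                "label.cls": int(label),  # integer class label
            })
```

Because the on-disk order is already random, a reader can then use a small (or no) shuffle buffer and any number of workers without reintroducing order-dependent bias.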
Primary Area: datasets and benchmarks
Submission Number: 19792