Exploring the potential of genetic variation and zygosity in DNA language models

Published: 05 Mar 2025, Last Modified: 07 May 2025, MLGenX 2025 TinyPapers, CC BY 4.0
Track: Tiny paper track (up to 4 pages)
Abstract: Advances in DNA language models (DNA-LMs) have improved phenotype prediction from DNA sequences, yet the roles of zygosity and genetic variation (GV) remain underexplored. In this study, we quantify their effects on gene expression prediction, used here as an example of a variation-sensitive phenotype. We show that baseline models benefit from zygosity- and GV-aware encoding, while DNA-LMs struggle to make use of this information. These findings underscore the need to integrate biologically meaningful features such as zygosity and GV into DNA-LM pretraining to better capture genetic diversity and improve variant interpretation.
Submission Number: 2
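
To make the phrase "zygosity- and GV-aware encoding" from the abstract concrete, the sketch below shows one possible way such an encoding could look. It is an illustration under stated assumptions, not the encoding used in the paper: it assumes two phased, equal-length haplotype strings that differ only by single-nucleotide variants (no indels), and the function names `one_hot`, `encode_diploid`, and `heterozygous_sites` are hypothetical. Each position is encoded as the average of the two haplotypes' one-hot vectors, so a homozygous site has a single 1.0 and a heterozygous site has two 0.5 entries.

```python
# Hypothetical illustration of a zygosity-aware input encoding.
# Assumption: both haplotypes are phased, equal-length strings (SNVs only, no indels).
BASES = "ACGT"


def one_hot(base):
    """Return a 4-element one-hot list for A/C/G/T; all zeros otherwise (e.g. N)."""
    return [1.0 if base == b else 0.0 for b in BASES]


def encode_diploid(hap1, hap2):
    """Average the one-hot encodings of two haplotypes, position by position.

    Homozygous sites yield one 1.0 entry; heterozygous sites yield two 0.5 entries.
    """
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must have equal length")
    return [
        [(x + y) / 2 for x, y in zip(one_hot(a), one_hot(b))]
        for a, b in zip(hap1.upper(), hap2.upper())
    ]


def heterozygous_sites(hap1, hap2):
    """Return the 0-based positions where the two haplotypes differ."""
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must have equal length")
    return [i for i, (a, b) in enumerate(zip(hap1.upper(), hap2.upper())) if a != b]


if __name__ == "__main__":
    maternal, paternal = "ACGT", "ACTT"
    print(heterozygous_sites(maternal, paternal))  # [2]
    print(encode_diploid(maternal, paternal)[2])   # [0.0, 0.0, 0.5, 0.5]
```

A reference-only encoding would represent both individuals in this example identically if they shared the same reference window, whereas the averaged encoding distinguishes homozygous-reference, heterozygous, and homozygous-alternate genotypes at each site. Whether a given model can exploit that distinction is the question the abstract raises.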