Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Track: tiny / short paper (2-4 pages excluding references; extended abstract format)
Keywords: DNA methylation, bisulfite sequencing, genomic language models, continual pretraining, epigenetic learning, representation geometry, disease detection, foundation models for biology, methylation-aware representation, representation learning, LLM, model interoperability, domain adaptation
TL;DR: Continually pretraining genomic language models on bisulfite sequencing data makes them methylation-aware, revealing geometric signatures in their embeddings that encode epigenetic state without altering the model’s architecture.
Abstract: DNA methylation encodes regulatory information beyond the DNA sequence, but most
genomic language models (gLMs) miss this important modality because they are
pretrained on native DNA only. We test whether a widely used DNA checkpoint can
be retrofitted into a methylation-aware model by continual pretraining on
bisulfite sequencing (BS-seq) reads, where methylation is implicitly encoded
into token identities via C$\rightarrow$T conversion. Rather than proposing a
new architecture, we ask for compact, interpretable evidence that methylation is
encoded in representation space. Using DNABERT2 continually pretrained on a
multi-tissue BS-seq atlas, we present two simple geometric diagnostics: (i)
per-read embedding norms become bimodal and align with hypo/hypermethylated
contexts, and (ii) cosine distances between genomically matched tumor--normal
read pairs increase substantially after BS-seq adaptation, relative to the
native checkpoint. These results suggest that simple BS-seq retrofitting can
endow a standard DNA gLM with biologically meaningful, label-light
epigenetic sensitivity.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 55