Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Track: tiny / short paper (2-4 pages excluding references; extended abstract format)
Keywords: DNA methylation, bisulfite sequencing, genomic language models, continual pretraining, epigenetic learning, representation geometry, disease detection, foundation models for biology, methylation-aware representation, representation learning, LLM, model interoperability, domain adaptation
TL;DR: Continually pretraining genomic language models on bisulfite sequencing data makes them methylation-aware, revealing geometric signatures in their embeddings that encode epigenetic state without altering the model’s architecture.
Abstract: DNA methylation encodes regulatory information beyond the DNA sequence, but most
genomic language models (gLMs) miss this important modality because they are
pretrained on native DNA only. We test whether a widely used DNA checkpoint can
be retrofitted into a methylation-aware model by continual pretraining on
bisulfite sequencing (BS-seq) reads, where methylation is implicitly encoded
into token identities via C$\rightarrow$T conversion. Rather than proposing a
new architecture, we ask for compact, interpretable evidence that methylation is
encoded in representation space. Using DNABERT2 continually pretrained on a
multi-tissue BS-seq atlas, we present two simple geometric diagnostics: (i)
per-read embedding norms become bimodal and align with hypo/hypermethylated
contexts, and (ii) cosine distances between genomically matched tumor--normal
read pairs increase substantially after BS-seq adaptation, relative to the
native checkpoint. These results suggest that simple BS-seq retrofitting can
endow a standard DNA gLM with biologically meaningful, label-light
epigenetic sensitivity.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 55