Keywords: DNA, Representation Learning, Pre-Training, Foundation Model, Knowledge Graph, Biological Engineering
TL;DR: We explore knowledge graph embedding methods to enhance DNA representation learning in foundation models.
Abstract: Understanding the language of the genome remains a key challenge in biology, with pre-trained models such as DNABERT-2 achieving substantial advances. These models leverage massive nucleotide sequences through a self-supervised learning paradigm, yet they often overlook the rich, structured knowledge already curated by human experts. Inspired by knowledge-enhanced foundation models for other biological molecules (e.g., proteins and drugs), we introduce Knowledge Graph-Augmented DNABERT (KGA-DNABERT), which augments the masked language modeling (MLM) objective with knowledge graph (KG) modeling. Specifically, we construct KGs by extracting factual triplets from GenomicKB, a comprehensive human genome database. In addition to DNABERT-2’s MLM, we incorporate six popular KG embedding methods to model the curated KG beyond sequence-level representations. We did not observe substantial benefits from incorporating the constructed KGs into DNA representation learning and attribute this to their insufficient coverage, as they represent only an excerpt of GenomicKB. This motivates us to further explore better ways of integrating KGs into DNA representation learning.
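To make the joint objective concrete, below is a minimal sketch, not the authors' implementation, of combining an MLM loss with a TransE-style KG embedding loss (TransE being one popular KG embedding method; whether it is among the six used here is an assumption). All names, dimensions, and hyperparameters (`TransEHead`, `joint_loss`, `alpha`, the margin) are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransEHead(nn.Module):
    """Scores (head, relation, tail) triplets with the TransE distance ||h + r - t||."""

    def __init__(self, num_entities: int, num_relations: int, dim: int = 128):
        super().__init__()
        self.entity = nn.Embedding(num_entities, dim)
        self.relation = nn.Embedding(num_relations, dim)

    def score(self, h, r, t):
        # Lower distance means the triplet is more plausible.
        return (self.entity(h) + self.relation(r) - self.entity(t)).norm(p=2, dim=-1)

    def margin_loss(self, pos, neg, margin: float = 1.0):
        # Positive triplets should score lower (closer) than corrupted negatives.
        return F.relu(margin + self.score(*pos) - self.score(*neg)).mean()


def joint_loss(mlm_loss: torch.Tensor, kg_head: TransEHead, pos, neg, alpha: float = 0.5):
    """Weighted sum of the sequence-level MLM loss and the KG embedding loss."""
    return mlm_loss + alpha * kg_head.margin_loss(pos, neg)
```

In practice, the MLM loss would come from the DNABERT-2 backbone over masked nucleotide sequences, while the KG term regularizes entity representations using the triplets extracted from GenomicKB; the weighting `alpha` is a tunable hyperparameter.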
Submission Number: 61