Abstract: Open genomic regions, being accessible to regulatory proteins, could act as the on/off switch or amplifier/attenuator of gene expression, and thus reflect the defining characteristics of cell types. Many previous models predict regulatory regions from sequence, but the interaction between regulatory regions and genes can be complex and can differ between cell types. Moreover, current models usually perform well only on the cell types in the training set and do not generalize to data-scarce scenarios. In this work, we propose GeneBERT, a simple yet effective approach for pre-training on genome data in a multi-modal and self-supervised manner. Specifically, we simultaneously take the 1D sequence of genome data and a 2D matrix of (transcription factors × regions) as input, and propose three pre-training tasks to improve the robustness and generalizability of our model. We pre-train our model on an ATAC-seq dataset with 17 million gene sequences. We evaluate GeneBERT on various downstream tasks, including promoter prediction, transcription factor binding site prediction, disease risk estimation, and RNA splicing. Extensive experiments demonstrate the effectiveness of multi-modal and self-supervised pre-training for large-scale genome data.
Track: Original Research Track