Self-GenomeNet: Self-supervised Learning with Reverse-Complement Context Prediction for Nucleotide-level Genomics Data

Hüseyin Anil Gündüz; Martin Binder; Xiao-Yin To; René Mreches; Philipp C. Münch; Alice C McHardy; Bernd Bischl; Mina Rezaei

Published: 28 Jan 2022, Last Modified: 13 Feb 2023 | ICLR 2022 Submitted | Readers: Everyone

Keywords: Genome Sequence Analysis, Self-supervised Learning, Representation Learning, Application in Computational Biology

Abstract: We introduce Self-GenomeNet, a novel contrastive self-supervised learning method for nucleotide-level genomic data that substantially improves the quality of the learned representations and performance compared to current state-of-the-art deep learning frameworks. To the best of our knowledge, Self-GenomeNet is the first self-supervised framework that learns a representation of nucleotide-level genome data using domain-specific characteristics. Our method learns and parametrizes the latent space by leveraging the reverse-complement of genomic sequences. During training, we force the framework to capture semantic representations with a novel context network placed on top of intermediate features extracted by an encoder network. The network is trained with an unsupervised contrastive loss. Extensive experiments show that, in both self-supervised and semi-supervised settings, our method considerably outperforms previous deep learning methods on different datasets and a public bioinformatics benchmark. Moreover, the learned representations generalize well when transferred to new datasets and tasks. The source code of the method and all experiments is available in the supplementary material.
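
For readers unfamiliar with the term, the reverse-complement operation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the function name `reverse_complement` and the example sequence are assumptions made for illustration only, and how the method uses the result is described in the paper, not here.

```python
# Illustrative only: names and the example sequence are hypothetical,
# not taken from the Self-GenomeNet source code.

# Watson-Crick pairing: A<->T, C<->G. N (unknown base) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse-complement of an uppercase DNA sequence."""
    # Complement every base, then reverse the string to read the
    # opposite strand in its own 5'-to-3' direction.
    return seq.translate(COMPLEMENT)[::-1]


example = "ATGCCG"
rc = reverse_complement(example)
print(rc)
```

Applying the operation twice returns the original sequence, which is why a sequence and its reverse-complement can be treated as two views of the same double-stranded molecule.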

One-sentence Summary: We introduce Self-GenomeNet, a novel contrastive self-supervised learning method for nucleotide-level genomic data that improves the quality of the learned representations and performance.

Supplementary Material: zip

