Contrastive RNA Representation Learning Through Maximizing Mutual Information Between Splice Isoforms

Published: 04 Mar 2024, Last Modified: 29 Apr 2024, GEM Poster, CC BY 4.0
Track: Machine learning: computational method and/or computational results
Cell: I do not want my work to be considered for Cell Systems
Keywords: Contrastive learning, genomics, self-supervised learning, RNA, representation learning, deep metric learning, SimCLR, alternative splicing
TL;DR: Contrastive pre-training that minimizes the projection distance between splice isoforms and homologous genes yields generalizable RNA representations.
Abstract: In the face of rapidly accumulating genomic data, our understanding of the RNA regulatory code remains incomplete. Recent self-supervised methods in other domains have demonstrated the ability to learn rules underlying the data-generating process, such as sentence structure in language. Inspired by this, we extend contrastive learning techniques to genomic data by utilizing functional similarities between sequences generated through alternative splicing and gene duplication. We introduce IsoCLR, a model trained on a novel dataset with a contrastive objective, enabling the learning of generalized RNA isoform representations. We validate representation utility on downstream tasks such as RNA half-life and mean ribosome load prediction. Our pre-training strategy yields competitive results using linear probing across six tasks, along with up to a two-fold increase in Pearson correlation in low-data conditions. Importantly, our exploration of the learned latent space reveals that our contrastive objective yields semantically meaningful representations, underscoring its potential as a valuable initialization technique for RNA property prediction.
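The abstract does not spell out the loss function. As a minimal sketch, assuming a SimCLR-style NT-Xent objective (SimCLR is listed among the keywords), the idea of pulling together projections of functionally related sequences can be illustrated as follows. The function name `nt_xent_loss`, the temperature value, and the NumPy implementation are illustrative assumptions, not the authors' code.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.1):
    """SimCLR-style NT-Xent loss for N positive pairs (illustrative sketch).

    z_a[i] and z_b[i] are projections of two sequences treated as a positive
    pair (for example, two splice isoforms of one gene, or two homologous
    genes). Every other row in the batch acts as a negative.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)             # shape (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize rows
    sim = z @ z.T / temperature                        # scaled cosine similarity
    np.fill_diagonal(sim, -np.inf)                     # a row is never its own candidate

    # Row i in the first half is paired with row i + N, and vice versa.
    pos_idx = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable row-wise log-softmax.
    row_max = sim.max(axis=1, keepdims=True)
    shifted = sim - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Average negative log-probability assigned to the positive partner.
    return -log_prob[np.arange(2 * n), pos_idx].mean()
```

Under this sketch, a batch whose paired rows have nearby projections gets a lower loss than a batch whose pairs are unrelated, which is the behavior a contrastive objective rewards during pre-training.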
Submission Number: 76