PMI-guided Masking Strategy to Enable Few-shot Learning for Genomic Applications

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Keywords: gene sequence modeling, few-shot, MLM masking
TL;DR: PMI-masking in MLMs helps to achieve better few-shot classification performance in gene sequence modeling applications.
Abstract: Learning effective gene representations is of great research interest. Recently, large-scale language models based on the transformer architecture, such as DNABert and LOGO, have been proposed to learn gene representations from the Human Reference Genome. Although these large language models outperform previous approaches, no study has yet empirically determined the best strategy for representing gene sequences as tokens. Consequently, the uniform random masking strategy, the default during the pretraining of such masked language models, may lead to pretraining inefficiency and thus suboptimal downstream performance in the few-shot setting. Good few-shot performance is critical, however, since dataset sizes in (personalized) medicine often do not exceed a few hundred data points. In this paper, we develop a novel strategy that adapts "Pointwise Mutual Information (PMI) masking", previously used in the NLP setting, to the domain of gene sequence modeling. PMI-masking masks spans of tokens that are likely to co-occur, forming a statistically relevant span. First, we learn a vocabulary of tokens with high PMI scores from our pretraining corpus (the Human Reference Genome). Next, we use this side information to train our model by masking tokens based on their PMI scores. In extensive experiments, we evaluate the effectiveness of the PMI-masking strategy on two baseline models, DNABert and LOGO, over three benchmark datasets (two on promoters and one on enhancers) and a variety of few-shot settings. We observe that models pretrained with PMI-guided masking substantially outperform their uniformly masked baseline counterparts. We further observe that almost all of the top-ranked DNA tokens by PMI score are closely associated with known conserved DNA sequence motifs.
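The first step described in the abstract, scoring token co-occurrences by PMI over the pretraining corpus, can be sketched as follows. This is a minimal illustrative simplification, not the authors' implementation: it computes PMI only for adjacent pairs of non-overlapping 3-mer tokens, and the function name, parameters, and toy sequence are hypothetical.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Score adjacent token pairs by pointwise mutual information.

    PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ).
    High-PMI pairs co-occur more often than chance would predict and are
    candidates for joint (span) masking during MLM pretraining.
    This simplified sketch only considers bigrams, not longer spans.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:  # skip rare pairs with unreliable statistics
            continue
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores

# Toy corpus: a DNA string tokenized into non-overlapping 3-mers.
seq = "TATAAATATAAATATAAACGCGGC"
tokens = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
scores = pmi_scores(tokens, min_count=2)
top = max(scores, key=scores.get)  # highest-PMI pair, e.g. a TATA-like motif
```

In a pretraining pipeline, pairs (or longer spans) above a PMI threshold would be added to the masking vocabulary, so that when one token of a high-PMI span is chosen for masking, the whole span is masked together rather than leaking information through its unmasked neighbors.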
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Machine Learning for Sciences (eg biology, physics, health sciences, social sciences, climate/sustainability )
Supplementary Material: zip