- TL;DR: We apply RNN to solve the biological problem of chromatin folding patterns prediction from epigenetic marks and demonstrate for the first time that utilization of memory of sequential states on DNA molecule is significant for the best performance.
- Abstract: The recent expansion of machine learning applications to molecular biology proved to have a significant contribution to our understanding of biological systems, and genome functioning in particular. Technological advances enabled the collection of large epigenetic datasets, including information about various DNA binding factors (ChIP-Seq) and DNA spatial structure (Hi-C). Several studies have confirmed the correlation between DNA binding factors and Topologically Associating Domains (TADs) in DNA structure. However, the information about physical proximity represented by genomic coordinate was not yet used for the improvement of the prediction models. In this research, we focus on Machine Learning methods for prediction of folding patterns of DNA in a classical model organism Drosophila melanogaster. The paper considers linear models with four types of regularization, Gradient Boosting and Recurrent Neural Networks for the prediction of chromatin folding patterns from epigenetic marks. The bidirectional LSTM RNN model outperformed all the models and gained the best prediction scores. This demonstrates the utilization of complex models and the importance of memory of sequential DNA states for the chromatin folding. We identify informative epigenetic features that lead to the further conclusion of their biological significance.
- Code: https://www.dropbox.com/sh/izy4exnkosd309o/AACl7N8xZX1X13EOSwC_mNj1a?dl=0
- Keywords: Machine Learning, Recurrent Neural Networks, 3D chromatin structure, topologically associating domains, computational biology.