Neural Networks beyond explainability: Selective inference for sequence motifs
Abstract: Over the past decade, neural networks have been successful at making predictions from biological sequences, especially in the context of regulatory genomics. As in other fields of deep learning, tools have been devised to extract features such as sequence motifs that can explain the predictions made by a trained network. Here we intend to go beyond explainable machine learning and introduce SEISM, a selective inference procedure to test the association between these extracted features and the predicted phenotype. In particular, we discuss how training a one-layer convolutional network is formally equivalent to selecting motifs maximizing some association score. We adapt existing sampling-based selective inference procedures by quantizing this selection over an infinite set to a large but finite grid. Finally,we show that sampling under a specific choice of parameters is sufficient to characterize the composite null hypothesis typically used for selective inference - a result that goes well beyond our particular framework. We illustrate the behavior of our method in terms of calibration, power and speed and discuss its power/speed trade-off with a simpler data-split strategy. SEISM paves the way to an easier analysis of neural networks used in regulatory genomics, and to more powerful methods for genome wide association studies (GWAS).
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We sincerely express our gratitude to the action editors and reviewers for their time and feedback. According to action editors minor comment, we removed the reference to the context of regulatory genomics.
Supplementary Material: zip
Assigned Action Editor: ~Ole_Winther1
Submission Number: 688