Automated distillation of genomic equations governing single cell gene expression

Published: 28 Oct 2023, Last Modified: 02 Dec 2023, NeurIPS2023-AI4Science Poster
Keywords: single-cell gene expression, symbolic regression, neural architecture search, interpretable AI
TL;DR: We distill genomic equations relating sequence classes to gene expression from a neural network using symbolic regression
Abstract: Gene expression is an essential cellular process controlled by a complex, orchestrated regulatory network of transcription factors and epigenetic modifications. Advances in single-cell RNA sequencing enable the investigation of gene expression control at unprecedented resolution and scale. Yet the sequence determinants underlying distinct primary cell types remain elusive. While deep neural networks have shown strong performance in predicting gene expression, their predictions lack meaningful explanations, particularly for a systematic understanding of molecular mechanisms, which motivates the search for more transparent models. We present an automated model that predicts gene expression from genetic sequences while providing both strong performance and direct interpretations of its predictions. Our model combines a pre-trained genetic sequence class model and neural architecture search with symbolic regression to distill explainable genomic equations. We applied our method to an in-house single-cell gene expression dataset from the human pituitary (a specialized gland in the brain that controls the endocrine system). The prediction accuracy of the distilled genomic equations (Pearson r=0.713) is comparable to that of other explainable models, without artificially introducing strong inductive biases that may not hold for a complex and potentially non-linear cellular system. The genomic equations shed light on how sequence classes interact and regulate the cell type-specific, finely controlled transcriptomic program in the human endocrine system. To our knowledge, this is the first attempt at distilling genomic equations from neural networks using symbolic regression.
Submission Track: Original Research
Submission Number: 100
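
Illustrative sketch: the abstract describes distilling a closed-form equation from a trained network with symbolic regression and scoring it by Pearson r. The toy below shows only that general idea and is not the authors' pipeline. It omits the pre-trained sequence class model and the neural architecture search, and it swaps a real symbolic regression engine for a brute-force search over a tiny expression set. The `teacher` function, the feature names `x0..x2`, and the candidate grammar are all hypothetical stand-ins.

```python
import itertools
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def teacher(x):
    """Hypothetical stand-in for a trained network mapping class scores to expression."""
    return x[0] * x[1] + 0.1 * x[2]


# Assumed toy grammar: single features and one binary operation on two features.
BINARY_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def candidates(n_features):
    """Yield (expression string, callable) pairs from the toy grammar."""
    for i in range(n_features):
        yield f"x{i}", (lambda x, i=i: x[i])
    for i, j in itertools.combinations(range(n_features), 2):
        for name, op in BINARY_OPS.items():
            yield f"(x{i} {name} x{j})", (lambda x, i=i, j=j, op=op: op(x[i], x[j]))


def distill(X, y_teacher):
    """Return the candidate expression whose outputs correlate best with the teacher."""
    best_expr, best_r = None, -2.0
    for expr, fn in candidates(len(X[0])):
        r = pearson_r([fn(x) for x in X], y_teacher)
        if r > best_r:
            best_expr, best_r = expr, r
    return best_expr, best_r


rng = random.Random(0)  # fixed seed so the sketch is reproducible
X = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(200)]
y_teacher = [teacher(x) for x in X]

best_expr, best_r = distill(X, y_teacher)
print(best_expr, round(best_r, 3))
```

In practice a dedicated symbolic regression library would search a much larger expression space and trade accuracy against equation complexity. The exhaustive loop here is only meant to make the distill-then-score step concrete.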