Interpretable machine learning models for single-cell ChIP-seq imputation

Steffen Albrecht, Tommaso Andreani, Miguel A. Andrade-Navarro, Jean-Fred Fontaine

Published: 20 Dec 2019, Last Modified: 28 Nov 2025CrossrefEveryoneRevisionsCC BY-SA 4.0
Abstract: h3>Abstract</h3> <h3>Motivation</h3> <p>Single-cell ChIP-seq (scChIP-seq) analysis is challenging due to data sparsity. High degree of data sparsity in biological high-throughput single-cell data is generally handled with imputation methods that complete the data, but specific methods for scChIP-seq are lacking. We present SIMPA, a scChIP-seq data imputation method leveraging predictive information within bulk data from ENCODE to impute missing protein-DNA interacting regions of target histone marks or transcription factors.</p><h3>Results</h3> <p>Imputations using machine learning models trained for each single cell, each target, and each genomic region accurately preserve cell type clustering and improve pathway-related gene identification on real data. Results on simulated data show that 100 input genomic regions are already enough to train single-cell specific models for the imputation of thousands of undetected regions. Furthermore, SIMPA enables the interpretation of machine learning models by revealing interaction sites of a given single cell that are most important for the imputation model trained for a specific genomic region. The corresponding feature importance values derived from promoter-interaction profiles of H3K4me3, an activating histone mark, highly correlate with co-expression of genes that are present within the cell-type specific pathways. An imputation method that allows the interpretation of the underlying models facilitates users to gain an even deeper understanding of individual cells and, consequently, of sparse scChIP-seq datasets.</p><h3>Availability and implementation</h3> <p>Our interpretable imputation algorithm was implemented in Python and is available at https://github.com/salbrec/SIMPA</p>
Loading