Interpretable Feature Engineering for Nanopore Sequencing Basecalling: Learning Biophysical Patterns in Pore Models

15 Sept 2025 (modified: 08 Oct 2025) | Submitted to Agents4Science | License: CC BY 4.0

Keywords: Pore Models, Biophysical Patterns, Interpretable Feature

Abstract: Nanopore sequencing has emerged as a transformative platform for long-read DNA analysis, yet state-of-the-art basecallers rely on opaque deep learning models that limit interpretability and hinder systematic improvement. Here we present a proof-of-concept study demonstrating that interpretable, biophysically motivated feature engineering can capture key determinants of nanopore signals with competitive accuracy. Using the ONT R9.4 pore model, we construct single-nucleotide and pairwise interaction features and apply LASSO regularization to identify 50 informative predictors from an initial pool of 420. The resulting linear model reduces mean squared error by 87% compared with one-hot encoding and outperforms a two-layer neural network baseline, while providing mechanistic insights into signal modulation at the pore constriction. On synthetic homopolymers, our approach achieves a 96% error reduction, though limited sample size prevents strong conclusions. These findings highlight that interpretable models can not only approach the performance of black-box architectures but also elucidate the underlying physics of nanopore sequencing. While current results are restricted to noise-free synthetic data, this work outlines a path toward transparent, auditable, and efficient basecalling frameworks with potential relevance for both research and clinical applications.
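
The feature-construction and selection steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. Everything in it is an assumption made for illustration: the feature naming, the choice of k = 3 (to keep a pure-Python run fast), the hand-picked "true" effects, the baseline level of 80.0, the penalty `alpha`, and the use of plain coordinate descent as the LASSO solver. Under this scheme a 3-mer yields 60 candidate features and a 6-mer would yield 264, so the sketch does not reproduce the 420-feature pool, whose exact construction the abstract does not specify.

```python
import itertools

BASES = "ACGT"

def kmer_features(kmer):
    """Active binary features for one k-mer (hypothetical naming scheme)."""
    # One indicator per (position, base).
    feats = {("single", i, b) for i, b in enumerate(kmer)}
    # One indicator per unordered position pair and the bases found there.
    for i, j in itertools.combinations(range(len(kmer)), 2):
        feats.add(("pair", i, j, kmer[i], kmer[j]))
    return feats

def design_matrix(k):
    """Enumerate all k-mers and build a dense 0/1 design matrix."""
    kmers = ["".join(p) for p in itertools.product(BASES, repeat=k)]
    names = sorted({f for km in kmers for f in kmer_features(km)})
    index = {f: c for c, f in enumerate(names)}
    X = [[0.0] * len(names) for _ in kmers]
    for r, km in enumerate(kmers):
        for f in kmer_features(km):
            X[r][index[f]] = 1.0
    return kmers, names, X

def soft_threshold(z, t):
    """Shrink z toward zero by t; the proximal step for the L1 penalty."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, alpha, n_sweeps=200):
    """Coordinate descent for (1/2n)||y - b - Xw||^2 + alpha * ||w||_1."""
    n, p = len(X), len(X[0])
    cols = [[X[r][c] for r in range(n)] for c in range(p)]
    col_sq = [sum(v * v for v in col) / n for col in cols]
    intercept = sum(y) / n
    w = [0.0] * p
    resid = [yi - intercept for yi in y]  # y - intercept - Xw
    for _ in range(n_sweeps):
        for c in range(p):
            if col_sq[c] == 0.0:
                continue
            col = cols[c]
            # Correlation of this column with the residual excluding w[c].
            rho = sum(col[r] * resid[r] for r in range(n)) / n + col_sq[c] * w[c]
            new_w = soft_threshold(rho, alpha) / col_sq[c]
            delta = new_w - w[c]
            if delta != 0.0:
                for r in range(n):
                    resid[r] -= delta * col[r]
                w[c] = new_w
        # The intercept is unpenalized: re-center the residual.
        shift = sum(resid) / n
        intercept += shift
        resid = [v - shift for v in resid]
    return intercept, w

def fit_demo(k=3, alpha=0.01):
    """Fit noise-free synthetic levels generated from made-up sparse effects."""
    kmers, names, X = design_matrix(k)
    # Hypothetical effects, not taken from any real pore model.
    true_effects = {
        ("single", 1, "A"): 4.0,
        ("single", 2, "T"): -3.0,
        ("pair", 0, 1, "G", "G"): 2.5,
    }
    col_of = {f: c for c, f in enumerate(names)}
    y = [80.0 + sum(v * row[col_of[f]] for f, v in true_effects.items()) for row in X]
    intercept, w = lasso_cd(X, y, alpha)
    pred = [intercept + sum(wc * xc for wc, xc in zip(w, row)) for row in X]
    mse = sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)
    mean_y = sum(y) / len(y)
    var = sum((a - mean_y) ** 2 for a in y) / len(y)
    selected = [names[c] for c, v in enumerate(w) if abs(v) > 1e-8]
    return selected, mse, var

if __name__ == "__main__":
    selected, mse, var = fit_demo()
    print(f"{len(selected)} features kept; MSE {mse:.4g} vs. target variance {var:.4g}")
```

Because the single-position and pairwise indicators are collinear, the set of features the L1 penalty keeps is not unique, so the selected names in a sketch like this should be read as one sparse explanation of the signal rather than a definitive ranking.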

Submission Number: 193
