ReFoRM: Reliable Per-Base Error Prediction under Distribution Shifts in DNA Storage

18 Sept 2025 (modified: 12 Nov 2025) | ICLR 2026 Conference Withdrawn Submission | Readers: Everyone | CC BY 4.0
Keywords: DNA Storage, Per-base Error Prediction, Distribution Shift, Feature Refinement, Condition-Aware Calibration
Abstract: Reliable prediction under distribution shift remains a core challenge in machine learning. In DNA storage pipelines, base-level errors are pervasive: synthesis introduces insertions and deletions, sequencing produces systematic substitutions, and amplification reshapes error distributions under stress conditions such as aging and PCR. Existing approaches typically either scale up large sequence models or concatenate handcrafted descriptors, and both strategies suffer from redundancy, instability, and poor calibration. To address these issues, we propose ReFoRM, a lightweight framework for reliable prediction under distribution shift. It consists of three components: (i) feature refinement, which selects a compact and informative subset from an over-complete feature pool; (ii) cross-attentive fusion, which integrates the refined descriptors with embeddings in a stable and balanced manner; and (iii) condition-aware calibration, which adjusts predictive confidence under distribution shifts. We evaluate ReFoRM on DNA storage error prediction, where descriptors such as GC content, homopolymer length, and structural accessibility provide a natural testbed with perturbations. Across digital twin and simulated datasets, ReFoRM achieves a PR–AUC of 0.9278 (in-distribution) and 0.9291 (LOCO), with an ECE of 0.0944 (in-distribution) and 0.0968 (LOCO), demonstrating extensibility and reliability under distribution shifts.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 10265
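
The abstract names two sequence descriptors (GC content, homopolymer length) and reports calibration as ECE. As a rough illustration only, the Python sketch below computes these quantities. It is not the authors' implementation: the function names, the per-position run-length convention, and the choice of a binary, equal-width-bin ECE over the predicted error probability are all assumptions made for this example.

```python
from typing import List, Sequence


def gc_content(seq: str) -> float:
    """Fraction of G/C bases in seq (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    s = seq.upper()
    return sum(base in "GC" for base in s) / len(s)


def homopolymer_run_lengths(seq: str) -> List[int]:
    """For each position, the length of the homopolymer run containing it."""
    out: List[int] = []
    i = 0
    while i < len(seq):
        j = i
        # Advance j to the end of the run of identical bases starting at i.
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        out.extend([j - i] * (j - i))
        i = j
    return out


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binary ECE (assumed variant): equal-width bins over the predicted
    error probability; per-bin |observed error rate - mean predicted
    probability|, weighted by the fraction of samples in the bin."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    n = len(probs)
    if n == 0:
        return 0.0
    sum_prob = [0.0] * n_bins
    sum_pos = [0.0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sum_prob[b] += p
        sum_pos[b] += y
        counts[b] += 1
    ece = 0.0
    for b in range(n_bins):
        if counts[b]:
            gap = abs(sum_pos[b] / counts[b] - sum_prob[b] / counts[b])
            ece += counts[b] / n * gap
    return ece
```

For example, `homopolymer_run_lengths("AAACGG")` assigns 3 to each of the first three positions, 1 to the `C`, and 2 to each `G`. A lower ECE means the predicted per-base error probabilities track the observed error rates more closely.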