NeurALigner: Scalable and Robust DNA Sequence Alignment via Embedding-Based Similarity Search

14 Sept 2025 (modified: 24 Nov 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0
Keywords: DNA Alignment, DNA Language Model
Abstract: DNA sequence alignment is a fundamental task in genomics. Existing aligners rely on the seed–chain–align paradigm, which achieves high efficiency but struggles with sequencing errors and genetic variation. Moreover, most methods remain CPU-based and are poorly suited to large-scale GPU acceleration, limiting their utility in time-sensitive settings. In this paper, we present NeurALigner (NAL), a GPU-accelerated alignment framework that integrates DNA sequence models with vector database retrieval. Instead of exact string matching, NAL encodes DNA subsequences into embeddings and reformulates seed matching as a fast similarity search in feature space, providing robustness to mismatches caused by sequencing errors or genetic variation. The learned embeddings enable the use of longer seeds, which raises matching specificity and improves efficiency. Furthermore, an adaptive seeding strategy dynamically adjusts the number of seeds, balancing efficiency and accuracy. Together, these innovations enable scalable, mismatch-tolerant alignment with high specificity and strong GPU performance.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 5216
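
To make the abstract's central idea concrete (seed matching recast as similarity search over embeddings rather than exact string lookup), the following is a minimal sketch and not the authors' implementation. It assumes a toy encoder, an L2-normalised 3-mer count vector, in place of the learned DNA sequence model, and a brute-force cosine search in place of a GPU vector database. The names `embed`, `build_index`, and `search`, the seed length, and the stride are all hypothetical choices made for illustration.

```python
from itertools import product
from math import sqrt

K = 3  # k-mer size of the toy embedding (an assumption, not from the paper)
KMERS = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}


def embed(seq):
    """Toy stand-in for a learned encoder: L2-normalised 3-mer count vector."""
    vec = [0.0] * len(KMERS)
    for i in range(len(seq) - K + 1):
        idx = KMERS.get(seq[i:i + K])  # k-mers with non-ACGT symbols are skipped
        if idx is not None:
            vec[idx] += 1.0
    norm = sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(reference, seed_len, stride=1):
    """Embed reference windows; a real system would store these in a vector DB."""
    return [
        (pos, embed(reference[pos:pos + seed_len]))
        for pos in range(0, len(reference) - seed_len + 1, stride)
    ]


def search(index, seed, top_k=1):
    """Brute-force cosine search; returns (position, score), best first."""
    q = embed(seed)
    scored = [(sum(a * b for a, b in zip(q, vec)), pos) for pos, vec in index]
    scored.sort(key=lambda t: (-t[0], t[1]))  # ties go to the leftmost position
    return [(pos, score) for score, pos in scored[:top_k]]


if __name__ == "__main__":
    # Three hand-written 12-nt blocks; non-overlapping windows keep the toy
    # index easy to inspect by hand.
    reference = "ACGTTGCAAGCT" + "GGATCCATTCGA" + "TTAGGCCGTAAC"
    index = build_index(reference, seed_len=12, stride=12)

    # The seed is the middle block with one substitution (A -> G at offset 6).
    noisy_seed = "GGATCCGTTCGA"
    print("exact match found:", noisy_seed in reference)
    print("nearest window:", search(index, noisy_seed))
```

In this toy case the exact-string lookup finds nothing, while the similarity search still ranks the window at position 12 first, with a cosine score of about 0.7 against about 0.2 for the other two windows. A learned encoder would presumably separate true and spurious matches far more sharply than k-mer counts do; the sketch only illustrates why a single mismatch need not break seed retrieval.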