Blind Biological Sequence Denoising with Self-Supervised Set Learning

Published: 31 Jan 2024, Last Modified: 31 Jan 2024. Accepted by TMLR.
Abstract: Biological sequence analysis relies on the ability to denoise the imprecise output of sequencing platforms. We consider a common setting where a short sequence is read out repeatedly using a high-throughput long-read platform to generate multiple subreads, or noisy observations of the same sequence. Denoising these subreads with alignment-based approaches often fails when too few subreads are available or error rates are too high. In this paper, we propose a novel method for blindly denoising sets of sequences without directly observing clean source sequence labels. Our method, Self-Supervised Set Learning (SSSL), gathers subreads together in an embedding space and estimates a single set embedding as the midpoint of the subreads in both the latent and sequence spaces. This set embedding represents the “average” of the subreads and can be decoded into a prediction of the clean sequence. In experiments on simulated long-read DNA data, SSSL methods denoise small reads of ≤ 6 subreads with 17% fewer errors and large reads of > 6 subreads with 8% fewer errors compared to the best baseline. On a real dataset of antibody sequences, SSSL improves over baselines on two self-supervised metrics, with a significant improvement on difficult small reads that comprise over 60% of the test set. By accurately denoising these reads, SSSL promises to better realize the potential of high-throughput DNA sequencing data for downstream scientific applications.
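To make the set-averaging idea in the abstract concrete, here is a minimal, self-contained sketch. It is not the paper's learned model: the encoder below is a toy one-hot embedding (an assumption for illustration), under which averaging the subread embeddings and decoding the midpoint reduces to per-position consensus voting. In SSSL the encoder and decoder are learned, and subreads need not share a length.

```python
import numpy as np

ALPHABET = "ACGT"
IDX = {c: i for i, c in enumerate(ALPHABET)}

def embed(seq: str, length: int) -> np.ndarray:
    """Toy 'encoder': one-hot embed a subread, zero-padded to a fixed length.

    Stands in for SSSL's learned encoder purely for illustration.
    """
    m = np.zeros((length, len(ALPHABET)))
    for i, c in enumerate(seq):
        m[i, IDX[c]] = 1.0
    return m

def denoise(subreads: list[str]) -> str:
    """Average the subread embeddings into a single 'set embedding',
    then decode it back to a sequence (argmax per position)."""
    length = max(len(s) for s in subreads)
    set_emb = np.mean([embed(s, length) for s in subreads], axis=0)
    return "".join(ALPHABET[i] for i in set_emb.argmax(axis=1))

# Three noisy observations of the same source sequence; the averaged
# embedding decodes to the per-position majority.
print(denoise(["ACGTT", "ACGAT", "ACGTT"]))  # → ACGTT
```

With a learned encoder the average is taken in a continuous latent space rather than over one-hot vectors, so the decoded midpoint can correct errors that no single subread position-vote would fix.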
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url:
Changes Since Last Submission: Below we list the changes since the last submission.
1. Updated the paper title to “Blind Denoising of Biological Sequences...” to better situate our method in biology-specific contexts.
2. Added additional context and related work that better situates our problem setting relative to existing work.
3. Added 4 additional baseline methods: 3 MSA-based methods and one median string algorithm.
4. Improved our synthetic data generation process by using PBSIM2 [1], which accurately models the ONT sequencer and error profile used to sequence our datasets of scFv antibodies.
5. Removed the motivation of LOO edit as an upper bound.
6. Adjusted the formulation of fractal entropy to use the KL divergence between the smoothed densities of the observed read and the denoised sequence.
7. Added experiments on a model using a mean pooling aggregation operation in addition to the set transformer aggregation operation.
8. Larger graphs, with all baselines rerun on both the PBSIM2-simulated data and the scFv antibody data.
9. Additional ablations on the aggregation method, MMD kernel, and number of reads seen during training.
10. Added additional discussion of the limitations of the empirical evaluation.

In the Introduction: "We focus on the antibody light chains in this work since their ground truth sequences exhibit fewer variations and thus more similarity between reads, making self-supervision a natural choice."

In the Experimental Setup: "The scFv library was experimentally designed with more variation in heavy chain ground truth sequences compared to light chain ground truth sequences. In this paper we focus on the light chain sequences since the higher similarity between reads lends itself well to the self-supervised method we propose. Since the lack of ground truth sequences means we cannot explicitly measure edit distance, we use our proposed LOO edit and fractal entropy metrics to compare the quality of denoised sequences instead."

[1] Yukiteru Ono, Kiyoshi Asai, Michiaki Hamada. PBSIM2: a simulator for long-read sequencers with a novel generative model of quality scores. 2021.
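Change 6 describes fractal entropy as a KL divergence between smoothed densities of the observed read and the denoised sequence. As an illustration only, here is a minimal sketch of that style of comparison using add-alpha smoothed k-mer frequency distributions; the choice of k-mer densities, the smoothing scheme, and the values of `k` and `alpha` are assumptions for this sketch, not the paper's fractal-entropy definition.

```python
import math
from collections import Counter
from itertools import product

def kmer_density(seq: str, k: int = 3, alphabet: str = "ACGT",
                 alpha: float = 1.0) -> dict:
    """Add-alpha smoothed k-mer frequency distribution over all |alphabet|^k k-mers."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values()) + alpha * len(alphabet) ** k
    return {
        "".join(km): (counts["".join(km)] + alpha) / total
        for km in product(alphabet, repeat=k)
    }

def kl_divergence(p: dict, q: dict) -> float:
    """KL(p || q); smoothing guarantees every q[x] > 0."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p)

# Compare an observed noisy read against a candidate denoised sequence:
# lower divergence means their smoothed k-mer statistics agree more closely.
read = "ACGTACGTTACGT"
denoised = "ACGTACGTACGT"
score = kl_divergence(kmer_density(read), kmer_density(denoised))
print(score)
```

A metric of this form needs no ground-truth labels, which is why it can rank denoised outputs on the real scFv dataset where edit distance to a clean reference cannot be measured.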
Assigned Action Editor: ~Andriy_Mnih1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 1534