Overlap-based genome assembly from variable-length reads

Joseph Hui, Ilan Shomorony, Kannan Ramchandran, Thomas A. Courtade

2016 (modified: 04 Nov 2022)ISIT 2016Readers: Everyone

Abstract: Recently developed high-throughput sequencing platforms can generate very long reads, making the perfect assembly of whole genomes information-theoretically possible [1]. One of the challenges in achieving this goal in practice, however, is that traditional assembly algorithms based on the de Bruijn graph framework cannot handle the high error rates of long-read technologies. On the other hand, overlap-based approaches such as string graphs [2] are very robust to errors, but cannot achieve the theoretical lower bounds. In particular, these methods handle the variable-length reads provided by long-read technologies in a suboptimal manner. In this work, we introduce a new assembly algorithm with two desirable features in the context of long-read sequencing: (1) it is an overlap-based method, thus being more resilient to read errors than de Bruijn graph approaches; and (2) it achieves the information-theoretic bounds even in the variable-length read setting.

0 Replies