DGCF: A Distributed Greedy Clustering Framework for Large-scale Genomic Sequences

Published: 01 Jan 2019, Last Modified: 16 May 2025, Venue: BIBM 2019, Readers: Everyone, License: CC BY-SA 4.0
Abstract: Clustering is a fundamental but time-consuming operation in biological sequence analysis. New sequencing technologies such as next-generation sequencing (NGS) and third-generation sequencing (3GS) have dramatically increased both dataset sizes and the length of individual reads. Existing tools, however, do not scale to large datasets or to long sequences. A feasible solution is to use parallel and distributed systems, but deploying such systems efficiently requires a high degree of parallelism in both the software implementation and the algorithmic design. In this paper, we propose DGCF, a Distributed Greedy Clustering Framework capable of handling large-scale datasets and long sequences. Our framework adopts a greedy clustering strategy that overlaps communication with computation across many distributed computing nodes. We also design and implement a sparse suffix array (SSA)-based alignment algorithm that supports long sequences. Experiments show that our framework achieves near-linear speedups on a distributed-memory cluster.
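To make the term "greedy clustering" concrete, the following is a minimal single-node sketch of the generic greedy incremental scheme: sequences are visited longest first, and each one either joins the first representative it is similar enough to or becomes a new representative. This is not DGCF's implementation. The function names, the `threshold` and `k` parameters, and the k-mer Jaccard score (used here as a cheap stand-in for the SSA-based alignment) are all assumptions made for illustration, and the distributed part (overlapping communication with computation) is not shown.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of k-mer sets (assumption: stand-in for an alignment score)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def greedy_cluster(seqs: List[str], threshold: float = 0.5, k: int = 3) -> Dict[int, List[int]]:
    """Greedy incremental clustering (hypothetical helper, not DGCF's API).

    Returns a mapping from representative index to the indices of its members.
    """
    # Visit sequences from longest to shortest; ties keep input order.
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]), reverse=True)
    clusters: Dict[int, List[int]] = {}
    for i in order:
        for rep in clusters:
            # Join the first representative that is similar enough.
            if similarity(seqs[i], seqs[rep], k) >= threshold:
                clusters[rep].append(i)
                break
        else:
            # No representative matched: this sequence starts a new cluster.
            clusters[i] = [i]
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTA", "TTTTGGGGCC", "TTTTGGGGC"]
    print(greedy_cluster(reads))
```

In this scheme each sequence is compared only against cluster representatives, which is what makes the strategy cheap; the cost is that results depend on the visiting order.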