Distributed, Scalable Clustering for Detecting Halos in Terascale Astronomy Datasets

Srivatsava Daruru, Sankari Dhandapani, Gunjan Gupta, Ilian Iliev, Weijia Xu, Paul A. Navrátil, Nena M. Marin, Joydeep Ghosh

Published: 01 Jan 2010, Last Modified: 20 May 2025. Venue: ICDM Workshops 2010. License: CC BY-SA 4.0

Abstract: Terascale astronomical datasets have the potential to provide unprecedented insights into the origins of our universe. However, automated techniques for determining regions of interest are a must if domain experts are to cope with otherwise intractable amounts of simulation data. This paper addresses the important problem of locating and tracking high-density regions in space that generally correspond to halos and sub-halos, and that host galaxies. A density-based, mode-following clustering method called Automated Hierarchical Density Shaving (Auto-HDS) is adapted for this application. Auto-HDS can detect clusters of different densities while discarding the vast majority of background data. Two alternative parallel implementations of the algorithm, one based on the dataflow computational model and the other on Hadoop/MapReduce functional programming constructs, are realized and compared. By comparing runtime performance and scalability across compute cores and across increasing data volumes, we demonstrate the benefits of fine-grained parallelism. The proposed distributed and multithreaded Auto-HDS clustering algorithm is shown to produce high-quality clusters, to be computationally efficient, and to scale from 1 to 1024 compute cores.
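To make the idea of discarding background data while keeping dense regions concrete, the following is a minimal single-machine sketch of one density-shaving level. It is an illustration only and not the Auto-HDS implementation described in the paper: the function name `density_shave` and the parameters `k` and `keep_fraction` are hypothetical, density is approximated by the distance to the k-th nearest neighbor, and the brute-force O(n^2) distance computation is suitable only for toy inputs.

```python
import math
from collections import deque


def density_shave(points, k=3, keep_fraction=0.5):
    """Toy single-level density shaving (illustrative; not the paper's code).

    points: list of coordinate tuples.
    k: neighbor rank used as a density proxy (hypothetical parameter).
    keep_fraction: fraction of densest points retained (hypothetical parameter).
    Returns a list of labels: -1 for shaved background, else a cluster id.
    """
    n = len(points)
    # Distance from each point to its k-th nearest neighbor: smaller means denser.
    kth = []
    for i in range(n):
        d = sorted(math.dist(points[i], points[j]) for j in range(n) if j != i)
        kth.append(d[k - 1])

    # Keep the densest points; the linking radius is the largest k-NN distance kept.
    n_keep = max(1, int(n * keep_fraction))
    order = sorted(range(n), key=lambda i: kth[i])
    dense = set(order[:n_keep])
    radius = kth[order[n_keep - 1]]

    # Link dense points within the radius; connected components are the clusters.
    labels = [-1] * n
    cluster = 0
    for start in order[:n_keep]:
        if labels[start] != -1:
            continue
        labels[start] = cluster
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in dense:
                if labels[j] == -1 and math.dist(points[i], points[j]) <= radius:
                    labels[j] = cluster
                    queue.append(j)
        cluster += 1
    return labels
```

Calling `density_shave` on two tight groups of points plus a few scattered outliers, with `keep_fraction` chosen so that only the grouped points survive, labels each group as its own cluster and marks the outliers as `-1`. A hierarchical method would repeat this kind of step over a range of shaving fractions; that part is omitted here.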