ODDS: Optimizing Data-Locality Access for Scientific Data Analysis

Jun Wang, Dezhi Han, Jiangling Yin, Xiaobo Zhou, Changjun Jiang

Published: 2020, Last Modified: 17 Apr 2025. IEEE Trans. Cloud Comput. 2020. License: CC BY-SA 4.0

Abstract: Whereas traditional scientific applications are computationally intensive, recent applications require more data-intensive analysis and visualization to extract knowledge from the explosive growth of scientific information and simulation data. As the computational power and size of compute clusters continue to increase, the I/O read rates and associated network bandwidth for these data-intensive applications have been unable to keep pace. These applications suffer from long I/O latency due to the movement of “big data” from the network/parallel file system, which results in a serious performance bottleneck. To address this problem, we propose a novel approach called “ODDS” to optimize data-locality access in scientific data analysis and visualization. ODDS leverages a distributed file system (DFS) to provide scalable data access for scientific analysis. By exploiting information about the underlying data distribution in the DFS, ODDS employs a novel data-locality scheduler to transform a compute-centric mapping into a data-centric one, enabling each computational process to access the data it needs from a local or nearby storage node. ODDS is suitable both for parallel applications with dynamic process-to-data scheduling and for applications with static process-to-data assignment. To demonstrate the efficacy of our methods, we present and evaluate ODDS in the context of two state-of-the-art scientific-analysis applications, mpiBLAST and ParaView, along with the Hadoop distributed file system (HDFS), across a wide variety of computing platform settings. In comparison to existing deployments using NFS, PVFS, or Lustre as the underlying storage systems, ODDS can greatly reduce the I/O cost and double overall performance.
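The abstract does not spell out how the scheduler works, so the following is only an illustrative sketch of the general idea of a data-centric mapping, not the authors' algorithm. It assumes the DFS can report which nodes hold a replica of each chunk, and it greedily assigns each chunk to the least-loaded process running on one of those nodes, falling back to any process when no replica is local. The function name `assign_chunks` and the input layouts `chunk_replicas` and `process_nodes` are hypothetical.

```python
from collections import defaultdict


def assign_chunks(chunk_replicas, process_nodes):
    """Greedy locality-aware assignment (illustrative only).

    chunk_replicas: dict mapping chunk id -> list of node names holding a replica
    process_nodes:  dict mapping process rank -> node name the process runs on
    Returns (assignment, local_hits), where assignment maps rank -> list of
    chunk ids and local_hits counts chunks placed on a node holding a replica.
    """
    load = {rank: 0 for rank in process_nodes}
    assignment = defaultdict(list)
    local_hits = 0

    # Place the least-replicated chunks first: they have the fewest local options.
    order = sorted(chunk_replicas, key=lambda c: (len(chunk_replicas[c]), c))
    for chunk in order:
        replicas = set(chunk_replicas[chunk])
        # Processes co-located with a replica of this chunk.
        local = [r for r, node in process_nodes.items() if node in replicas]
        # Fall back to all processes (a remote read) if no replica is local.
        candidates = local or list(process_nodes)
        # Pick the least-loaded candidate; break ties by rank for determinism.
        rank = min(candidates, key=lambda r: (load[r], r))
        if local:
            local_hits += 1
        assignment[rank].append(chunk)
        load[rank] += 1

    return dict(assignment), local_hits
```

In a static process-to-data setting, such a function could be run once before the analysis starts; in a dynamic setting, a similar selection could be made each time an idle process requests work. Both usages are assumptions made here for illustration.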