Abstract: Whereas traditional scientific applications are computationally intensive, recent applications require more data-intensive analysis and visualization to extract knowledge from the explosive growth of scientific information and simulation data. As the computational power and size of compute clusters continue to increase, the I/O read rates and the associated network serving these data-intensive applications have been unable to keep pace. As a result, these applications suffer long I/O latency when moving “big data” from a network or parallel file system, which creates a serious performance bottleneck. To address this problem, we propose a novel approach called “ODDS” to optimize data-locality access in scientific data analysis and visualization. ODDS leverages a distributed file system (DFS) to provide scalable data access for scientific analysis. By exploiting information about the underlying data distribution in the DFS, ODDS employs a novel data-locality scheduler that transforms a compute-centric mapping into a data-centric one, enabling each computational process to access the data it needs from a local or nearby storage node. ODDS is suitable both for parallel applications with dynamic process-to-data scheduling and for applications with static process-to-data assignment. To demonstrate the efficacy of our methods, we present and evaluate ODDS in the context of two state-of-the-art scientific-analysis applications, mpiBLAST and ParaView, along with the Hadoop Distributed File System (HDFS), across a wide variety of computing platform settings. In comparison to existing deployments using NFS, PVFS, or Lustre as the underlying storage systems, ODDS can greatly reduce the I/O cost and double overall performance.
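The idea of a data-centric mapping can be illustrated with a minimal sketch. The abstract does not specify the ODDS scheduling algorithm, so the code below is not that scheduler; it is a simple greedy heuristic under stated assumptions: the replica locations of each data chunk are already known from the DFS metadata, each process runs on a known host, and work is balanced by a per-process quota. The names `assign_chunks`, `chunk_hosts`, and `proc_host` are hypothetical.

```python
from collections import defaultdict
from math import ceil


def assign_chunks(chunk_hosts, proc_host):
    """Greedy, locality-first mapping of chunks to process ranks.

    chunk_hosts: dict chunk_id -> list of hosts holding a replica (assumed input).
    proc_host:   dict process rank -> host the process runs on (assumed input).
    Returns a dict chunk_id -> rank. Illustrative only, not the ODDS algorithm.
    """
    # Cap the work per process so locality does not destroy load balance.
    quota = ceil(len(chunk_hosts) / len(proc_host))
    load = {rank: 0 for rank in proc_host}

    # Index processes by host so replica hosts can be matched quickly.
    procs_on = defaultdict(list)
    for rank, host in proc_host.items():
        procs_on[host].append(rank)

    assignment = {}
    # Place chunks with the fewest replicas first: they have the fewest local options.
    for chunk in sorted(chunk_hosts, key=lambda c: len(chunk_hosts[c])):
        local = [
            rank
            for host in chunk_hosts[chunk]
            for rank in procs_on.get(host, [])
            if load[rank] < quota
        ]
        # Fall back to any process with spare quota (a remote read) if no local one fits.
        candidates = local or [rank for rank in load if load[rank] < quota]
        rank = min(candidates, key=lambda r: (load[r], r))
        assignment[chunk] = rank
        load[rank] += 1
    return assignment


def local_fraction(assignment, chunk_hosts, proc_host):
    """Fraction of chunks that are read from the process's own host."""
    hits = sum(proc_host[rank] in chunk_hosts[chunk] for chunk, rank in assignment.items())
    return hits / len(assignment)


# Toy cluster: three nodes, one process per node, four chunks with 1-2 replicas each.
chunk_hosts = {"c0": ["n0", "n1"], "c1": ["n1"], "c2": ["n2"], "c3": ["n0", "n2"]}
proc_host = {0: "n0", 1: "n1", 2: "n2"}
plan = assign_chunks(chunk_hosts, proc_host)
print(plan, local_fraction(plan, chunk_hosts, proc_host))
```

In this toy layout every chunk can be served by a process on a host that stores a replica. A compute-centric mapping, such as assigning chunks to ranks round-robin without consulting replica locations, would in general force some reads over the network.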