Abstract: Graph partitioning plays a vital role in distributed large-scale web graph analytics, such as PageRank and label propagation. The quality and scalability of the partitioning strategy strongly affect such communication- and computation-intensive applications, since the strategy determines the communication cost and the workload balance among distributed computing nodes. Recently, the streaming model has shown promise in optimizing graph partitioning. However, existing streaming partitioning strategies either lack adequate quality or fall short in scaling to a large number of partitions. In this work, we explore the clustering property of web graphs and propose a novel restreaming algorithm for vertex-cut partitioning. We investigate a series of techniques, which are pipelined as three steps: streaming clustering, cluster partitioning, and partition transformation. Moreover, these techniques can be adapted to a parallel mechanism to further accelerate partitioning. Experiments on real datasets and real systems show that our algorithm outperforms state-of-the-art vertex-cut partitioning methods in large-scale web graph processing. Surprisingly, the runtime cost of our method can be an order of magnitude lower than that of one-pass streaming partitioning algorithms when the number of partitions is large.
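To make the two quality criteria named above concrete (communication cost and workload balance), the following is a minimal sketch of how a vertex-cut partitioning is commonly scored. It is not the algorithm proposed in this work. It assumes the standard replication factor (average number of partitions holding a copy of each vertex) as a proxy for communication cost, and the ratio of the largest partition's edge count to the ideal edge count as the balance measure. The function name `vertex_cut_metrics` and the toy graph are hypothetical.

```python
from collections import defaultdict


def vertex_cut_metrics(edges, assignment, num_partitions):
    """Score an edge-to-partition assignment (vertex-cut partitioning).

    Returns (replication_factor, load_imbalance). Both definitions are
    assumptions for illustration, not taken from the paper.
    """
    replicas = defaultdict(set)   # vertex -> set of partitions holding a copy
    loads = [0] * num_partitions  # number of edges stored on each partition

    for (u, v), p in zip(edges, assignment):
        replicas[u].add(p)
        replicas[v].add(p)
        loads[p] += 1

    # Average number of copies per vertex; 1.0 means no vertex is cut.
    replication_factor = sum(len(ps) for ps in replicas.values()) / len(replicas)
    # Largest partition relative to a perfectly even split; 1.0 is ideal.
    load_imbalance = max(loads) / (len(edges) / num_partitions)
    return replication_factor, load_imbalance


# Toy graph (hypothetical): a triangle 0-1-2 joined to a path 2-3-4-5.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5)]

# Assignment that keeps each densely connected group on one partition.
clustered = [0, 0, 0, 1, 1, 1]
# Assignment that alternates partitions and ignores graph structure.
scattered = [0, 1, 0, 1, 0, 1]

rf_clustered, bal_clustered = vertex_cut_metrics(edges, clustered, 2)
rf_scattered, bal_scattered = vertex_cut_metrics(edges, scattered, 2)
print(f"clustered: RF={rf_clustered:.3f}, imbalance={bal_clustered:.2f}")
print(f"scattered: RF={rf_scattered:.3f}, imbalance={bal_scattered:.2f}")
```

Both assignments are perfectly balanced in this toy case, but the assignment that respects cluster structure replicates fewer vertices. This is the intuition behind partitioning by clusters rather than edge by edge.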