Abstract: We design and implement a scalable version of loopy belief propagation (BP), a widely used algorithm for performing inference on probabilistic graphical models. Implementations of BP on generic data processing platforms such as Apache Spark, however, do not scale well to very large graphical models containing billions of vertices. To handle such large-scale graphs, we combine several strategies. Our implementation is based on Apache Spark GraphX. We propose a novel graph partitioning strategy that reduces both computation and communication overhead, providing a 2x speed-up. We also use efficient memory management for storing the graph and shared memory for high-speed communication. To evaluate performance and demonstrate the scalability of the approach, we perform a range of experiments, including on real-world graphs with billions of vertices, where we achieve an overall 10x speed-up over a vanilla Spark baseline. Further, we apply our BP implementation to infer the probability of a website being malicious by performing inference on a graphical model derived from real, large-scale hyperlinked web crawl data. We have open-sourced our implementation.