Fully distributed EM for very large datasets

Jason Andrew Wolfe, Aria Haghighi, Dan Klein

2008 (modified: 11 Nov 2022), ICML 2008, Readers: Everyone

Abstract: In EM and related algorithms, E-step computations distribute easily, because data items are independent given parameters. For very large data sets, however, even storing all of the parameters on a single node for the M-step can be impractical. We present a framework that fully distributes the entire EM procedure. Each node interacts only with the parameters relevant to its data, sending messages to other nodes along a junction-tree topology. We demonstrate improvements over a MapReduce topology on two tasks: word alignment and topic modeling.
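
To make the first claim of the abstract concrete, here is a minimal sketch of why the E-step distributes easily: each data shard can compute its expected sufficient statistics independently, and only the summed statistics are needed for the M-step. This illustrates the centralized, MapReduce-style baseline that the abstract contrasts with, not the junction-tree framework the paper proposes. The model (a 1-D Gaussian mixture with fixed unit variance), the toy data, and all function names are assumptions chosen for illustration.

```python
import math

def e_step_shard(shard, weights, means, var=1.0):
    """Expected sufficient statistics for one shard (the 'map' step)."""
    k = len(means)
    counts = [0.0] * k  # expected number of points assigned to each component
    sums = [0.0] * k    # expected sum of the points assigned to each component
    for x in shard:
        # Unnormalised posterior of each component for this point.
        post = [w * math.exp(-(x - m) ** 2 / (2 * var))
                for w, m in zip(weights, means)]
        z = sum(post)
        for j in range(k):
            r = post[j] / z
            counts[j] += r
            sums[j] += r * x
    return counts, sums

def m_step(stats):
    """Sum per-shard statistics (the 'reduce' step) and re-estimate parameters.

    This is the step that needs every parameter in one place.
    """
    k = len(stats[0][0])
    counts = [sum(s[0][j] for s in stats) for j in range(k)]
    sums = [sum(s[1][j] for s in stats) for j in range(k)]
    total = sum(counts)
    weights = [c / total for c in counts]
    means = [s / c for s, c in zip(sums, counts)]
    return weights, means

def distributed_em(shards, weights, means, iterations=20):
    """Run EM; the shard loop could run on separate nodes in parallel."""
    for _ in range(iterations):
        stats = [e_step_shard(shard, weights, means) for shard in shards]
        weights, means = m_step(stats)
    return weights, means

# Hypothetical toy data split across two shards.
shards = [[-2.1, -1.9, 2.0], [-2.0, 1.9, 2.1]]
weights, means = distributed_em(shards, [0.5, 0.5], [-1.0, 1.0])
```

Because the statistics are plain sums over data items, the result does not depend on how the data is split into shards, up to floating-point rounding. In this toy model the parameter vector is tiny; the abstract's concern is the setting where it is too large for the single node running `m_step`.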
