Abstract: Bayesian coresets are small weighted subsamples of the original data that aim to preserve the full-data posterior. Many algorithms for Bayesian coreset construction approximate likelihood functions as vectors in a normed space. This approximation often requires iteratively recomputing the likelihood of the full dataset, which makes construction very time-consuming, especially for high-dimensional data. One way to reduce computation time is to construct parts of the coreset on separate processors. This approach is well studied in frequentist coreset construction for K-Means clustering, where guarantees for random partitioning are obtained by using cluster centers as global anchors. However, because likelihoods can describe more sophisticated data geometry, distributed Bayesian coreset construction remains challenging. We exploit the relationship between the K-Means and EM algorithms to propose a partitioning strategy that uses maximum likelihood estimates as anchors of the dataset’s global properties. We compare this strategy with random partitioning. Experiments show that our method outperforms random partitioning on datasets with complex geometry and still reduces construction time compared with building the coreset on the full dataset.