This blog post takes the form of a survey paper, summarizing the proceedings of the Symposium on Approximate Bayesian Inference held during NeurIPS 2019 in Vancouver. I focused on this symposium because it emphasized fundamentals rather than bleeding-edge results; the knowledge that leads to progress comes from understanding how things work at a foundational level. Conferences have traditionally served as the de facto venue for disseminating the latest work in the field, and it is customary for them to accept work that pushes the “state of the art”. Yet every work claiming to be state of the art should answer one question without ambiguity: are the improvements in the performance metrics due to the novelty of the method, the preprocessing steps, random effects, or sheer luck? I attended every session of the Symposium to calibrate my understanding of Bayesian statistics and to engage with researchers during the poster sessions. Fortunately, the contents of some talks were already familiar to me, although the Bayesian world uses a plethora of jargon that can make simple concepts look convoluted.
Furthermore, I observed a growing effort to unify the Bayesian world with the neural network world. One reason is that uncertainty quantification is easier when your model contains some form of Gaussian process, and a few talks tried to draw this connection. One of the clearest attempts was the Neural Tangents talk, whose premise hinges on a question: can Gaussian processes be used as a building block for Bayesian deep learning? Neural Tangents is an easy-to-use library for creating finite-width and infinite-width neural networks grounded in Bayesian modeling, and it provides a way to analyze the training dynamics of a neural network. Thanks to its Bayesian origins, the library can learn from small datasets. This was the first time I heard the term “infinite-width neural network”, and the details were not fully clear to me; I later found a description in a paper released at ICLR 2019. Surprisingly, I also found a widespread misunderstanding of “noise” at the workshop: some speakers used it to mean variance, bias, overfitting, or underfitting. There is a need for the field to unify its conventions; I can live with having one more acronym to memorize. Now, let us discuss the main themes of the workshop. The core of the Symposium was on the following topics:
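The connection between infinite-width networks and Gaussian processes can be made concrete without the Neural Tangents library itself. As an illustrative sketch (this is not the library's API), the kernel of an infinitely wide one-hidden-layer ReLU network with unit-variance weights has a closed form, the order-1 arc-cosine kernel, and "training" the infinite network on a small dataset reduces to Gaussian process regression with that kernel:

```python
import numpy as np

def relu_nngp_kernel(x1, x2):
    """NNGP kernel of an infinitely wide one-hidden-layer ReLU network
    (unit-variance weights). Closed form (arc-cosine kernel of order 1):
        k(x, x') = ||x|| ||x'|| (sin t + (pi - t) cos t) / (2 pi),
    where t is the angle between x and x'.
    """
    n1, n2 = np.linalg.norm(x1), np.linalg.norm(x2)
    cos_t = np.clip(x1 @ x2 / (n1 * n2), -1.0, 1.0)
    t = np.arccos(cos_t)
    return n1 * n2 * (np.sin(t) + (np.pi - t) * np.cos(t)) / (2 * np.pi)

# GP regression with this kernel "trains" the infinite network in closed form
# on a tiny (hypothetical) dataset -- no gradient descent involved.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0])
K = np.array([[relu_nngp_kernel(a, b) for b in X] for a in X])
alpha = np.linalg.solve(K + 1e-6 * np.eye(3), y)  # GP posterior weights
x_star = np.array([0.9, 0.1])
k_star = np.array([relu_nngp_kernel(x_star, a) for a in X])
pred = k_star @ alpha  # posterior mean prediction at x_star
```

This also hints at why such models can work on small datasets: the posterior is exact and regularized by the kernel, with no optimization to overfit.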
A number of talks focused on performing Bayesian computation even in the face of model misspecification, model collapse, and increased variance. One talk attempted to improve vanilla Optimization Monte Carlo (OMC), resulting in a new method named Robust OMC (ROMC). Original OMC can fail when the likelihood is flat; the new approach favors conditioning on summary statistics rather than using a single point to represent a region where the likelihood is nearly constant. Because the weights are unstable by default, ROMC provides a way of sampling that prevents model collapse, fixing the weights by stabilizing the underlying matrices; robustness is achieved by using a variable to switch off faulty weights in a scheme similar to dropout. Another talk focused on formulating a robust estimate of the likelihood using a pseudo-likelihood based on maximum mean discrepancy (MMD), which is resilient to issues that may arise from model misspecification.
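To make the MMD idea concrete: the quantity underlying such a pseudo-likelihood is an estimate of the discrepancy between observed and simulated samples in a kernel feature space. A minimal sketch of the standard unbiased MMD² estimator (the Gaussian kernel and bandwidth here are my choices for illustration, not necessarily the talk's):

```python
import numpy as np

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimator of squared maximum mean discrepancy (MMD^2)
    between sample sets x and y, using a Gaussian RBF kernel."""
    def k(a, b):
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
    n, m = len(x), len(y)
    # Drop diagonal terms so the within-sample averages are unbiased.
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * kxy.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(200, 1)), rng.normal(size=(200, 1)))
diff = mmd2_unbiased(rng.normal(size=(200, 1)),
                     rng.normal(3.0, 1.0, size=(200, 1)))
# `same` hovers near zero; `diff` grows as the simulated data drifts
# away from the observed data.
```

Because the estimator compares whole samples rather than pointwise likelihood values, it degrades gracefully when the model family cannot match the data exactly, which is the resilience to misspecification mentioned above.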
Another talk provided a way to reduce the cost of Bayesian computation through clever parallelism. Sample efficiency here is measured by the discrepancy between observed and simulated data, which necessitates a principled sequential Bayesian experimental design to select optimal simulation locations that maximize sample efficiency. The work allows several experiments to run at once to choose these locations, relying on batch simulation to reduce the time for Bayesian inference. Another way of enhancing the robustness of models is to introduce sparsity into the approximation of Gaussian processes by using inducing points. These inducing points lead to a more scalable algorithm, as no neural network or data augmentation is required. Analogous to pre-training and transfer learning in the neural network field, there are attempts to replicate the same feat in the Bayesian world; for example, one work created probabilistic maps for robotics by incrementally updating the model and finding the correspondence between model and data as a form of transfer learning.
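The inducing-point idea can be sketched in a few lines. Rather than inverting the full n-by-n kernel matrix at O(n³) cost, a small set of m inducing points summarizes the dataset and the cost drops to O(nm²). Below is a minimal subset-of-regressors approximation (one of the simplest inducing-point schemes, chosen here for illustration; the talks likely used more sophisticated variational versions):

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """Gaussian RBF kernel matrix between row-vector sets a and b."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * ls ** 2))

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

# 15 inducing points stand in for the full 500-point dataset.
Z = np.linspace(-3, 3, 15)[:, None]
Kmm = rbf(Z, Z) + 1e-6 * np.eye(15)   # jitter for numerical stability
Kmn = rbf(Z, X)
noise = 0.1 ** 2

# Subset-of-regressors posterior mean: only an m-by-m system is solved.
A = noise * Kmm + Kmn @ Kmn.T
x_star = np.array([[0.5]])
k_star = rbf(Z, x_star)
mean = (k_star.T @ np.linalg.solve(A, Kmn @ y)).item()  # approx. GP mean at 0.5
```

No neural network or data augmentation appears anywhere: sparsity comes purely from compressing the kernel through the inducing set Z.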
There was a paper on a variant of the Kalman filter that uses two passes instead of the single pass of the traditional Kalman filter. Many two-pass algorithms need to reproduce noise in the backward pass; this work does so using a Brownian tree, which the authors claim better captures the dynamics of the system. There were also talks on performing backpropagation through time (BPTT) with an adaptive choice of the truncation length k, together with theoretical guarantees that the method can learn even under concept drift.
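For readers unfamiliar with the two-pass structure, the classical baseline is the Rauch-Tung-Striebel smoother: a forward Kalman filtering pass followed by a backward pass that revisits each step with information from the whole sequence. A minimal sketch for a scalar random walk (this is the textbook smoother, not the Brownian-tree variant from the talk):

```python
import numpy as np

def kalman_rts(y, q=0.01, r=1.0):
    """Two-pass smoothing for a scalar random walk x_t = x_{t-1} + noise.

    Pass 1 (forward): standard Kalman filter with process noise q and
    observation noise r. Pass 2 (backward): Rauch-Tung-Striebel smoother.
    Returns (filtered means, smoothed means).
    """
    n = len(y)
    m, p = np.zeros(n), np.zeros(n)    # filtered means / variances
    mp, pp = np.zeros(n), np.zeros(n)  # one-step predicted means / variances
    m_prev, p_prev = 0.0, 10.0         # diffuse initial state
    for t in range(n):
        mp[t], pp[t] = m_prev, p_prev + q        # predict
        k = pp[t] / (pp[t] + r)                  # Kalman gain
        m[t] = mp[t] + k * (y[t] - mp[t])        # update with observation
        p[t] = (1 - k) * pp[t]
        m_prev, p_prev = m[t], p[t]
    ms = m.copy()
    for t in range(n - 2, -1, -1):               # backward pass
        g = p[t] / pp[t + 1]                     # smoother gain
        ms[t] = m[t] + g * (ms[t + 1] - mp[t + 1])
    return m, ms

rng = np.random.default_rng(3)
y = 1.0 + 0.5 * rng.normal(size=200)  # noisy observations of a constant signal
filtered, smoothed = kalman_rts(y, q=0.01, r=0.25)
```

The backward pass is where two-pass methods must reintroduce noise consistently when samples (rather than just means) are required, which is the role the Brownian tree plays in the paper's variant.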
The best talk at the Symposium, for me, was the one on normalizing flows for progressive image rendering, which provided a principled way to decompress images at multiple scales with varying quality. The details seemed opaque, but I intend to read more about the work; it has clear commercial applications.
There was a talk connecting reinforcement learning with information theory. Current reinforcement learning algorithms have known issues that include:
The recurring themes of the workshop were:
Finally, it is fair to say that our perception of reality can be relativistic. This is therefore not an official summary of the Symposium, but my recollection of how the events unfolded. The entire proceedings can be found here.