<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

 <title>The ICLR Blog Track</title>
 <link href="https://iclr.iro.umontreal.ca/c5a43cf7-5161-4ec9-8dd6-1febae95bd96_1639782823/atom.xml" rel="self"/>
 <link href="https://iclr.iro.umontreal.ca/c5a43cf7-5161-4ec9-8dd6-1febae95bd96_1639782823/"/>
 <updated>2021-12-17T17:13:45-06:00</updated>
 <id>https://iclr.iro.umontreal.ca/c5a43cf7-5161-4ec9-8dd6-1febae95bd96_1639782823</id>
 <author>
   <name>Mark Otto</name>
   <email>markdotto@gmail.com</email>
 </author>

 
 <entry>
   <title>Incorporating Bayesian approaches in Deep Learning Research</title>
   <link href="https://iclr.iro.umontreal.ca/c5a43cf7-5161-4ec9-8dd6-1febae95bd96_1639782823/2021/12/01/Incorporating-Bayesian-approaches-in-Deep-Learning-Research/"/>
   <updated>2021-12-01T00:00:00-06:00</updated>
   <id>https://iclr.iro.umontreal.ca/c5a43cf7-5161-4ec9-8dd6-1febae95bd96_1639782823/2021/12/01/Incorporating-Bayesian-approaches-in-Deep-Learning-Research</id>
   <content type="html">&lt;p&gt;This post takes the form of a short survey that summarizes the proceedings of the &lt;a href=&quot;http://approximateinference.org/&quot;&gt;Symposium on Approximate Bayesian Inference&lt;/a&gt;, held during &lt;a href=&quot;https://nips.cc/&quot;&gt;NeurIPS&lt;/a&gt; 2019 in Vancouver. I focused on this Bayesian workshop because it emphasized fundamentals rather than bleeding-edge results, and progress tends to come from understanding how things work at a foundational level. Conferences have traditionally served as the de facto venue for disseminating the latest knowledge in the field, and it is customary for them to accept work that pushes the “state of the art”. Every work that claims to be “state of the art” must answer one question without ambiguity: are the improvements in the performance metrics due to the novelty of the method, to the preprocessing steps, or to random effects and sheer luck?
I attended every session of the Symposium to calibrate my understanding of Bayesian statistics and to engage with researchers during the poster sessions. Fortunately, the contents of some talks were already familiar to me. However, the Bayesian world uses a plethora of jargon that can make simple concepts look convoluted.&lt;/p&gt;
&lt;h3 id=&quot;bayesian-world-meets-the-realities-of-deep-learning&quot;&gt;Bayesian world meets the realities of Deep learning&lt;/h3&gt;
&lt;p&gt;I observed a growing effort to unify the Bayesian world with the neural network world. One of the reasons is that it is easier to perform uncertainty quantification when your model has some form of Gaussian process (GP). A few talks tried to draw this connection, and one of the clearest attempts was the Neural Tangents talk. The premise of the work hinges on one question: can GPs be used as a building block for Bayesian deep learning? Neural Tangents is an easy-to-use library for creating finite-width and infinite-width neural networks based on Bayesian modeling. It provides a way to analyze the training dynamics of a neural network, and thanks to its Bayesian origins it can learn from small datasets. This was the first time I had heard the term “infinite-width neural network”, and the details were not fully clear to me. Later, I found a description in a &lt;a href=&quot;https://openreview.net/pdf?id=SkGT6sRcFX&quot;&gt;paper&lt;/a&gt; published at ICLR 2019. Surprisingly, I also found a widespread misunderstanding of “noise” in the workshop: different speakers used the word to mean variance, bias, overfitting, or underfitting. There is a need for the field to unify its conventions. I can live with having one more acronym to memorize. With that said, let us discuss the main themes of the workshop. The core of the Symposium was the following topics:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Robustness&lt;/li&gt;
  &lt;li&gt;A better understanding of generalization.&lt;/li&gt;
  &lt;li&gt;The difficulty of quantifying mutual information.&lt;/li&gt;
  &lt;li&gt;Efficient computation.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;robustness-generalization-information-theory-and-computation&quot;&gt;Robustness, Generalization, Information Theory, and Computation&lt;/h3&gt;
&lt;p&gt;A number of the talks focused on performing Bayesian computation even in the face of model misspecification, model collapse, and increased variance. One talk attempted to improve vanilla optimization Monte Carlo (OMC), resulting in a new method named &lt;a href=&quot;https://arxiv.org/pdf/1904.00670.pdf&quot;&gt;Robust OMC&lt;/a&gt; (ROMC). The original OMC can fail when the likelihood is flat. The approach favors conditioning on summary statistics rather than using a single point to represent an area where the likelihood is nearly constant. Weights are unstable by default; ROMC provides a way of sampling while preventing model collapse by fixing the weights through stabilization of the matrices. Robustness is achieved by using a variable to switch off faulty weights, in a scheme similar to dropout. Another talk focused on formulating a robust estimate of the likelihood by using a &lt;a href=&quot;https://arxiv.org/abs/1909.13339&quot;&gt;pseudo-likelihood&lt;/a&gt; based on maximum mean discrepancy, which is resilient to issues that arise from misspecification of the model.&lt;/p&gt;
&lt;p&gt;One talk provided a way to reduce the cost of Bayesian computation through clever parallelism. Sample efficiency is assessed here through the discrepancy between observed and simulated data. This calls for a principled sequential Bayesian experimental design that selects simulation locations to maximize sample efficiency. The work allows several of these locations to be chosen at once, and it relies on &lt;a href=&quot;https://arxiv.org/abs/1910.06121&quot;&gt;batch simulation&lt;/a&gt; to reduce the time needed for Bayesian inference. Another way of enhancing the robustness of models is to introduce sparsity in the approximation of Gaussian processes by using &lt;a href=&quot;https://arxiv.org/abs/1910.10596&quot;&gt;inducing points&lt;/a&gt;. These inducing points lead to a more scalable algorithm, as no neural network or data augmentation is required. Analogous to pre-training and transfer learning in the neural network field, there are attempts to replicate the same feat in the Bayesian world. One example is the work on creating a &lt;a href=&quot;https://openreview.net/pdf?id=BJgnty2NYr&quot;&gt;probabilistic map&lt;/a&gt; for robotics, which incrementally updates the model by finding the correspondence between model and data as a form of transfer learning.&lt;/p&gt;
&lt;p&gt;There was a paper on a variant of the &lt;a href=&quot;https://openreview.net/forum?id=HkxNKk2VKS&quot;&gt;Kalman filter&lt;/a&gt; that uses two passes instead of the single pass of the traditional Kalman filter. Many two-pass algorithms reproduce noise in the backward pass; this one does so by using a Brownian tree, which the author claims is better able to capture the dynamics of the system. There were also talks on backpropagation through time (BPTT) in which the choice of k for backpropagation is adaptive, with theoretical guarantees that the method can learn even under concept drift.&lt;/p&gt;
&lt;p&gt;The best talk for me at the Symposium was on normalizing flows for progressive image rendering, which provided a principled way to decompress images at multiple scales with varying quality. The details seemed opaque, but I intend to read more about the &lt;a href=&quot;https://arxiv.org/pdf/1905.07376.pdf&quot;&gt;work&lt;/a&gt;. It has clear commercial applications.&lt;/p&gt;
&lt;p&gt;One talk connected reinforcement learning with information theory. Known issues with current reinforcement learning algorithms include:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Slight perturbations can lead to two different solutions.&lt;/li&gt;
  &lt;li&gt;The need for a detailed reward scheme&lt;/li&gt;
  &lt;li&gt;Long training times&lt;/li&gt;
  &lt;li&gt;A lack of diversity-seeking exploration of the reward function&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The talk proposed a distribution-matching formulation of reinforcement learning that depends on maximizing entropy over the distribution. The idea is to track different possibilities from a state before committing probability mass to any one area. Choosing a policy purely to maximize Q-values can inadvertently get stuck and lead to suboptimal solutions. Maximum-entropy RL instead keeps track of all the states and does not commit probability mass until it is sure that the choice is optimal. It essentially matches the distribution of states instead of rewards. Optimal actions lead to an optimal future, so inference can be understood as asking which action was taken, given that the future was optimal. More information can be found in this &lt;a href=&quot;https://arxiv.org/pdf/1805.00909.pdf&quot;&gt;work&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;words-of-wisdom&quot;&gt;Words of Wisdom&lt;/h3&gt;
&lt;p&gt;The recurring themes in the workshop are:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;MCMC wins on bias; variational inference (VI) wins on variance and amortization. It can be worthwhile to run MCMC without waiting for convergence, as good results may be obtained even before convergence.&lt;/li&gt;
  &lt;li&gt;VI gives quicker convergence, but the results may not be very good. VB (Variational Bayes) fails to capture heteroscedastic noise and instead fits the data with homoscedastic noise.&lt;/li&gt;
  &lt;li&gt;Reparametrization tricks have widely diverse applications and appeared in virtually every poster. I think this is probably the most useful technique in the Bayesian literature, as it allows differentiating through a stochastic sampling process, which makes it easier to perform gradient descent for optimization.&lt;/li&gt;
  &lt;li&gt;A Stratonovich SDE (stochastic differential equation) can be computationally cheaper.&lt;/li&gt;
  &lt;li&gt;The adjoint sensitivity method is a cheaper way of computing gradients through an ODE solution and can be combined with reverse-mode auto-diff for time-efficient, constant-memory computation.&lt;/li&gt;
  &lt;li&gt;A mixture of Gaussian processes is highly non-Gaussian.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;Finally, it is fair to say that our perception of reality is relative. Hence, this is not an official summary of the Symposium but my recollection of how events unfolded. The entire &lt;a href=&quot;https://openreview.net/group?id=approximateinference.org/AABI/2019/Symposium&quot;&gt;proceedings&lt;/a&gt; are available online.&lt;/p&gt;
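&lt;p&gt;To make the reparametrization trick from the recurring themes more concrete, here is a minimal sketch. It is an illustration only, not code from any of the talks or posters. It assumes a Gaussian sample z with mean mu and standard deviation sigma, written as z = mu + sigma * eps with eps drawn from a standard normal, and a toy objective f(z) = z**2. The function name &lt;code&gt;reparam_gradients&lt;/code&gt; and its parameters are hypothetical.&lt;/p&gt;

```python
import random


def reparam_gradients(mu, sigma, n_samples=100000, seed=0):
    """Monte Carlo gradients of E[z**2] for z ~ N(mu, sigma**2).

    Illustrative assumption: the objective f(z) = z**2 is a toy choice.
    """
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # noise that does not depend on mu or sigma
        z = mu + sigma * eps        # reparametrized sample
        df_dz = 2.0 * z             # derivative of f(z) = z**2 with respect to z
        grad_mu += df_dz            # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples


g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.5)
print(g_mu, g_sigma)
```

&lt;p&gt;Because the randomness is isolated in eps, the sample z becomes a deterministic, differentiable function of mu and sigma, so the chain rule applies to each draw. For this toy objective the exact gradients are 2 * mu and 2 * sigma, so the two estimates should land close to 3.0 and 1.0.&lt;/p&gt;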
</content>
 </entry>
 

</feed>
