<!DOCTYPE html>
<html lang="en-us">

  <head>
  <link href="http://gmpg.org/xfn/11" rel="profile">
  <meta http-equiv="content-type" content="text/html; charset=utf-8">

  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1">

  <title>
    
      Normalization is dead, long live normalization! &middot; The ICLR Blog Track
    
  </title>

  
  <link rel="canonical" href="https://iclr.iro.umontreal.ca/ae0d1532-9f35-4cf0-b83b-816a5725ebed_1642185414/2021/12/01/unnormalized-resnets/">
  

  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/ae0d1532-9f35-4cf0-b83b-816a5725ebed_1642185414/public/css/poole.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/ae0d1532-9f35-4cf0-b83b-816a5725ebed_1642185414/public/css/syntax.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/ae0d1532-9f35-4cf0-b83b-816a5725ebed_1642185414/public/css/lanyon.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/ae0d1532-9f35-4cf0-b83b-816a5725ebed_1642185414/public/css/custom.css">
  <link rel="stylesheet" href="https://fonts.googleapis.com/css?family=PT+Serif:400,400italic,700%7CPT+Sans:400">

  <link rel="apple-touch-icon-precomposed" sizes="144x144" href="https://iclr.iro.umontreal.ca/ae0d1532-9f35-4cf0-b83b-816a5725ebed_1642185414/public/apple-touch-icon-precomposed.png">
  <link rel="shortcut icon" href="https://iclr.iro.umontreal.ca/ae0d1532-9f35-4cf0-b83b-816a5725ebed_1642185414/public/favicon.ico">

  <link rel="alternate" type="application/rss+xml" title="RSS" href="https://iclr.iro.umontreal.ca/ae0d1532-9f35-4cf0-b83b-816a5725ebed_1642185414/atom.xml">

  

  <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript" ></script>
 <!-- <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { equationNumbers: { autoNumber: "AMS" } } }); </script> -->
  <script type="text/x-mathjax-config">
      MathJax.Hub.Config({
        tex2jax: { inlineMath: [ ['$','$'], ["\\(","\\)"] ],
         processEscapes: false
        }
      });
</script>
</head>


  <body>

    <!-- Target for toggling the sidebar `.sidebar-checkbox` is for regular
     styles, `#sidebar-checkbox` for behavior. -->
<input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox">
<!-- <input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox" > -->

<!-- Toggleable sidebar -->
<div class="sidebar" id="sidebar">
  <div class="sidebar-item">
    <p>For short-term, peer-sourced tests of time, generalizations, specializations, reproductions, etc.!</p>
  </div>

  <nav class="sidebar-nav">

    

    
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/ae0d1532-9f35-4cf0-b83b-816a5725ebed_1642185414/">ICLR 2022 Blog Track</a>
        
      
    
      
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/ae0d1532-9f35-4cf0-b83b-816a5725ebed_1642185414/about/">About</a>
        
      
    
      
    
      
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/ae0d1532-9f35-4cf0-b83b-816a5725ebed_1642185414/submitting/">Submitting</a>
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/ae0d1532-9f35-4cf0-b83b-816a5725ebed_1642185414/tags/">Tags</a>
        
      
    

    <a class="sidebar-nav-item" href="https://github.com/iclr-blog-track/iclr-blog-track.github.io">GitHub project</a>
    <span class="sidebar-nav-item">Currently vICLR Spring 2021</span>
  </nav>

  <div class="sidebar-item">
    <p>
      &copy; 2022. All rights reserved.
    </p>
  </div>
</div>


    <!-- Wrap is the content to shift when toggling the sidebar. We wrap the
         content to avoid any CSS collisions with our real content. -->
    <div class="wrap">
      <div class="masthead">
        <div class="container">
          <h3 class="masthead-title">
            <a href="/" title="Home">The ICLR Blog Track</a>
            <small></small>
          </h3>
        </div>
      </div>

      <div class="container content">
        <div class="post">
  <h1 id="iclr-post-title" class="post-title">Normalization is dead, long live normalization!</h1>
  <span class="post-date">01 Dec 2021 | 
    <a class="content-tag" href="/tags/#normalization"> normalization </a>
  
    <a class="content-tag" href="/tags/#initialization"> initialization </a>
  
    <a class="content-tag" href="/tags/#propagation"> propagation </a>
  
    <a class="content-tag" href="/tags/#skip-connections"> skip connections </a>
  
    <a class="content-tag" href="/tags/#residual-networks"> residual networks </a>
  </span>

  <span id="iclr-post-authors" class="post-date">Anonymous;</span>
  <style>
    figcaption { color: gray; }
</style>

<p>Since the advent of Batch Normalization (BN), almost every state-of-the-art (SOTA) method uses some form of normalization.
After all, normalization generally speeds up learning and leads to models that generalise better than their unnormalized counterparts.
This turns out to be especially useful when using some form of skip connections, which are prominent in Residual Networks (ResNets), for example.
However, <a href="#brock21characterizing">Brock et al. (2021a)</a> suggest that SOTA performance can also be achieved using <strong>ResNets without normalization</strong>!</p>

<p>The fact that Brock et al. went out of their way to get rid of something as simple as BN in ResNets, for which BN happens to be especially helpful, does raise a few questions:</p>

<ol>
  <li>Why get rid of BN in the first place<a href="#alternatives">?</a></li>
  <li>How (easy is it) to get rid of BN in ResNets<a href="#moment-control">?</a></li>
  <li>Is BN going to become obsolete in the near future<a href="#limitations">?</a></li>
  <li>Does this allow us to gain insights into why BN works so well<a href="#insights">?</a></li>
  <li>Wait a second… Are they getting rid of normalization or just BN<a href="#conclusion">?</a></li>
</ol>

<p>The goal of this blog post is to provide some insights w.r.t. these questions using the results from <a href="#brock21characterizing">Brock et al. (2021a)</a>.</p>

<h2 id="contents">Contents</h2>

<ul>
  <li><a href="#normalization">Normalization</a>
    <ul>
      <li><a href="#origins">Origins</a></li>
      <li><a href="#batch-normalization">Batch Normalization</a></li>
      <li><a href="#alternatives">Alternatives</a></li>
    </ul>
  </li>
  <li><a href="#skip-connections">Skip Connections</a>
    <ul>
      <li><a href="#history">History</a></li>
      <li><a href="#moment-control">Moment Control</a></li>
    </ul>
  </li>
  <li><a href="#normalizer-free-resnets">Normalizer-Free ResNets</a>
    <ul>
      <li><a href="#old-ideas">Old Ideas</a></li>
      <li><a href="#imitating-signal-propagation">Imitating Signal Propagation</a></li>
      <li><a href="#performance">Performance</a></li>
    </ul>
  </li>
  <li><a href="#discussion">Discussion</a>
    <ul>
      <li><a href="#limitations">Limitations</a></li>
      <li><a href="#insights">Insights</a></li>
      <li><a href="#conclusion">Conclusion</a></li>
    </ul>
  </li>
  <li><a href="#extra-code-snippets">Extra Code Snippets</a></li>
  <li><a href="#references">References</a></li>
</ul>

<h2 id="normalization">Normalization</h2>

<p>To set the scene for a world without normalization, we start with an overview of normalization layers in neural networks.
Batch Normalization is probably the most well-known method, but there are plenty of alternatives.
Despite the variety of normalization methods, they all build on the same basic principles.</p>

<h3 id="origins">Origins</h3>

<p>The design of modern normalization layers in neural networks is mainly inspired by data normalization (<a href="#lecun98efficient">Lecun et al., 1998</a>; <a href="#schraudolph98centering">Schraudolph, 1998</a>; <a href="#ioffe15batchnorm">Ioffe &amp; Szegedy, 2015</a>).
In the setting of a simple linear regression, it can be shown (see e.g., <a href="#lecun98efficient">Lecun et al., 1998</a>) that the second-order derivative, i.e., the Hessian, of the objective is exactly the (uncentered) covariance of the input data, $\mathcal{D}$:</p>

\[\frac{1}{|\mathcal{D}|} \sum_{(\boldsymbol{x}, y) \in \mathcal{D}} \nabla_{\boldsymbol{w}}^2 \frac{1}{2}(\boldsymbol{w}^\mathsf{T} \boldsymbol{x} - y)^2 = \frac{1}{|\mathcal{D}|}  \sum_{(\boldsymbol{x}, y) \in \mathcal{D}}\boldsymbol{x} \boldsymbol{x}^\mathsf{T}.\]

<p>If the Hessian of an optimization problem is (close to) the identity, it becomes much easier to find a solution (<a href="#lecun98efficient">Lecun et al., 1998</a>).
Therefore, learning should become easier if the input data is whitened — i.e., is transformed to have an identity covariance matrix.
However, full whitening of the data is often costly and might even degrade generalization performance (<a href="#wadia21whitening">Wadia et al., 2021</a>).
Instead, the data is <em>normalized</em> to have zero mean and unit variance to get at least some of the benefits of an identity Hessian.</p>
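<p>To make this argument a bit more tangible, the following sketch builds the Hessian from the formula above for two made-up input features and compares its condition number before and after normalization.
The data, the feature scales and the helper names are assumptions for illustration only; plain Python is used to keep the example free of dependencies.</p>

```python
import random
import statistics

def second_moment_matrix(f1, f2):
    """Entries (a, b, d) of the Hessian [[a, b], [b, d]] for inputs x = (f1, f2)."""
    n = len(f1)
    a = sum(u * u for u in f1) / n
    b = sum(u * v for u, v in zip(f1, f2)) / n
    d = sum(v * v for v in f2) / n
    return a, b, d

def condition_number(a, b, d):
    """Ratio of the largest to the smallest eigenvalue of [[a, b], [b, d]]."""
    centre = (a + d) / 2
    radius = (((a - d) / 2) ** 2 + b ** 2) ** 0.5
    return (centre + radius) / (centre - radius)

def standardize(values):
    """Shift and scale a feature to zero mean and unit variance."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

rng = random.Random(0)
# two independent features with very different offsets and scales (made up)
x1 = [rng.gauss(5.0, 1.0) for _ in range(1000)]
x2 = [rng.gauss(0.0, 0.1) for _ in range(1000)]

raw = condition_number(*second_moment_matrix(x1, x2))
normalized = condition_number(*second_moment_matrix(standardize(x1), standardize(x2)))
print(f"condition number before: {raw:.1f}, after: {normalized:.3f}")
```

<p>After normalization, the diagonal of the Hessian is exactly one and only the correlation between the features keeps it from being the identity.</p>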

<p>When considering multi-layer networks, the expectation would be that things get more complicated.
However, it turns out that the benefits of normalizing the input data for linear regression directly carry over to the individual layers of a multi-layer network (<a href="#lecun98efficient">Lecun et al., 1998</a>).
Therefore, simply normalizing the inputs to a layer — i.e., the outputs from the previous layer — should also help to speed up the optimization of the weights in that layer.
Using these insights, <a href="#schraudolph98centering">Schraudolph (1998)</a> showed empirically that centering the activations effectively speeds up learning.</p>

<p>Initialization strategies also commonly build on these principles (e.g., <a href="#lecun98efficient">Lecun et al., 1998</a>; <a href="#glorot10understanding">Glorot &amp; Bengio, 2010</a>; <a href="#he15delving">He et al., 2015</a>).
Since the initial parameters of a layer are independent of the inputs, they can easily be tuned.
When tuned correctly, it can be ensured that the (pre-)activations of each layer are normalized throughout the network before the first update.
However, as soon as the network is being updated, the distributions change and the normalizing properties of the initialization get lost (<a href="#ioffe15batchnorm">Ioffe &amp; Szegedy, 2015</a>).</p>
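<p>As a rough illustration of what "tuned correctly" can mean, the sketch below pushes a random input through a stack of randomly initialized ReLU layers and reports the mean squared pre-activation at the end.
The function name, the width and the depth are hypothetical; the weights are drawn from $\mathcal{N}(0, \text{gain} / \text{width})$, where a gain of two corresponds to the initialization of <a href="#he15delving">He et al. (2015)</a>.</p>

```python
import random

def preactivation_scale(depth, width, gain, seed=0):
    """Mean squared pre-activation after `depth` random ReLU layers.

    Weights are drawn from N(0, gain / width); no training is involved.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(width)]
    std = (gain / width) ** 0.5
    for _ in range(depth):
        h = [max(v, 0.0) for v in z]  # ReLU
        z = [sum(rng.gauss(0.0, std) * v for v in h) for _ in range(width)]
    return sum(v * v for v in z) / width

naive_scale = preactivation_scale(depth=5, width=256, gain=1.0)
he_scale = preactivation_scale(depth=5, width=256, gain=2.0)
print("gain 1:", naive_scale)  # shrinks by roughly a factor two per layer
print("gain 2:", he_scale)     # stays in the order of the input scale
```

<p>Note that this only says something about the network at initialization, which is exactly the limitation discussed above.</p>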

<h3 id="batch-normalization">Batch Normalization</h3>

<p>In contrast to classical initialization methods, Batch Normalization (BN) is able to maintain fixed mean and variance of the activations as the network is being updated (<a href="#ioffe15batchnorm">Ioffe &amp; Szegedy, 2015</a>).
Concretely, this is achieved by applying a typical data normalization to every mini-batch of data, $\mathcal{B}$:</p>

\[\hat{\boldsymbol{x}} = \frac{\boldsymbol{x} - \boldsymbol{\mu}_\mathcal{B}}{\boldsymbol{\sigma}_\mathcal{B}}.\]

<p>Here $\boldsymbol{\mu}_\mathcal{B} = \frac{1}{|\mathcal{B}|} \sum_{\boldsymbol{x} \in \mathcal{B}} \boldsymbol{x}$ is the mean over the inputs in the mini-batch and $\boldsymbol{\sigma}_\mathcal{B}$ is the corresponding standard deviation.
Also, note that the division is element-wise and generally is numerically stabilized by some $\varepsilon$ when implemented.
In case a zero mean and unit variance is not desired, it is also possible to apply an affine transformation $\boldsymbol{y} = \boldsymbol{\gamma} \odot \hat{\boldsymbol{x}} + \boldsymbol{\beta}$ with learnable scale ($\boldsymbol{\gamma}$) and shift ($\boldsymbol{\beta}$) parameters (<a href="#ioffe15batchnorm">Ioffe &amp; Szegedy, 2015</a>).
Putting these formulas together in (<a href="https://pytorch.org">PyTorch</a>) code, BN can be summarised as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">torch</span>

<span class="k">def</span> <span class="nf">batch_normalize</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">gamma</span><span class="o">=</span><span class="mf">1.</span><span class="p">,</span> <span class="n">beta</span><span class="o">=</span><span class="mf">0.</span><span class="p">,</span> <span class="n">eps</span><span class="o">=</span><span class="mf">1e-5</span><span class="p">):</span>
    <span class="c1"># per-channel statistics over the batch and spatial dims of an (N, C, H, W) tensor
</span>    <span class="n">mu</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">2</span><span class="p">),</span> <span class="n">keepdim</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">var</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">var</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">2</span><span class="p">),</span> <span class="n">unbiased</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">keepdim</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">x_hat</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">mu</span><span class="p">)</span> <span class="o">/</span> <span class="n">torch</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">var</span> <span class="o">+</span> <span class="n">eps</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">gamma</span> <span class="o">*</span> <span class="n">x_hat</span> <span class="o">+</span> <span class="n">beta</span>
</code></pre></div></div>

<p>The above description explains the core operation of BN during training.
However, during inference, it is not uncommon to desire predictions for single samples.
Obviously, this would cause trouble because a mini-batch with a single sample has zero variance.
Therefore, it is common to accumulate the statistics that are used for normalization ( $\boldsymbol{\mu}_\mathcal{B}$ and $\boldsymbol{\sigma}_\mathcal{B}^2$ ) over multiple mini-batches during training.
These accumulated statistics can then be used as estimators for the mean and variance during inference.
This makes it possible for BN to be used on single samples during inference.</p>
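<p>The accumulation of statistics could look something like the sketch below, which normalizes a single feature.
The class name and the use of an exponential moving average with a <code>momentum</code> parameter are assumptions; other ways of accumulating the statistics are equally possible.</p>

```python
import random

class RunningBatchNorm:
    """Batch normalization of a single feature with running statistics (sketch)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0

    def __call__(self, batch, training=True):
        if training:
            mean = sum(batch) / len(batch)
            var = sum((v - mean) ** 2 for v in batch) / len(batch)
            # exponential moving averages of the mini-batch statistics
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            # inference: fall back to the accumulated statistics
            mean, var = self.running_mean, self.running_var
        return [(v - mean) / (var + self.eps) ** 0.5 for v in batch]

rng = random.Random(0)
bn = RunningBatchNorm()
for _ in range(200):  # "training" on made-up data with mean 3 and variance 4
    bn([rng.gauss(3.0, 2.0) for _ in range(64)])

single = bn([5.0], training=False)  # a single sample can now be normalized
print(bn.running_mean, bn.running_var, single)
```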

<p>The original reason for introducing BN was to alleviate the so-called <em>internal covariate shift</em>, i.e. the change of distributions as the network updates.
More recent research has pointed out, however, that internal covariate shift does not necessarily deteriorate learning dynamics (<a href="#santurkar18how">Santurkar et al., 2018</a>).
Apparently, <a href="#ioffe15batchnorm">Ioffe &amp; Szegedy (2015)</a> also realized that simply normalizing the signal does not suffice to achieve good performance:</p>

<blockquote>
  <p>[…] the model blows up when the normalization parameters are computed outside the gradient descent step.</p>
</blockquote>

<p>All of this seems to indicate that part of the success of BN is due to the effects it has on the gradient signal.
 The affine transformation in BN simply scales the gradient, such that $\nabla_{\hat{\boldsymbol{x}}} \mathcal{L} = \boldsymbol{\gamma} \odot \nabla_{\boldsymbol{y}} \mathcal{L}.$
 The normalization operation, on the other hand, transforms the gradient, $\boldsymbol{g} = \nabla_{\hat{\boldsymbol{x}}} \mathcal{L}$, as follows:</p>

\[\nabla_{\boldsymbol{x}} \mathcal{L} = \frac{1}{\boldsymbol{\sigma}_\mathcal{B}} \big(\boldsymbol{g} - \mu_g \,\boldsymbol{1} - \operatorname{cov}(\boldsymbol{g}, \hat{\boldsymbol{x}}) \odot \hat{\boldsymbol{x}} \big),\]

<p>where $\mu_g = \frac{1}{|\mathcal{B}|} \sum_{\boldsymbol{x} \in \mathcal{B}} \nabla_{\hat{\boldsymbol{x}}} \mathcal{L}$ and $\operatorname{cov}(\boldsymbol{g}, \hat{\boldsymbol{x}}) = \frac{1}{|\mathcal{B} |} \sum_{\boldsymbol{x} \in \mathcal{B}} \boldsymbol{g} \odot \hat{\boldsymbol{x}}.$
Note that this directly corresponds to centering the gradients, which is also supposed to improve learning speed (<a href="#schraudolph98centering">Schraudolph, 1998</a>).</p>
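<p>For readers who prefer to verify such formulas numerically, the sketch below compares the expression above with central finite differences for a single feature, taking $\mu_g$ to be the mini-batch mean of $\boldsymbol{g}$.
All function names are hypothetical and the loss is an arbitrary linear function of $\hat{\boldsymbol{x}}$, chosen so that $\boldsymbol{g}$ is known in advance.</p>

```python
import random

def normalize(x, eps=1e-5):
    """Normalize one feature over the mini-batch; also return the standard deviation."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    sigma = (var + eps) ** 0.5
    return [(v - mu) / sigma for v in x], sigma

def normalize_backward(x, g, eps=1e-5):
    """Gradient w.r.t. x, given g = dL/dx_hat, following the formula in the text."""
    x_hat, sigma = normalize(x, eps)
    n = len(x)
    mu_g = sum(g) / n
    cov = sum(gi * xi for gi, xi in zip(g, x_hat)) / n
    return [(gi - mu_g - cov * xi) / sigma for gi, xi in zip(g, x_hat)]

def numerical_gradient(x, g, h=1e-6):
    """Central differences of the linear loss L = sum_i g_i * x_hat_i."""
    def loss(values):
        return sum(gi * xi for gi, xi in zip(g, normalize(values)[0]))
    grads = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grads.append((loss(up) - loss(down)) / (2 * h))
    return grads

rng = random.Random(0)
x = [rng.gauss(1.0, 2.0) for _ in range(8)]
g = [rng.gauss(0.0, 1.0) for _ in range(8)]
analytic = normalize_backward(x, g)
numeric = numerical_gradient(x, g)
print("largest deviation:", max(abs(a - b) for a, b in zip(analytic, numeric)))
print("sum of gradients: ", sum(analytic))  # centred gradients sum to (almost) zero
```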

<p>In the end, everyone seems to agree that one of the main benefits of BN is that it enables higher learning rates (<a href="#ioffe15batchnorm">Ioffe &amp; Szegedy, 2015</a>; <a href="#bjorck18understanding">Bjorck et al., 2018</a>; <a href="#santurkar18how">Santurkar et al., 2018</a>; <a href="#luo19towards">Luo et al., 2019</a>), which results in faster learning and better generalization.
An additional benefit is that BN is scale-invariant and therefore much less sensitive to weight initialization (<a href="#ioffe15batchnorm">Ioffe &amp; Szegedy, 2015</a>; <a href="#ioffe17batchrenorm">Ioffe, 2017</a>).</p>

<h3 id="alternatives">Alternatives</h3>

<p>Why would we ever want to get rid of BN then?
Although BN provides important benefits, it also comes with a few downsides:</p>

<ul>
  <li>BN does not work well with <strong>small batch sizes</strong> (<a href="#ba16layernorm">Ba et al., 2016</a>; <a href="#salimans16weightnorm">Salimans &amp; Kingma, 2016</a>; <a href="#ioffe17batchrenorm">Ioffe, 2017</a>).
For a batch size of one, we have zero standard deviation, but also with a few samples, the estimated statistics are often not accurate enough.</li>
  <li>BN is not directly applicable to certain input types (<a href="#ba16layernorm">Ba et al., 2016</a>; also see Figure <a href="#fig_dims">1</a>) and performs poorly when there are <strong>dependencies between samples</strong> in a mini-batch (<a href="#ioffe17batchrenorm">Ioffe, 2017</a>).</li>
  <li>BN uses <strong>different statistics for inference</strong> than those used during training (<a href="#ba16layernorm">Ba et al., 2016</a>; <a href="#ioffe17batchrenorm">Ioffe, 2017</a>).
This is especially problematic if the distribution during inference is different or drifts away from the training distribution.</li>
  <li>BN does not play well with <strong>other regularization</strong> methods (<a href="#hoffer18norm">Hoffer et al., 2018</a>).
This is especially known for $\mathrm{L}_2$-regularization (<a href="#hoffer18norm">Hoffer et al., 2018</a>) and dropout (<a href="#li19understanding">Li et al., 2019</a>).</li>
  <li>BN introduces a significant <strong>computational overhead</strong> during training (<a href="#ba16layernorm">Ba et al., 2016</a>; <a href="#salimans16weightnorm">Salimans &amp; Kingma, 2016</a>; <a href="#gitman17comparison">Gitman &amp; Ginsburg, 2017</a>).
Memory requirements also increase when introducing BN because of the running averages.</li>
</ul>

<p>Therefore, alternative normalization methods have been proposed to solve one or more of the problems listed above while trying to maintain the benefits of BN.</p>

<figure id="fig_dims">
    <img src="https://iclr.iro.umontreal.ca/ae0d1532-9f35-4cf0-b83b-816a5725ebed_1642185414/public/images/2021-12-01-unnormalized-resnets/data_dimensions.svg" alt="visualization of different input data types" />
    <figcaption>
        Figure&nbsp;1: Different input types in terms of their typical 
        batch size ($|\mathcal{B}|$), the number of channels/features ($C$) and the <em>size</em> of the signal ($S$) (e.g. width times height for images).
        Image inspired by (<a href="#wu18groupnorm">Wu &amp; He, 2018</a>).
    </figcaption>
</figure>

<p>One family of alternatives simply computes the statistics along different dimensions (see Figure <a href="#fig_norm">2</a>).
<strong>Layer Normalization (LN)</strong> is probably the most prominent example in this category (<a href="#ba16layernorm">Ba et al., 2016</a>).
Instead of computing the statistics over samples in a mini-batch, LN uses the statistics of the feature vector itself.
This makes LN invariant to shifts of the weights and to the scaling of individual samples.
BN, on the other hand, is invariant to shifts of the data and to the scaling of individual neurons.
LN generally outperforms BN in fully connected and recurrent networks but does not work well for convolutional architectures according to <a href="#ba16layernorm">Ba et al. (2016)</a>.
<strong>Group Normalization (GN)</strong> is a slightly modified version of LN that also works well for convolutional networks (<a href="#wu18groupnorm">Wu &amp; He, 2018</a>).
The idea of GN is to compute statistics over groups of features in the feature vector instead of all features.
For convolutional networks that should be invariant to changes in contrast, statistics can also be computed over single image channels for each sample.
This gives rise to a technique known as <strong>Instance Normalization (IN)</strong>, which proved especially helpful in the context of style transfer (<a href="#ulyanov17improved">Ulyanov et al., 2017</a>).</p>

<figure id="fig_norm">
    <img src="https://iclr.iro.umontreal.ca/ae0d1532-9f35-4cf0-b83b-816a5725ebed_1642185414/public/images/2021-12-01-unnormalized-resnets/normalisation_dimensions.svg" alt="visualization of normalization methods" />
    <figcaption>
        Figure&nbsp;2: Normalization methods (Batch, Layer, Instance and Group Normalization) and the parts of the input they compute their statistics over.
        Different dimensions are visualized (and explained) in Figure&nbsp;<a href="#fig_dims">1</a>.
        The lightly shaded region for LN indicates the additional context that is typically used for image data.
        Image has been adapted from (<a href="#wu18groupnorm">Wu &amp; He, 2018</a>).
    </figcaption>
</figure>
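<p>The methods in Figure&nbsp;<a href="#fig_norm">2</a> differ only in which entries of the input share their statistics.
The following sketch makes this explicit for inputs stored as nested lists <code>x[n][c][s]</code> (sample, channel, position), without learnable parameters.
The helper <code>normalize_by_group</code> and its <code>group_of</code> argument are made up for this illustration; LN is implemented with the additional context that is typical for image data.</p>

```python
from collections import defaultdict

def normalize_by_group(x, group_of, eps=1e-5):
    """Normalize x[n][c][s]; entries with the same group_of(n, c) share statistics."""
    groups = defaultdict(list)
    for n, sample in enumerate(x):
        for c, channel in enumerate(sample):
            groups[group_of(n, c)].extend(channel)
    stats = {}
    for key, values in groups.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats[key] = (mean, (var + eps) ** 0.5)
    return [
        [[(v - stats[group_of(n, c)][0]) / stats[group_of(n, c)][1] for v in channel]
         for c, channel in enumerate(sample)]
        for n, sample in enumerate(x)
    ]

def batch_norm(x):
    return normalize_by_group(x, lambda n, c: c)        # per channel, across samples

def layer_norm(x):
    return normalize_by_group(x, lambda n, c: n)        # per sample, across channels

def instance_norm(x):
    return normalize_by_group(x, lambda n, c: (n, c))   # per sample and channel

def group_norm(x, num_groups):
    size = len(x[0]) // num_groups                      # channels per group
    return normalize_by_group(x, lambda n, c: (n, c // size))

# two samples, four channels, three positions (made-up values)
x = [[[float(n + c * s) for s in range(3)] for c in range(4)] for n in range(2)]
print(group_norm(x, num_groups=2)[0][0])
```

<p>In this formulation, GN with a single group coincides with LN and GN with one group per channel coincides with IN.</p>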

<p>Instead of normalizing the inputs, it is also possible to get a normalizing effect by rescaling the weights of the network (<a href="#arpit16normprop">Arpit et al., 2016</a>).
Especially in convolutional networks, this can significantly reduce the computational overhead.
With <strong>Weight Normalization (WN)</strong> (<a href="#salimans16weightnorm">Salimans &amp; Kingma, 2016</a>), the weight vectors for each neuron are normalized to have unit norm.
This idea can also be found in an independently developed technique called <strong>Normalization Propagation (NP)</strong> (<a href="#arpit16normprop">Arpit et al., 2016</a>).
However, in contrast to WN, NP accounts for the effect of (ReLU) activation functions.
In some sense, NP can be interpreted as a variant of BN where the statistics are computed theoretically (in expectation) rather than on the fly.
<strong>Spectral Normalization (SN)</strong>, on the other hand, makes use of an induced matrix norm to normalise the entire weight matrix (<a href="#miyato18spectralnorm">Miyato et al., 2018</a>).
Concretely, the weights are scaled by the reciprocal of an approximation of the largest singular value of the weight matrix.</p>
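<p>A minimal sketch of the two weight-based schemes for a small dense weight matrix (one row per neuron) is given below.
The function names are hypothetical, the <code>gains</code> argument stands in for the learnable scale that WN puts on top of the unit-norm direction, and the largest singular value is approximated with a plain power iteration that runs for a fixed number of steps, which is a simplification of what happens during actual training.</p>

```python
import random

def weight_normalize(weights, gains):
    """Weight normalization: rescale each row (one neuron) to have norm gains[i]."""
    normalized = []
    for row, gain in zip(weights, gains):
        norm = sum(w * w for w in row) ** 0.5
        normalized.append([gain * w / norm for w in row])
    return normalized

def spectral_norm_estimate(weights, iterations=50, seed=0):
    """Power iteration estimate of the largest singular value of `weights`."""
    rng = random.Random(seed)
    rows, cols = len(weights), len(weights[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(cols)]
    for _ in range(iterations):
        u = [sum(weights[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # u = W v
        v = [sum(weights[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # v = W^T u
        scale = sum(c * c for c in v) ** 0.5
        v = [c / scale for c in v]
    u = [sum(weights[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return sum(c * c for c in u) ** 0.5  # norm of W v for the unit vector v

def spectral_normalize(weights, **kwargs):
    """Spectral normalization: divide the whole matrix by its largest singular value."""
    sigma = spectral_norm_estimate(weights, **kwargs)
    return [[w / sigma for w in row] for row in weights]

weights = [[3.0, 4.0], [0.0, 2.0]]
print(weight_normalize(weights, gains=[1.0, 1.0]))
print(spectral_norm_estimate(weights))
```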

<p>Whereas WN, NP and SN still involve the computation of some weight norm, it is also possible to obtain normalization without any computational overhead.
By creating a forward pass that induces attracting fixed points in mean and variance, <strong>Self-Normalizing Networks (SNNs)</strong> (<a href="#klambauer17selfnorm">Klambauer et al., 2017</a>) are able to effectively normalise the signal.
To achieve these fixed points, it suffices to carefully scale the ELU activation function (<a href="#clevert16elu">Clevert et al., 2016</a>) and the initial variance of the weights.
Additionally, <a href="#klambauer17selfnorm">Klambauer et al. (2017)</a> provide a way to tweak dropout so that it does not interfere with the normalization.
It is worth pointing out that SNNs do not contain any explicit normalization operations.
In this sense, an SNN could already be seen as an example of <em>normalizer-free</em> networks.</p>
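<p>The fixed point can be checked numerically with a few lines of code.
The sketch below integrates the scaled ELU against a Gaussian density, under the assumption that the pre-activations of a layer are approximately Gaussian; the function names and the integration settings are our own choices.</p>

```python
import math

# constants of the scaled ELU from Klambauer et al. (2017), rounded
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(z):
    """Scaled ELU: linear for positive inputs, a scaled exponential otherwise."""
    return SCALE * (max(z, 0.0) + ALPHA * (math.exp(min(z, 0.0)) - 1.0))

def selu_moments(mean=0.0, std=1.0, steps=20001, width=10.0):
    """Mean and variance of selu(z) for z ~ N(mean, std^2), by numerical integration."""
    dz = 2 * width * std / (steps - 1)
    first = second = 0.0
    for k in range(steps):
        z = mean - width * std + k * dz
        density = math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))
        first += selu(z) * density * dz
        second += selu(z) ** 2 * density * dz
    return first, second - first ** 2

print(selu_moments())  # close to (0, 1): standard normal inputs keep their moments
```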

<h2 id="skip-connections">Skip Connections</h2>

<p>With normalization out of the way, we probably want to tackle the <em>skip connections</em>.
After all, <a href="#brock21characterizing">Brock et al. (2021a)</a> mainly aim to rid Residual Networks (ResNets) of normalization.
Although skip connections already existed long before ResNets were invented, they are often considered one of the main contributions of <a href="#he16resnet">He et al. (2016a)</a>.
In some sense, it almost seems as if skip connections could only become popular after BN was invented, especially if we consider the effects of skip connections on the statistics of signals flowing through the network.</p>

<h3 id="history">History</h3>

<p><em>Shortcut</em> or <em>skip connections</em> make it possible for information to bypass one or more layers in a neural network.
Mathematically, they are typically expressed using a formalism of the form</p>

\[\boldsymbol{y} = \boldsymbol{x} + f(\boldsymbol{x}),\]

<p>where $f$ represents some non-linear transformation (<a href="#he16resnet">He et al., 2016a</a>, <a href="#he16preresnet">2016b</a>).
This non-linear transformation is typically a sub-network that is commonly referred to as the <em>residual branch</em> or <em>residual connection</em>.
When the output of the residual branch has different dimensions than its input, a linear transformation is typically applied on the skip connection so that its output dimension matches that of the residual branch.</p>

<p>Since it often helps to have a few lines of code to understand these vague descriptions, an implementation of the skip connections from (<a href="#he16preresnet">He et al., 2016b</a>) is given below.
The comments aim to highlight the differences with the ResNets from (<a href="#he16resnet">He et al., 2016a</a>).
For a complete implementation of this skip connection module, we refer to the <a href="#pre-activation-resnets">code</a> at the end of this post.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">preact</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>  <span class="c1"># diff 1: compute global pre-activations
</span>        <span class="n">skip</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">downsample</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">residual</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">residual_branch</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="c1"># return torch.relu(residual + skip) (diff 2)
</span>        <span class="k">return</span> <span class="n">residual</span> <span class="o">+</span> <span class="n">skip</span>
</code></pre></div></div>

<p>Skip connections became very popular in computer vision due to the work of He et al. (<a href="#he16resnet">2016a</a>).
However, they were already commonly used as a trick to improve learning in multi-layer networks before deep learning was even a thing (<a href="#ripley96pattern">Ripley, 1996</a>).
Similar to normalization methods, skip connections can improve the condition of the optimization problem by making it harder for the Hessian to become singular (<a href="#vandersmagt98solving">van der Smagt &amp; Hirzinger, 1998</a>).
However, skip connections also have benefits in the forward pass:
e.g., <a href="#srivastava15highway">Srivastava et al. (2015)</a> argue that information should be able to flow through the network without being altered.
<a href="#he16resnet">He et al. (2016a)</a>, on the other hand, claim that learning should be easier if the network can focus on the non-linear part of the transformation (and ignore the linear component).</p>

<figure id="fig_skip">
    <img src="https://iclr.iro.umontreal.ca/ae0d1532-9f35-4cf0-b83b-816a5725ebed_1642185414/public/images/2021-12-01-unnormalized-resnets/skip_connections.svg" alt="visualization of different types of skip connections" />
    <figcaption>
        Figure&nbsp;3: Variations on skip connections in ResNets, Densenets and Highway networks.
        The white blocks correspond to the input / skip connection and the blue blocks correspond to the output of the non-linear transformation.
        The greyscale blocks are values between zero and one and correspond to masks.
    </figcaption>
</figure>

<p>The general formulation of skip connections that we provided earlier captures the idea of skip connections very well.
As you might have expected, however, there are plenty of variations on the exact formulation (a few of which are illustrated in Figure <a href="#fig_skip">3</a>).
Strictly speaking, even <a href="#he16resnet">He et al. (2016a)</a> do not adhere to their own formulation because they apply an activation function on what we denoted as $\boldsymbol{y}$ (<a href="#he16preresnet">He et al., 2016b</a>; see code snippet).
In DenseNets (<a href="#huang17densenet">G. Huang et al., 2017</a>), the outputs of the skip and residual connections are concatenated instead of aggregated by means of a sum.
This retains more of the information for subsequent layers.
Other variants of skip connections make use of masks to select which information is passed on.
Highway networks (<a href="#srivasta15highway">Srivastava et al., 2015</a>) make use of a gating mechanism similar to that in Long Short-Term Memory (LSTM) (<a href="#hochreiter97lstm">Hochreiter &amp; Schmidhuber, 1997</a>).
These gates enable the network to learn how information from the skip connection is to be combined with that of the residual branch.
Similarly, Transformers (<a href="#vaswani17attention">Vaswani et al., 2017</a>) could be interpreted as a variation on highway networks without residual branches.
This comparison only holds, however, if you are willing to interpret the attention mask as some form of complex gate for the skip connection.</p>

<h3 id="moment-control">Moment Control</h3>

<p>Traditional initialization techniques manage to provide a stable starting point for the propagation of mean and variance in fully connected layers, but they do not work so well in ResNets.
The key problem is that the variance cannot remain constant when skip connections are involved.
After all, the variances of uncorrelated signals add up, so unless the residual branch outputs a zero-variance signal, the output variance must be greater than the input variance.
Moreover, if the signal has a strictly positive mean, the mean also starts drifting when skip connections are chained together.
Luckily, these drifting effects can be mitigated to some extent.
One option is to use BN, but what exactly are the alternatives?</p>

<p>Similar to standard initialization methods, the key idea to counter drifting in ResNets is to stabilise the variance propagation.
To this end, a slightly modified formulation of skip connections is typically used (e.g., <a href="#szegedy16inceptionv4">Szegedy et al., 2016</a>; <a href="#balduzzi17shattered">Balduzzi et al., 2017</a>; <a href="#hanin18how">Hanin &amp; Rolnick, 2018</a>):</p>

\[\boldsymbol{y} = \alpha \boldsymbol{x} + \beta f(\alpha \boldsymbol{x}),\]

<p>which is equivalent to the original formulation when $\alpha = \beta = 1.$
The key advantage of this formulation is that the variance can be controlled (to some extent) by tuning the newly introduced scaling factors $\alpha$ and $\beta.$
In terms of code, these modifications could look something like</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">preact</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">alpha</span> <span class="o">*</span> <span class="n">x</span><span class="p">)</span>
        <span class="n">skip</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">downsample</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">residual</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">residual_branch</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">beta</span> <span class="o">*</span> <span class="n">residual</span> <span class="o">+</span> <span class="n">skip</span>
</code></pre></div></div>

<p>A very simple counter-measure to the variance explosion in ResNets is to set $\alpha = 1 / \sqrt{2}$ (<a href="#balduzzi17shattered">Balduzzi et al., 2017</a>).
Assuming that the residual branch approximately preserves the variance, the variances of $\boldsymbol{y}$ and $\boldsymbol{x}$ should be roughly the same.
In practice, however, it seems to be more common to tune the $\beta$ factor instead of $\alpha$ (<a href="#balduzzi17shattered">Balduzzi et al., 2017</a>).
For instance, simply setting $\beta$ to some small value (e.g., in the range $[0.1, 0.3]$) can already help ResNets (with BN) to stabilise training (<a href="#szegedy16inceptionv4">Szegedy et al., 2016</a>).
It turns out that having small values for $\beta$ can help to preserve correlations between gradients, which should benefit learning (<a href="#balduzzi17shattered">Balduzzi et al., 2017</a>).</p>
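<p>To make the variance argument explicit, here is a short derivation under two simplifying assumptions that are not taken from the references above: the skip signal and the residual output are uncorrelated, and the residual branch preserves variance, i.e., $\operatorname{Var}[f(\boldsymbol{z})] = \operatorname{Var}[\boldsymbol{z}].$</p>

\[\operatorname{Var}[\boldsymbol{y}] = \alpha^2 \operatorname{Var}[\boldsymbol{x}] + \beta^2 \operatorname{Var}[f(\alpha \boldsymbol{x})] = \alpha^2 (1 + \beta^2) \operatorname{Var}[\boldsymbol{x}].\]

<p>With $\alpha = \beta = 1,$ the variance doubles at every skip connection, whereas $\alpha = 1 / \sqrt{2}$ with $\beta = 1$ gives $\operatorname{Var}[\boldsymbol{y}] = \operatorname{Var}[\boldsymbol{x}].$
Keeping $\alpha = 1$ and choosing a small $\beta$ instead limits the growth per skip connection to a factor $1 + \beta^2.$</p>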

<p>Similar findings were established through the analysis of the variance propagation in ResNets by <a href="#hanin18how">Hanin &amp; Rolnick (2018)</a>.
Eventually, they propose to set $\beta = b^l$ after the $l$-th skip connection, with $0 &lt; b &lt; 1$ to make sure that the sum of scaling factors from all layers converges.
<a href="#arpit19how">Arpit et al. (2019)</a> additionally take the backward pass into account and show that $\beta = L^{-1}$ provides stable variance propagation in a ResNet with $L$ skip connections.
Learning the scaling factor $\beta$ in each layer can also make it possible to keep the variance under control (<a href="#zhang19fixup">Zhang et al., 2019</a>; <a href="#de20skipinit">De &amp; Smith, 2020</a>).</p>
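<p>To get a feeling for these schedules, the following sketch compares the variance at the output of a chain of skip connections for different choices of $\beta.$
It assumes $\alpha = 1,$ a variance-preserving residual branch and uncorrelated signals, so that every skip connection multiplies the variance by $1 + \beta^2.$
The depth <code class="language-plaintext highlighter-rouge">L</code>, the base <code class="language-plaintext highlighter-rouge">b</code> and the helper <code class="language-plaintext highlighter-rouge">output_variance</code> are illustrative choices and do not come from the cited papers.</p>

```python
def output_variance(betas, input_var=1.0):
    """Variance after chaining skip connections y = x + beta * f(x).

    Assumes f preserves variance and its output is uncorrelated with x,
    so each skip connection multiplies the variance by (1 + beta ** 2).
    """
    var = input_var
    for beta in betas:
        var *= 1.0 + beta ** 2
    return var


L = 16   # number of skip connections (hypothetical depth)
b = 0.5  # base of the geometric schedule, 0 < b < 1 (hypothetical value)

schedules = {
    "beta = 1": [1.0] * L,                            # plain ResNet
    "beta = b^l": [b ** l for l in range(1, L + 1)],  # geometric decay
    "beta = 1/L": [1.0 / L] * L,                      # depth-dependent constant
}
for name, betas in schedules.items():
    print(f"{name:>10}: Var[y] = {output_variance(betas):.3f}")
```

<p>Under these assumptions, the plain ResNet blows up as $2^L,$ while both alternative schedules keep the output variance within a small constant factor of the input variance.</p>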

<p>There are of course also workarounds that do not quite fit the general formulation with scaling factors $\alpha$ and $\beta.$
One alternative solution is to make use of an empirical approach to weight initialization (<a href="#mishkin16lsuv">Mishkin et al., 2016</a>).
By rescaling random orthogonal weight matrices by the empirical variance of the output activations at each layer, <a href="#mishkin16lsuv">Mishkin et al. (2016)</a> show that it is possible to train ResNets without BN.
In some sense, this approach can be interpreted as choosing a scaling factor for each layer in the residual branch (and in some of the skip connections).
Instead of using the reciprocal of the empirical variance as a scaling factor, <a href="#zhang19fixup">Zhang et al. (2019)</a> scale the initial weights of the layers in each of the $L$ residual branches by a factor $L^{-1/(2k-2)},$ where $k$ is the number of layers in a residual branch.
<a href="#shao20rescalenet">Shao et al. (2020)</a> propose to combine the skip connection using the slightly modified formulation, $\boldsymbol{y} = \alpha \boldsymbol{x} + \beta f(\boldsymbol{x}),$ where $\alpha^2 = 1 - \beta^2$ and $\beta^2 = 1 / (l + c)$ for the $l$-th skip connection. 
Here, $c$ is an arbitrary constant, which was eventually set to be the number of residual branches, $L$.
For a single-layer ResNet ($l = c = 1$), this is equivalent to setting $\alpha = 1 / \sqrt{2},$ as suggested by <a href="#balduzzi17shattered">Balduzzi et al. (2017)</a>.
However, the more general approach should ensure that the outputs of residual branches are weighted similarly at the output of the network, independent of their depth.</p>
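<p>The claim about equal weighting can be checked with a few lines of arithmetic.
In the sketch below, the output of the $l$-th residual branch is scaled by $\beta_l$ and subsequently by every $\alpha_j$ with $j &gt; l.$
The function names are made up for this illustration, and the sketch only tracks scaling factors, not actual activations.</p>

```python
import math


def rescale_factors(num_branches, c=None):
    """Return (alpha_l, beta_l) for l = 1..L.

    Uses beta_l^2 = 1 / (l + c) and alpha_l^2 = 1 - beta_l^2,
    with c defaulting to the number of residual branches.
    """
    c = num_branches if c is None else c
    factors = []
    for l in range(1, num_branches + 1):
        beta_sq = 1.0 / (l + c)
        factors.append((math.sqrt(1.0 - beta_sq), math.sqrt(beta_sq)))
    return factors


def branch_weights(factors):
    """Weight of each residual branch at the network output.

    Branch l is scaled by beta_l and then by every alpha_j with j > l.
    """
    weights = []
    for l, (_, beta) in enumerate(factors):
        later_alphas = math.prod(alpha for alpha, _ in factors[l + 1:])
        weights.append(beta * later_alphas)
    return weights


factors = rescale_factors(num_branches=4)
# every squared weight is close to 1 / (L + c) = 0.125
print([round(w ** 2, 4) for w in branch_weights(factors)])
```

<p>Because $\alpha_j^2 = (j + c - 1) / (j + c),$ the product of the later scaling factors telescopes, which leaves a squared weight of $1 / (L + c)$ for every branch.</p>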

<h2 id="normalizer-free-resnets">Normalizer-Free ResNets</h2>

<p>It could be argued that the current popularity of skip connections is due to BN.
After all, without BN, the skip connections in ResNets would have suffered from the drifting effects discussed <a href="#moment-control">earlier</a>.
However, this does not change the fact that BN has a few <a href="#alternatives">practical issues</a>, and there do seem to be alternative techniques to control these drifting effects.
Therefore, it makes sense to ask whether BN is merely a useful or a truly <em>necessary</em> component of the ResNet architecture.</p>

<h3 id="old-ideas">Old Ideas</h3>

<p>Whereas some alternative normalization methods aimed to simply provide normalization in scenarios where BN does not work so well, other methods were explicitly designed to reduce or get rid of the normalization computations (e.g., <a href="#arpit16normprop">Arpit et al., 2016</a>; <a href="#salimans16weightnorm">Salimans &amp; Kingma, 2016</a>; <a href="#klambauer17selfnorm">Klambauer et al., 2017</a>).
Even the idea of training ResNets without BN is practically as old as ResNets themselves.
With their Layer-Sequential Unit-Variance (LSUV) initialization, <a href="#mishkin16lsuv">Mishkin et al. (2016)</a> showed that it is possible to replace BN with good initialization for small datasets (CIFAR-10).
Similarly, <a href="#arpit19how">Arpit et al. (2019)</a> are able to close the gap between Weight Normalization (WN) and BN by reconsidering weight initialization in ResNets.</p>

<p>Getting rid of BN in ResNets was posed as an explicit goal by <a href="#zhang19fixup">Zhang et al. (2019)</a>, who proposed the so-called FixUp initialization scheme.
On top of introducing the learnable $\beta$ parameters and the $L^{-1/(2k - 2)}$ scaling in residual branches,
they set the initial weights for the last layer in each residual branch to zero and introduce scalar biases before every layer in the network.
With these tricks, Zhang et al. show that FixUp can provide <em>almost</em> the same benefits as BN for ResNets in terms of trainability and generalization.
Using a different derivation, <a href="#de20skipinit">De &amp; Smith (2020)</a> end up with a very similar solution to train ResNets without BN, which they term SkipInit.
The key difference with FixUp is that the initial value for the learnable $\beta$ parameter is set to be less than $1 / \sqrt{L}.$
As a result, SkipInit does not require the rescaling of initial weights in residual branches or setting weights to zero, which are considered crucial parts of the FixUp strategy (<a href="#zhang19fixup">Zhang et al., 2019</a>).</p>

<h3 id="imitating-signal-propagation">Imitating Signal Propagation</h3>

<p>Although the results of prior work look promising, there is still a performance gap compared to ResNets with BN.
To close this gap, <a href="#brock21characterizing">Brock et al. (2021a)</a> suggest studying the propagation of mean and variance through ResNets by means of so-called Signal Propagation Plots (SPPs).
These SPPs simply visualise the squared mean and variance of the activations after each skip connection, as well as the variance at the end of every residual branch (before the skip connection).</p>

<p>To compute these values, the forward pass of the network must be slightly tweaked.
To this end, we can define a new method or a function that simulates the forward pass and extracts the necessary statistics for each skip connection, as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="o">@</span><span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">signal_prop</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">2</span><span class="p">)):</span>
        <span class="c1"># forward code
</span>        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">preact</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">skip</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">downsample</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">residual</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">residual_branch</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">out</span> <span class="o">=</span> <span class="n">residual</span> <span class="o">+</span> <span class="n">skip</span>

        <span class="c1"># compute necessary statistics
</span>        <span class="n">out_mu2</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">out</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">dim</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">).</span><span class="n">item</span><span class="p">()</span>
        <span class="n">out_var</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">out</span><span class="p">.</span><span class="n">var</span><span class="p">(</span><span class="n">dim</span><span class="p">)).</span><span class="n">item</span><span class="p">()</span>
        <span class="n">res_var</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">residual</span><span class="p">.</span><span class="n">var</span><span class="p">(</span><span class="n">dim</span><span class="p">)).</span><span class="n">item</span><span class="p">()</span>
        <span class="k">return</span> <span class="n">out</span><span class="p">,</span> <span class="p">(</span><span class="n">out_mu2</span><span class="p">,</span> <span class="n">out_var</span><span class="p">,</span> <span class="n">res_var</span><span class="p">)</span>
</code></pre></div></div>

<p>This allows us to analyse the statistics for a single skip connection.
By propagating a white noise signal (e.g., <code class="language-plaintext highlighter-rouge">torch.randn(1000, 3, 224, 224)</code>) through the entire ResNet, we obtain the data that allows us to produce SPPs.
We refer to the end of this post for an example <a href="#multi-layer-spp">implementation</a> of a full NF-ResNet with <code class="language-plaintext highlighter-rouge">signal_prop</code> method.</p>

<p>Figure <a href="#fig_spp">4</a> provides an example of the SPPs for pre-activation ResNets (or v2 ResNets, cf. <a href="#he16identity">He et al., 2016b</a>) with and without BN.
The SPPs on the left clearly illustrate that BN transforms the exponential growth to a linear increase in ResNets, as described in theory (e.g., <a href="#balduzzi17shattered">Balduzzi et al., 2017</a>; <a href="#de20skipinit">De &amp; Smith, 2020</a>).
When focusing on ResNets with BN (on the right of Figure <a href="#fig_spp">4</a>), it is clear that mean and variance are reduced after every sub-net, each of which consists of a few skip connections.
This reduction is due to the <em>pre-activation</em> block (BN + ReLU) that is inserted between every two sub-nets in these ResNets (remember the code snippet from earlier?).</p>

<figure id="fig_spp">
    <img src="https://iclr.iro.umontreal.ca/ae0d1532-9f35-4cf0-b83b-816a5725ebed_1642185414/public/images/2021-12-01-unnormalized-resnets/spp.svg" alt="Image with two plots. The left plot shows two signal propagation plots: one for ResNets with (increasing gray lines) and one for ResNets without (approximately flat blue lines) Batch Normalization on a logarithmic scale. The right plot shows the zig-zag lines that represent the squared mean and variance after each residual branch." width="100%" />
    <figcaption>
        Figure&nbsp;4: Example Signal Propagation Plots (SPPs) for a pre-activation (v2) ResNet-50 at initialization.
        SPPs plot the squared mean ($\mu^2$) and variance ($\sigma^2$) of the pre-activations after each skip connection ($x$-axis), as well as the variance of the residuals before the skip connection ($\sigma_f^2$, $y$-axis on the right).
        The left plot illustrates the difference between ResNets with and without BN layers.
        The plot on the right shows the same SPP for a ResNet with BN without the logarithmic scaling.
        Note that ResNet-50 has four sub-nets with 3, 4, 6 and 3 skip connections, respectively.
    </figcaption>
</figure>

<p>The goal of Normalizer-Free ResNets (NF-ResNets) is to get rid of the BN layers in ResNets while preserving the characteristics visualized in the SPPs (<a href="#brock21characterizing">Brock et al., 2021a</a>).
To get rid of the exponential variance increase in unnormalized ResNets, it suffices to set $\alpha = 1 / \sqrt{\operatorname{Var}[\boldsymbol{x}]}$ in our modified formulation of ResNets.
Here, $\operatorname{Var}[\boldsymbol{x}]$ is the variance over all samples in the dataset, such that the $\alpha$ scaling effectively mirrors the division by $\boldsymbol{\sigma}_\mathcal{B}$ in BN (assuming a large enough batch size).
Unlike BN, however, the scaling in NF-ResNets is computed analytically for every skip connection.
This is possible if the inputs to the network are properly normalized (i.e., have unit variance) and if the residual branch, $f$, properly preserves variance (i.e., is initialized correctly).
The $\beta$ parameter, on the other hand, is simply used as a hyper-parameter to directly control the variance increase after every skip connection.</p>

<p>It might be useful to point out that the proposed $\alpha$ scaling does not perfectly conform with our general formulation for ResNets.
After all, the pre-activation layers mostly end up affecting only the inputs to the residual branch, such that $\boldsymbol{y} = \boldsymbol{x} + \beta f(\alpha \boldsymbol{x})$ (see <a href="#extra-code-snippets">code</a> for details).
Only between the different sub-networks, which consist of multiple skip connections, are the pre-activations applied globally, so that the signal is normalised.
This also explains the variance drops in the SPPs for regular ResNets (see Figure <a href="#fig_spp">4</a>).
Note that this also means that the variance within sub-networks of an NF-ResNet will increase in the same way as for a ResNet with BN.
Although it would have been perfectly possible to maintain a steady variance, NF-ResNets are effectively designed to mimic the signal propagation due to BN layers in regular ResNets.</p>
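<p>The analytic computation of $\alpha$ can be sketched as a simple bookkeeping loop over the expected variance.
The sketch assumes unit-variance inputs, variance-preserving residual branches and a global pre-activation in the first block of every sub-network.
The function <code class="language-plaintext highlighter-rouge">nf_alphas</code> and the value $\beta = 0.2$ are illustrative, and the default depths correspond to the ResNet-50 layout from Figure <a href="#fig_spp">4</a>.</p>

```python
import math


def nf_alphas(depths=(3, 4, 6, 3), beta=0.2):
    """Analytic alpha for every skip connection, grouped per sub-network.

    Assumes unit-variance inputs and variance-preserving residual branches,
    so that each skip connection adds beta ** 2 to the expected variance.
    """
    expected_var = 1.0
    alphas = []
    for depth in depths:
        stage = []
        for block in range(depth):
            # alpha = 1 / sqrt(Var[x]) for the input of this block
            stage.append(1.0 / math.sqrt(expected_var))
            if block == 0:
                # global pre-activation: the skip path is rescaled as well,
                # which resets the expected variance to one
                expected_var = 1.0
            expected_var += beta ** 2
        alphas.append(stage)
    return alphas


for stage, values in enumerate(nf_alphas(), start=1):
    print(f"sub-net {stage}:", [round(a, 4) for a in values])
```

<p>Within a sub-network the expected variance grows linearly, and therefore $\alpha$ shrinks, until the next global pre-activation resets it.
This matches the zig-zag pattern in the SPPs.</p>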

<figure id="fig_nfresnet">
    <img src="https://iclr.iro.umontreal.ca/ae0d1532-9f35-4cf0-b83b-816a5725ebed_1642185414/public/images/2021-12-01-unnormalized-resnets/spp_nfresnet.svg" alt="Image with two plots. The left plot shows two SPPs: one for a ResNet with Batch Normalization (gray lines) and one for a Normalizer-Free ResNet (blue lines). The curves representing variance for both models are very close to each other, but the curve for the mean is quite different. The right plot is similar, but now the blue mean and residual variance curves are zero and one everywhere, respectively." width="100%" />
    <figcaption>
        Figure&nbsp;5: SPPs comparing an NF-ResNet-50 to a ResNet with BN at initialization.
        The NF-ResNet in the left plot only uses the $\alpha$ and $\beta$ scaling parameters.
        The right plot displays the behaviour of an NF-ResNet with Centered Weight Normalization.
        Note that the variance of the residuals in the right plot should give some insights as to why the curves do not overlap.
    </figcaption>
</figure>

<p>As can be seen in the left plot of Figure <a href="#fig_nfresnet">5</a>, a plain NF-ResNet imitates the variance propagation of the baseline ResNet quite accurately.
The propagation of the squared mean in NF-ResNets, on the other hand, looks nothing like that of the BN model.
After all, the considerations that led to the scaling parameters only cover the variance propagation.
On top of that, it turns out that the variance of the residual branches (right before they are merged with the skip connection) is not particularly steady.
This indicates that the residual branches do not properly preserve variance, which is necessary for the analytic computations of $\alpha$ to be correct.</p>

<p>It turns out that both of these discrepancies can be resolved by introducing a variant of Centered Weight Normalization (CWN; <a href="#huang17centred">L. Huang et al., 2017</a>) to NF-ResNets.
CWN simply applies WN after subtracting the weight mean from each weight vector, which ensures that every output has zero mean and that the variance of the weights is constant.
<a href="#brock21characterizing">Brock et al. (2021a)</a> additionally rescale the normalized weights to account for the effect of activation functions (cf. <a href="#arpit16normprop">Arpit et al., 2016</a>).
The effect of including the rescaled CWN in NF-ResNets is illustrated in the right part of Figure <a href="#fig_nfresnet">5</a>.</p>
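<p>A minimal sketch of such a rescaled CWN is given below, written in plain Python on nested lists to keep it self-contained.
It assumes that every row is the weight vector of one output unit and that a ReLU precedes the layer.
The gain follows from $\operatorname{Var}[\max(0, z)] = (1 - 1 / \pi) / 2$ for $z \sim \mathcal{N}(0, 1).$
The names <code class="language-plaintext highlighter-rouge">centered_weight_norm</code> and <code class="language-plaintext highlighter-rouge">RELU_GAIN</code> are made up for this illustration and are not taken from the reference implementation.</p>

```python
import math

# gain that restores unit variance after a ReLU on standard normal inputs:
# Var[relu(z)] = (1 - 1 / pi) / 2 for z ~ N(0, 1)
RELU_GAIN = math.sqrt(2.0 / (1.0 - 1.0 / math.pi))


def centered_weight_norm(weights, gain=1.0, eps=1e-12):
    """Center and normalise every weight vector (row), then rescale by gain."""
    normalised = []
    for row in weights:
        mean = sum(row) / len(row)
        centered = [w - mean for w in row]                # zero-mean weights
        norm = math.sqrt(sum(w * w for w in centered))    # L2 norm after centering
        normalised.append([gain * w / (norm + eps) for w in centered])
    return normalised


raw = [[0.5, -1.0, 2.0, 0.1], [1.0, 1.5, -0.3, 0.2]]  # arbitrary example weights
w_hat = centered_weight_norm(raw, gain=RELU_GAIN)
print([round(sum(row), 6) for row in w_hat])                 # row sums are numerically zero
print([round(sum(w * w for w in row), 4) for row in w_hat])  # squared norms equal gain ** 2
```

<p>Because the weights in each row sum to zero, a constant mean in the inputs is cancelled at the output.
The fixed norm, combined with the gain, keeps the output variance close to one for uncorrelated inputs coming out of a ReLU.</p>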

<h3 id="performance">Performance</h3>

<p>Empirically, <a href="#brock21characterizing">Brock et al. (2021a)</a> show that NF-ResNets with standard regularization methods perform on par with traditional ResNets that are using BN.
An important <a href="https://github.com/deepmind/deepmind-research/blob/ba761289c157fc151c7f06aa37b812d8100561db/nfnets/resnet.py#L158-L159">detail</a> that is not apparent from the text, however, is that their baseline ResNets use the (standard) “<em>BN -&gt; ReLU</em>” order and not the “<em>ReLU -&gt; BN</em>” order, which served as the model for the signal propagation of NF-ResNets.
This is also why the SPPs in Figure <a href="#fig_nfresnet">5</a>, which depict the “<em>ReLU -&gt; BN</em>” order, do not perfectly overlap, unlike the figures in <a href="#brock21characterizing">Brock et al. (2021a)</a>.</p>

<p>Because BN does induce computational overhead, it seems natural to expect NF-ResNets to allow for more computationally efficient models.
Therefore, <a href="#brock21characterizing">Brock et al. (2021a)</a> also compare NF-ResNets with a set of architectures that are optimized for efficiency.
However, it turns out that some of these architectures do not play well with the weight normalization that is typically used in NF-ResNets.
As a result, normalizer-free versions of EfficientNets (<a href="#tan19efficientnet">Tan &amp; Le, 2019</a>) lag behind their BN counterparts.
When applied to (naive) RegNets (<a href="#radosovic20regnet">Radosavovic et al., 2020</a>), however, the performance gap with EfficientNets can be reduced by introducing the NF-ResNet scheme.
In subsequent work, <a href="#brock21highperformance">Brock et al. (2021b)</a> show that NF-ResNets in combination with gradient clipping are able to outperform similar networks with BN.</p>

<h2 id="discussion">Discussion</h2>

<p>NF-ResNets show that it is possible to build networks without BN that are able to achieve competitive prediction performance.
It is not yet clear, however, whether the ideas of NF-ResNets could make BN entirely obsolete.
Therefore, it is worth taking a closer look at the limitations of NF-ResNets.
Assuming that the ideas in NF-ResNets can make BN (at least partly) obsolete, this should also provide some insights as to what the important factors are to explain the success of BN.</p>

<h3 id="limitations">Limitations</h3>

<p>First of all, the exact procedure for scaling residual branches is only meaningful for architectures that include (some sort of) skip connections.
In general, it is not possible to apply the ideas behind NF-ResNets to get rid of BN layers in arbitrary architectures.
Furthermore, NF-ResNets still rely on normalization methods to attain good performance — in contrast to what their name might suggest.
<a href="#brock21characterizing">Brock et al. (2021a)</a> emphasise that they effectively do away with <em>activation normalization</em>, but they do rely on an adaptation of Weight Normalization to replace BN.
In this sense, it is arguable whether NF-ResNets are truly normalizer-free.
Finally, some of the problems with BN are either not resolved or are reintroduced when building competitive NF-ResNets.
For example, there are still differences between training and testing when using plain dropout regularization, and CWN still introduces a certain computational overhead during training.</p>

<h3 id="insights">Insights</h3>

<p>In the end, an NF-ResNet can be interpreted as consisting of different components that model parts of what BN normally does.
For example, the $\alpha$ scaling factor used in NF-ResNets obviously models the division by the standard deviation of BN.
It is also easy to see that the implicit regularization that is attributed to BN can be replaced by explicit regularization schemes.
Furthermore, the mean subtraction in BN is practically implemented by means of the weight centering in CWN.
Also, the scale-invariance of the weights of BN is re-introduced through CWN.
The input scale-invariance that BN introduces in each layer, on the other hand, is lost when using CWN.
When considering the entire residual branch (or network), however, $\alpha$ does enable some sort of scale-invariance for the entirety of this branch (or network).
Finally, the affine transformation after the normalization in BN is modelled by scaling the result of CWN.
Note that the affine shift does not need to be modelled explicitly, since CWN does not annihilate the regular bias parameters of the layers it acts upon, in contrast to BN.</p>

<p>Although the effects of BN on the forward pass seem to be modelled quite well by NF-ResNets, the effects on the backward pass seem to be largely ignored by <a href="#brock21characterizing">Brock et al. (2021a)</a>.
Follow-up work by <a href="#brock21highperformance">Brock et al. (2021b)</a> suggests that these effects might be important after all.
After all, the gradient flow in NF-ResNets is only affected by the scaling factors, $\alpha$ and $\beta,$ since CWN does not affect the gradients w.r.t. the inputs.
Therefore, regular NF-ResNets do not have a gradient centering (<a href="#schraudolph98centering">Schraudolph, 1998</a>) component, as can be found in BN layers.
However, an adaptive gradient clipping scheme (<a href="#brock21highperformance">Brock et al., 2021b</a>) seems to provide an effective alternative to the gradient dynamics that are inherent to BN.</p>
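<p>To give an idea of what adaptive gradient clipping could look like, the sketch below clips the gradient of every unit (row) whenever its norm exceeds a fraction of the corresponding weight norm.
This is a simplified reading of the unit-wise rule in plain Python.
The parameter names <code class="language-plaintext highlighter-rouge">clip</code> and <code class="language-plaintext highlighter-rouge">eps</code> and their default values are assumptions for illustration, not the settings of the original work.</p>

```python
import math


def adaptive_grad_clip(weights, grads, clip=0.01, eps=1e-3):
    """Clip the gradient of every unit (row) relative to its weight norm."""
    clipped = []
    for w_row, g_row in zip(weights, grads):
        # lower-bound the weight norm so that zero-initialised units can still learn
        w_norm = max(math.sqrt(sum(w * w for w in w_row)), eps)
        g_norm = math.sqrt(sum(g * g for g in g_row))
        max_norm = clip * w_norm
        if g_norm > max_norm:
            # rescale so that the gradient norm equals clip * weight norm
            g_row = [g * max_norm / g_norm for g in g_row]
        clipped.append(list(g_row))
    return clipped


weights = [[3.0, 4.0], [0.6, 0.8]]    # two units with norms 5 and 1
grads = [[0.03, 0.04], [0.3, 0.4]]    # gradient norms 0.05 and 0.5
print(adaptive_grad_clip(weights, grads, clip=0.1))
```

<p>In this toy example, only the gradient of the second unit is rescaled, because its norm exceeds one tenth of the weight norm.
Unlike clipping with a fixed threshold, the allowed gradient size thus adapts to the scale of the weights.</p>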

<h3 id="conclusion">Conclusion</h3>

<p>NF-ResNets show that it is possible to get rid of BN in ResNets without throwing away predictive performance.
However, NF-ResNets still rely on weight normalization schemes to make the models competitive with their BN counterparts.
Therefore, it could be argued that NF-ResNets are not entirely <em>normalizer-free</em>.
It almost seems as if NF-ResNets are an example of how BN can be imitated using different components, rather than how to get rid of it.
This also means that it is hard to distil meaningful insights as to why/how BN works so well.
One thing that this approach does make clear is that the backward dynamics due to BN should be part of the explanation.</p>

<p><strong>TL;DR:</strong> NF-ResNets, rescaled ResNets with Centered Weight Normalization, can be used to imitate the forward pass of ResNets with BN, but they do not help much to explain what makes BN so successful.</p>

<hr />

<h2 id="extra-code-snippets">Extra Code Snippets</h2>

<p>To facilitate the implementation of pre-activation ResNets and to give a full example of how to implement the signal propagation plots, we provide additional code snippets in <a href="https://pytorch.org">PyTorch</a>.</p>

<h4 id="pre-activation-resnets">Pre-activation ResNets</h4>

<p>The first snippet implements skip connections according to (<a href="#he16preresnet">He et al., 2016b</a>).
The comments aim to highlight the differences with the ResNets from (<a href="#he16resnet">He et al., 2016a</a>), for which an <a href="https://github.com/pytorch/vision/blob/v0.11.2/torchvision/models/resnet.py#L86-L141">implementation</a> is included in the <a href="https://pytorch.org/vision/stable/models.html#id10">Torchvision</a> library.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">torch</span> <span class="kn">import</span> <span class="n">nn</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Callable</span>


<span class="k">class</span> <span class="nc">PreResidualBottleneck</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>

    <span class="n">expansion</span> <span class="o">=</span> <span class="mi">4</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span>
        <span class="n">inplanes</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">planes</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">stride</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span>
        <span class="n">downsample</span><span class="p">:</span> <span class="n">nn</span><span class="p">.</span><span class="n">Module</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
        <span class="n">groups</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span>
        <span class="n">base_width</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">64</span><span class="p">,</span>
        <span class="n">dilation</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span>
        <span class="n">norm_layer</span><span class="p">:</span> <span class="n">Callable</span><span class="p">[[</span><span class="nb">int</span><span class="p">],</span> <span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
        <span class="n">no_preact</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">False</span><span class="p">,</span>  <span class="c1"># additional argument
</span>    <span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="k">if</span> <span class="n">norm_layer</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">norm_layer</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">BatchNorm2d</span>

        <span class="c1">### pre-activations ###
</span>        <span class="n">preact_layers</span> <span class="o">=</span> <span class="p">[]</span> <span class="k">if</span> <span class="n">no_preact</span> <span class="k">else</span> <span class="p">[</span>
            <span class="n">norm_layer</span><span class="p">(</span><span class="n">inplanes</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
        <span class="p">]</span>
        <span class="k">if</span> <span class="n">downsample</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">preact</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Identity</span><span class="p">()</span>
            <span class="n">residual_preact</span> <span class="o">=</span> <span class="n">preact_layers</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">preact</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span><span class="o">*</span><span class="n">preact_layers</span><span class="p">)</span>
            <span class="n">residual_preact</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="c1">### pre-activations ###
</span>
        <span class="n">kernel_size</span> <span class="o">=</span> <span class="mi">3</span>
        <span class="n">width</span> <span class="o">=</span> <span class="n">groups</span> <span class="o">*</span> <span class="p">(</span><span class="n">planes</span> <span class="o">*</span> <span class="n">base_width</span> <span class="o">//</span> <span class="mi">64</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">downsample</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Identity</span><span class="p">()</span> <span class="k">if</span> <span class="n">downsample</span> <span class="ow">is</span> <span class="bp">None</span> <span class="k">else</span> <span class="n">downsample</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">residual_branch</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="o">*</span><span class="n">residual_preact</span><span class="p">,</span>  <span class="c1"># include residual pre-activations
</span>            <span class="n">nn</span><span class="p">.</span><span class="n">Conv2d</span><span class="p">(</span><span class="n">inplanes</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">),</span>
            <span class="n">norm_layer</span><span class="p">(</span><span class="n">width</span><span class="p">),</span> <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Conv2d</span><span class="p">(</span><span class="n">width</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="n">kernel_size</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="n">dilation</span><span class="p">,</span>
                      <span class="n">dilation</span><span class="o">=</span><span class="n">dilation</span><span class="p">,</span> <span class="n">groups</span><span class="o">=</span><span class="n">groups</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">),</span>
            <span class="n">norm_layer</span><span class="p">(</span><span class="n">width</span><span class="p">),</span> <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Conv2d</span><span class="p">(</span><span class="n">width</span><span class="p">,</span> <span class="n">planes</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">expansion</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">),</span>
            <span class="c1"># norm_layer(planes * self.expansion),
</span>        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">preact</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>  <span class="c1"># compute global pre-activations
</span>        <span class="n">skip</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">downsample</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">residual</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">residual_branch</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="c1"># return torch.relu(residual + skip)
</span>        <span class="k">return</span> <span class="n">residual</span> <span class="o">+</span> <span class="n">skip</span>

</code></pre></div></div>

<h4 id="nf-resnets">NF-ResNets</h4>

<p>When comparing the code for a skip connection in an NF-ResNet with that of a regular batch-normalized ResNet, we find only a few minor changes.
There are so few that it is more efficient to look at the <code class="language-plaintext highlighter-rouge">diff</code> output than at the full code.</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -4,3 +4,3 @@</span>
 
<span class="gd">-class PreResidualBottleneck(nn.Module):
</span><span class="gi">+class NFResidualBottleneck(nn.Module):
</span> 
<span class="p">@@ -16,3 +16,4 @@</span>
         dilation: int = 1,
<span class="gd">-        norm_layer: Callable[[int], nn.Module] = None,
</span><span class="gi">+        alpha: float = 1.,
+        beta: float = 1.,
</span>         no_preact: bool = False,  # additional argument
<span class="p">@@ -20,4 +21,3 @@</span>
         super().__init__()
<span class="gd">-        if norm_layer is None:
-            norm_layer = nn.BatchNorm2d
</span><span class="gi">+        self.beta = beta
</span> 
<span class="p">@@ -25,3 +25,3 @@</span>
         preact_layers = [] if no_preact else [
<span class="gd">-            norm_layer(inplanes),
</span><span class="gi">+            Scaling(alpha),
</span>             nn.ReLU(),
<span class="p">@@ -41,8 +41,8 @@</span>
             *residual_preact,  # include residual pre-activations
<span class="gd">-            nn.Conv2d(inplanes, width, 1, bias=False),
-            norm_layer(width), nn.ReLU(),
</span><span class="gi">+            nn.Conv2d(inplanes, width, 1, bias=True),
+            nn.ReLU(),
</span>             nn.Conv2d(width, width, kernel_size, stride, padding=dilation,
<span class="gd">-                      dilation=dilation, groups=groups, bias=False),
-            norm_layer(width), nn.ReLU(),
-            nn.Conv2d(width, planes * self.expansion, 1, bias=False),
</span><span class="gi">+                      dilation=dilation, groups=groups, bias=True),
+            nn.ReLU(),
+            nn.Conv2d(width, planes * self.expansion, 1, bias=True),
</span>             # norm_layer(planes * self.expansion),
<span class="p">@@ -55,2 +55,2 @@</span>
         # return torch.relu(residual + skip)
<span class="gd">-        return residual + skip
</span><span class="gi">+        return self.beta * residual + skip
</span>
</code></pre></div></div>

<p>The patch above shows that only three things change: the BN layers in the residual branch are removed (and the convolutions get bias terms instead), the BN layer in the pre-activation is replaced by a constant scaling with $\alpha$, and the residual branch is multiplied by $\beta$ before it is added to the skip connection.
These changes are effectively everything that needs to be done.
Admittedly, the <code class="language-plaintext highlighter-rouge">Scaling</code> module is not part of PyTorch, but it is easy enough to create:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Scaling</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">scale</span><span class="p">:</span> <span class="nb">float</span><span class="p">):</span>
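        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">scale</span> <span class="o">=</span> <span class="n">scale</span>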
    
    <span class="k">def</span> <span class="nf">__repr__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">__class__</span><span class="p">.</span><span class="n">__name__</span><span class="si">}</span><span class="s">(</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">scale</span><span class="si">}</span><span class="s">)"</span>
    
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">scale</span> <span class="o">*</span> <span class="n">x</span>

</code></pre></div></div>
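
<p>As a quick sanity check, the sketch below applies such a <code class="language-plaintext highlighter-rouge">Scaling</code> module to a signal with a known variance. It is only an illustration: the input shape, the variance of 4 and the variable names are assumptions made for this example, and the class is repeated so that the snippet runs on its own.</p>

```python
import torch
from torch import nn


class Scaling(nn.Module):
    """Multiply the input by a fixed (non-learnable) constant."""

    def __init__(self, scale: float):
        super().__init__()  # sets up the internal nn.Module state
        self.scale = scale

    def forward(self, x):
        return self.scale * x


torch.manual_seed(0)
# hypothetical block input with variance 4 (the "expected variance")
x = 2.0 * torch.randn(64, 8, 16, 16)
alpha = 1.0 / 4.0 ** 0.5  # alpha = 1 / sqrt(expected variance)

# same layout as the pre-activation of the NF block: scaling, then ReLU
preact = nn.Sequential(Scaling(alpha), nn.ReLU())
scaled = preact[0](x)
out = preact(x)

print(f"variance before scaling: {x.var().item():.2f}")
print(f"variance after scaling:  {scaled.var().item():.2f}")
```

<p>If $\alpha$ matches the actual variance of the incoming signal, the scaled signal should have a variance close to one, which is the role BN played in the pre-activation.</p>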

<p>Putting everything together, including the <code class="language-plaintext highlighter-rouge">signal_prop</code> method introduced <a href="#imitating-signal-propagation">earlier</a>, the resulting code should correspond to the following:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">torch</span> <span class="kn">import</span> <span class="n">nn</span>


<span class="k">class</span> <span class="nc">NFResidualBottleneck</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>

    <span class="n">expansion</span> <span class="o">=</span> <span class="mi">4</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span>
        <span class="n">inplanes</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">planes</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">stride</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span>
        <span class="n">downsample</span><span class="p">:</span> <span class="n">nn</span><span class="p">.</span><span class="n">Module</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
        <span class="n">groups</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span>
        <span class="n">base_width</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">64</span><span class="p">,</span>
        <span class="n">dilation</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span>
        <span class="n">alpha</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">1.</span><span class="p">,</span>
        <span class="n">beta</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">1.</span><span class="p">,</span>
        <span class="n">no_preact</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">False</span><span class="p">,</span>
    <span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">beta</span> <span class="o">=</span> <span class="n">beta</span>

        <span class="c1">### pre-activations ###
</span>        <span class="n">preact_layers</span> <span class="o">=</span> <span class="p">[]</span> <span class="k">if</span> <span class="n">no_preact</span> <span class="k">else</span> <span class="p">[</span>
            <span class="n">Scaling</span><span class="p">(</span><span class="n">alpha</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
        <span class="p">]</span>
        <span class="k">if</span> <span class="n">downsample</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">preact</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Identity</span><span class="p">()</span>
            <span class="n">residual_preact</span> <span class="o">=</span> <span class="n">preact_layers</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">preact</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span><span class="o">*</span><span class="n">preact_layers</span><span class="p">)</span>
            <span class="n">residual_preact</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="c1">### pre-activations ###
</span>
        <span class="n">kernel_size</span> <span class="o">=</span> <span class="mi">3</span>
        <span class="n">width</span> <span class="o">=</span> <span class="n">groups</span> <span class="o">*</span> <span class="p">(</span><span class="n">planes</span> <span class="o">*</span> <span class="n">base_width</span> <span class="o">//</span> <span class="mi">64</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">downsample</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Identity</span><span class="p">()</span> <span class="k">if</span> <span class="n">downsample</span> <span class="ow">is</span> <span class="bp">None</span> <span class="k">else</span> <span class="n">downsample</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">residual_branch</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="o">*</span><span class="n">residual_preact</span><span class="p">,</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Conv2d</span><span class="p">(</span><span class="n">inplanes</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">True</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Conv2d</span><span class="p">(</span><span class="n">width</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="n">kernel_size</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="n">dilation</span><span class="p">,</span>
                      <span class="n">dilation</span><span class="o">=</span><span class="n">dilation</span><span class="p">,</span> <span class="n">groups</span><span class="o">=</span><span class="n">groups</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">True</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Conv2d</span><span class="p">(</span><span class="n">width</span><span class="p">,</span> <span class="n">planes</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">expansion</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">True</span><span class="p">),</span>
        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">preact</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">skip</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">downsample</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">residual</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">residual_branch</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">beta</span> <span class="o">*</span> <span class="n">residual</span> <span class="o">+</span> <span class="n">skip</span>
    
    <span class="o">@</span><span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">signal_prop</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">2</span><span class="p">)):</span>
        <span class="c1"># forward code
</span>        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">preact</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">skip</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">downsample</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">residual</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">residual_branch</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">out</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">beta</span> <span class="o">*</span> <span class="n">residual</span> <span class="o">+</span> <span class="n">skip</span>

        <span class="c1"># compute necessary statistics
</span>        <span class="n">out_mu2</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">out</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">dim</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">).</span><span class="n">item</span><span class="p">()</span>
        <span class="n">out_var</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">out</span><span class="p">.</span><span class="n">var</span><span class="p">(</span><span class="n">dim</span><span class="p">)).</span><span class="n">item</span><span class="p">()</span>
        <span class="n">res_var</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">residual</span><span class="p">.</span><span class="n">var</span><span class="p">(</span><span class="n">dim</span><span class="p">)).</span><span class="n">item</span><span class="p">()</span>
        <span class="k">return</span> <span class="n">out</span><span class="p">,</span> <span class="p">(</span><span class="n">out_mu2</span><span class="p">,</span> <span class="n">out_var</span><span class="p">,</span> <span class="n">res_var</span><span class="p">)</span>

</code></pre></div></div>
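
<p>The <code class="language-plaintext highlighter-rouge">alpha</code> and <code class="language-plaintext highlighter-rouge">beta</code> arguments are plain constants, so they have to be chosen when the blocks are created. The sketch below shows one way the $\alpha$ values for a single sub-net could be derived analytically. It assumes that every residual branch outputs a unit-variance signal, so that each block adds $\beta^2$ to the variance on the skip path, and that the first block of the sub-net also normalises its skip path. The helper name <code class="language-plaintext highlighter-rouge">alpha_schedule</code> and its arguments are hypothetical.</p>

```python
def alpha_schedule(num_blocks: int, beta: float, input_var: float = 1.0):
    """Return the alpha of each block in one sub-net (hypothetical helper).

    Assumes num_blocks is at least 1 and that every residual branch
    outputs unit variance, so each block adds beta ** 2 to the variance.
    """
    # the first (transition) block rescales its input to unit variance
    alphas = [1.0 / input_var ** 0.5]
    # its skip path is rescaled as well, so the expected variance resets
    expected_var = 1.0 + beta ** 2
    for _ in range(1, num_blocks):
        # rescale the block input back to unit variance
        alphas.append(1.0 / expected_var ** 0.5)
        # identity skip path: variance accumulates from block to block
        expected_var += beta ** 2
    return alphas, expected_var


alphas, final_var = alpha_schedule(num_blocks=3, beta=1.0)
print(alphas, final_var)
```

<p>The returned variance can be passed as <code class="language-plaintext highlighter-rouge">input_var</code> for the next sub-net, so that its first block undoes the variance accumulated so far.</p>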

<p>The code for a full NF-ResNet (with multiple multi-layer sub-nets) can be found in the code snippet for <a href="#multi-layer-spps">multi-layer SPPs</a>.</p>

<h4 id="multi-layer-spps">Multi-layer SPPs</h4>

<p>In order to give an example of how to collect the SPP data for a multi-layer ResNet, the snippet below provides code for an NF-ResNet.
For the sake of <em>brevity</em>, the implementation for CWN has been omitted here.
This code is inspired by the <a href="https://github.com/pytorch/vision/blob/v0.11.2/torchvision/models/resnet.py#L144-L249"><code class="language-plaintext highlighter-rouge">ResNet</code></a> implementation from Torchvision.
If you want to use this code, make sure that the <code class="language-plaintext highlighter-rouge">NFResidualBottleneck</code> module also provides a <code class="language-plaintext highlighter-rouge">signal_prop</code> method, as introduced <a href="#imitating-signal-propagation">earlier</a>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">NFResidualNetwork</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>

    <span class="o">@</span><span class="nb">staticmethod</span>
    <span class="k">def</span> <span class="nf">initialisation</span><span class="p">(</span><span class="n">m</span><span class="p">:</span> <span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
        <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">nn</span><span class="p">.</span><span class="n">Conv2d</span><span class="p">):</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">init</span><span class="p">.</span><span class="n">kaiming_normal_</span><span class="p">(</span><span class="n">m</span><span class="p">.</span><span class="n">weight</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">m</span><span class="p">.</span><span class="n">bias</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
                <span class="n">nn</span><span class="p">.</span><span class="n">init</span><span class="p">.</span><span class="n">zeros_</span><span class="p">(</span><span class="n">m</span><span class="p">.</span><span class="n">bias</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">layers</span><span class="p">:</span> <span class="nb">tuple</span><span class="p">,</span> <span class="n">num_classes</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1000</span><span class="p">,</span> <span class="n">beta</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">1.</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="n">block</span> <span class="o">=</span> <span class="n">NFResidualBottleneck</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_inplanes</span> <span class="o">=</span> <span class="mi">64</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_expected_var</span> <span class="o">=</span> <span class="mf">1.</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">beta</span> <span class="o">=</span> <span class="n">beta</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">intro</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Conv2d</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="n">kernel_size</span><span class="o">=</span><span class="mi">7</span><span class="p">,</span> <span class="n">stride</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="mi">3</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">MaxPool2d</span><span class="p">(</span><span class="n">kernel_size</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">stride</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
        <span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">subnet1</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_make_subnet</span><span class="p">(</span><span class="n">block</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="n">layers</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">no_preact</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">subnet2</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_make_subnet</span><span class="p">(</span><span class="n">block</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="n">layers</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">stride</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">subnet3</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_make_subnet</span><span class="p">(</span><span class="n">block</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="n">layers</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">stride</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">subnet4</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_make_subnet</span><span class="p">(</span><span class="n">block</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="n">layers</span><span class="p">[</span><span class="mi">3</span><span class="p">],</span> <span class="n">stride</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">classifier</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">AdaptiveAvgPool2d</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Flatten</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">512</span> <span class="o">*</span> <span class="n">block</span><span class="p">.</span><span class="n">expansion</span><span class="p">,</span> <span class="n">num_classes</span><span class="p">),</span>
        <span class="p">)</span>

        <span class="bp">self</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">initialisation</span><span class="p">)</span>
        <span class="c1"># self.apply(CentredWeightNormalization(dim=(1, 2, 3)))
</span>    
    <span class="k">def</span> <span class="nf">_make_subnet</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">block</span><span class="p">,</span> <span class="n">planes</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">num_layers</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> 
                     <span class="n">stride</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="n">no_preact</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">False</span><span class="p">):</span>
        <span class="n">downsample</span> <span class="o">=</span> <span class="bp">None</span>
        <span class="k">if</span> <span class="n">stride</span> <span class="o">!=</span> <span class="mi">1</span> <span class="ow">or</span> <span class="bp">self</span><span class="p">.</span><span class="n">_inplanes</span> <span class="o">!=</span> <span class="n">planes</span> <span class="o">*</span> <span class="n">block</span><span class="p">.</span><span class="n">expansion</span><span class="p">:</span>
            <span class="n">downsample</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Conv2d</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_inplanes</span><span class="p">,</span> <span class="n">planes</span> <span class="o">*</span> <span class="n">block</span><span class="p">.</span><span class="n">expansion</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">stride</span><span class="p">)</span>
        
        <span class="n">layers</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="c1"># compute expected variance analytically
</span>        <span class="n">alpha</span> <span class="o">=</span> <span class="mf">1.</span> <span class="o">/</span> <span class="bp">self</span><span class="p">.</span><span class="n">_expected_var</span> <span class="o">**</span> <span class="p">.</span><span class="mi">5</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_expected_var</span> <span class="o">=</span> <span class="mf">1.</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">beta</span> <span class="o">**</span> <span class="mi">2</span>
        <span class="n">layers</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">block</span><span class="p">(</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_inplanes</span><span class="p">,</span> <span class="n">planes</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">downsample</span><span class="p">,</span>
            <span class="n">alpha</span><span class="o">=</span><span class="n">alpha</span><span class="p">,</span> <span class="n">beta</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">beta</span><span class="p">,</span> <span class="n">no_preact</span><span class="o">=</span><span class="n">no_preact</span>
        <span class="p">))</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_inplanes</span> <span class="o">=</span> <span class="n">planes</span> <span class="o">*</span> <span class="n">block</span><span class="p">.</span><span class="n">expansion</span>
        <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">num_layers</span><span class="p">):</span>
            <span class="c1"># track expected variance analytically
</span>            <span class="n">alpha</span> <span class="o">=</span> <span class="mf">1.</span> <span class="o">/</span> <span class="bp">self</span><span class="p">.</span><span class="n">_expected_var</span> <span class="o">**</span> <span class="p">.</span><span class="mi">5</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_expected_var</span> <span class="o">+=</span> <span class="bp">self</span><span class="p">.</span><span class="n">beta</span> <span class="o">**</span> <span class="mi">2</span>
            <span class="n">layers</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">block</span><span class="p">(</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">_inplanes</span><span class="p">,</span> <span class="n">planes</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="n">alpha</span><span class="p">,</span> <span class="n">beta</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">beta</span>
            <span class="p">))</span>
        
        <span class="k">return</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span><span class="o">*</span><span class="n">layers</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">intro</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">subnet1</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">subnet2</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">subnet3</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">subnet4</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">classifier</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>

    <span class="o">@</span><span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">signal_prop</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">2</span><span class="p">)):</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">intro</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>

        <span class="n">statistics</span> <span class="o">=</span> <span class="p">[(</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">dim</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">).</span><span class="n">item</span><span class="p">(),</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">var</span><span class="p">(</span><span class="n">dim</span><span class="p">)).</span><span class="n">item</span><span class="p">(),</span>
            <span class="nb">float</span><span class="p">(</span><span class="s">'nan'</span><span class="p">),</span>
        <span class="p">)]</span>
        <span class="k">for</span> <span class="n">subnet</span> <span class="ow">in</span> <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">subnet1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">subnet2</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">subnet3</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">subnet4</span><span class="p">):</span>
            <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="n">subnet</span><span class="p">:</span>
                <span class="n">x</span><span class="p">,</span> <span class="n">stats</span> <span class="o">=</span> <span class="n">layer</span><span class="p">.</span><span class="n">signal_prop</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">dim</span><span class="p">)</span>
                <span class="n">statistics</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">stats</span><span class="p">)</span>
        
        <span class="c1"># convert list of tuples to tuple of lists
</span>        <span class="n">sp</span> <span class="o">=</span> <span class="nb">tuple</span><span class="p">(</span><span class="nb">map</span><span class="p">(</span><span class="nb">list</span><span class="p">,</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">statistics</span><span class="p">)))</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">classifier</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="n">sp</span>

</code></pre></div></div>
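<p>As a rough illustration of the analytical variance tracking in the loop above, the following stand-alone sketch lists the scaling factor and the expected variance for consecutive residual blocks. The function name <code>expected_variance_schedule</code> and the <code>initial_var</code> argument are hypothetical and not part of the implementation above. The sketch also assumes that every block adds exactly <code>beta ** 2</code> to the variance, and it ignores any reset at transition blocks.</p>

```python
def expected_variance_schedule(num_blocks, beta=1.0, initial_var=1.0):
    """Return a list of (alpha, expected variance after the block) pairs.

    Hypothetical helper: mirrors only the bookkeeping of the loop above,
    without building any layers.
    """
    expected_var = initial_var
    schedule = []
    for _ in range(num_blocks):
        # scale the block input back to (approximately) unit variance
        alpha = 1.0 / expected_var ** 0.5
        # the residual branch is assumed to add beta^2 to the variance
        expected_var += beta ** 2
        schedule.append((alpha, expected_var))
    return schedule


if __name__ == "__main__":
    for i, (alpha, var) in enumerate(expected_variance_schedule(4, beta=1.0)):
        print(f"block {i}: alpha={alpha:.4f}, expected variance after block={var:.1f}")
```

<p>With <code>beta=1</code> and unit initial variance, the expected variance grows linearly with depth (2, 3, 4, ...), so <code>alpha</code> shrinks as one over the square root of the running variance.</p>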

<h2 id="references">References</h2>

<p><span id="arpit16normprop">Arpit, D., Zhou, Y., Kota, B., &amp; Govindaraju, V. (2016). Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks. 
Proceedings of The 33rd International Conference on Machine Learning, 48, 1168–1176.</span> 
(<a href="https://proceedings.mlr.press/v48/arpitb16.html">link</a>,
 <a href="http://proceedings.mlr.press/v48/arpitb16.pdf">pdf</a>)</p>

<p><span id="arpit19how">Arpit, D., Campos, V., &amp; Bengio, Y. (2019). How to Initialize your Network? Robust Initialization for WeightNorm &amp; ResNets. 
Advances in Neural Information Processing Systems, 32, 10902–10911.</span>
(<a href="https://papers.nips.cc/paper/2019/hash/e520f70ac3930490458892665cda6620-Abstract.html">link</a>,
 <a href="https://papers.nips.cc/paper/2019/file/e520f70ac3930490458892665cda6620-Paper.pdf">pdf</a>)</p>

<p><span id="ba16layernorm">Ba, J. L., Kiros, J. R., &amp; Hinton, G. E. (2016). Layer Normalization [Preprint]. </span> 
(<a href="http://arxiv.org/abs/1607.06450">link</a>,
 <a href="http://arxiv.org/pdf/1607.06450.pdf">pdf</a>)</p>

<p><span id="balduzzi17shattered">Balduzzi, D., Frean, M., Leary, L., Lewis, J. P., Ma, K. W.-D., &amp; McWilliams, B. (2017). The Shattered Gradients Problem: If resnets are the answer, then what is the question? 
Proceedings of the 34th International Conference on Machine Learning, 70, 342–350.</span> 
(<a href="https://proceedings.mlr.press/v70/balduzzi17b.html">link</a>,
 <a href="http://proceedings.mlr.press/v70/balduzzi17b/balduzzi17b.pdf">pdf</a>)</p>

<p><span id="bjorck18understanding">Bjorck, N., Gomes, C. P., Selman, B., &amp; Weinberger, K. Q. (2018). Understanding Batch Normalization. 
Advances in Neural Information Processing Systems, 31, 7694–7705. </span> 
(<a href="https://proceedings.neurips.cc/paper/2018/hash/36072923bfc3cf47745d704feb489480-Abstract.html">link</a>,
 <a href="https://proceedings.neurips.cc/paper/2018/file/36072923bfc3cf47745d704feb489480-Paper.pdf">pdf</a>)</p>

<p><span id="brock21characterizing">Brock, A., De, S., &amp; Smith, S. L. (2021a). Characterizing signal propagation to close the performance gap in unnormalized ResNets. 
International Conference on Learning Representations 9.</span>
(<a href="https://openreview.net/forum?id=IX3Nnir2omJ">link</a>,
 <a href="https://openreview.net/pdf?id=IX3Nnir2omJ">pdf</a>)</p>

<p><span id="brock21highperformance">Brock, A., De, S., Smith, S. L., &amp; Simonyan, K. (2021b). High-Performance Large-Scale Image Recognition Without Normalization [Preprint].</span>
(<a href="http://arxiv.org/abs/2102.06171">link</a>,
 <a href="http://arxiv.org/pdf/2102.06171.pdf">pdf</a>)</p>

<p><span id="clevert16elu">Clevert, D.-A., Unterthiner, T., &amp; Hochreiter, S. (2016). Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). 
International Conference on Learning Representations 4.</span> 
(<a href="http://arxiv.org/abs/1511.07289">link</a>,
 <a href="http://arxiv.org/pdf/1511.07289.pdf">pdf</a>)</p>

<p><span id="de20skipinit">De, S., &amp; Smith, S. L. (2020). Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks. 
Advances in Neural Information Processing Systems, 33, 19964–19975.</span>
(<a href="https://proceedings.neurips.cc/paper/2020/hash/e6b738eca0e6792ba8a9cbcba6c1881d-Abstract.html">link</a>,
 <a href="https://proceedings.neurips.cc/paper/2020/file/e6b738eca0e6792ba8a9cbcba6c1881d-Paper.pdf">pdf</a>)</p>

<p><span id="gitman17comparison">Gitman, I., &amp; Ginsburg, B. (2017). Comparison of Batch Normalization and Weight Normalization Algorithms for the Large-scale Image Classification [Preprint]. </span> 
(<a href="http://arxiv.org/abs/1709.08145">link</a>,
 <a href="http://arxiv.org/pdf/1709.08145.pdf">pdf</a>)</p>

<p><span id="hanin18how">Hanin, B., &amp; Rolnick, D. (2018). How to Start Training: The Effect of Initialization and Architecture. 
Advances in Neural Information Processing Systems, 31, 571–581.</span>
(<a href="https://proceedings.neurips.cc/paper/2018/hash/d81f9c1be2e08964bf9f24b15f0e4900-Abstract.html">link</a>,
 <a href="https://proceedings.neurips.cc/paper/2018/file/d81f9c1be2e08964bf9f24b15f0e4900-Paper.pdf">pdf</a>)</p>

<p><span id="he15delving">He, K., Zhang, X., Ren, S., &amp; Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. 
Proceedings of the IEEE International Conference on Computer Vision, 1026–1034.</span> 
(<a href="https://doi.org/10.1109/ICCV.2015.123">link</a>,
 <a href="https://openaccess.thecvf.com/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf">pdf</a>)</p>

<p><span id="he16resnet">He, K., Zhang, X., Ren, S., &amp; Sun, J. (2016a). Deep Residual Learning for Image Recognition. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.</span> 
(<a href="https://doi.org/10.1109/CVPR.2016.90">link</a>,
 <a href="https://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf">pdf</a>)</p>

<p><span id="he16preresnet">He, K., Zhang, X., Ren, S., &amp; Sun, J. (2016b). Identity Mappings in Deep Residual Networks. 
In B. Leibe, J. Matas, N. Sebe, &amp; M. Welling (Eds.), Computer Vision – ECCV 2016 (pp. 630–645). Springer International Publishing. </span> 
(<a href="https://doi.org/10.1007/978-3-319-46493-0_38">link</a>,
 <a href="https://arxiv.org/pdf/1603.05027.pdf">pdf</a>)</p>

<p><span id="hochreiter97lstm">Hochreiter, S., &amp; Schmidhuber, J. (1997). Long Short-Term Memory. 
Neural Computation, 9(8), 1735–1780. </span> 
(<a href="https://doi.org/10.1162/neco.1997.9.8.1735">link</a>,
 <a href="https://ml.jku.at/publications/older/2604.pdf">pdf</a>)</p>

<p><span id="hoffer18norm">Hoffer, E., Banner, R., Golan, I., &amp; Soudry, D. (2018). Norm matters: Efficient and accurate normalization schemes in deep networks. 
Advances in Neural Information Processing Systems, 31, 2160–2170. </span> 
(<a href="https://proceedings.neurips.cc/paper/2018/hash/a0160709701140704575d499c997b6ca-Abstract.html">link</a>,
 <a href="https://proceedings.neurips.cc/paper/2018/file/a0160709701140704575d499c997b6ca-Paper.pdf">pdf</a>)</p>

<p><span id="huang17centred">Huang, L., Liu, X., Liu, Y., Lang, B., &amp; Tao, D. (2017). Centered Weight Normalization in Accelerating Training of Deep Neural Networks. 
Proceedings of the IEEE International Conference on Computer Vision, 2822–2830.</span> 
(<a href="https://doi.org/10.1109/ICCV.2017.305">link</a>,
 <a href="https://openaccess.thecvf.com/content_ICCV_2017/papers/Huang_Centered_Weight_Normalization_ICCV_2017_paper.pdf">pdf</a>)</p>

<p><span id="huang17densenet">Huang, G., Liu, Z., Van Der Maaten, L., &amp; Weinberger, K. Q. (2017). Densely Connected Convolutional Networks. 
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2261–2269. </span> 
(<a href="https://doi.org/10.1109/CVPR.2017.243">link</a>,
 <a href="https://openaccess.thecvf.com/content_cvpr_2017/papers/Huang_Densely_Connected_Convolutional_CVPR_2017_paper.pdf">pdf</a>)</p>

<p><span id="ioffe15batchnorm">Ioffe, S., &amp; Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. 
Proceedings of the 32nd International Conference on Machine Learning, 37, 448–456.</span> 
(<a href="http://proceedings.mlr.press/v37/ioffe15.html">link</a>,
 <a href="http://proceedings.mlr.press/v37/ioffe15.pdf">pdf</a>)</p>

<p><span id="ioffe17batchrenorm">Ioffe, S. (2017). Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models. 
Advances in Neural Information Processing Systems, 30, 1945–1953. </span> 
(<a href="https://proceedings.neurips.cc/paper/2017/hash/c54e7837e0cd0ced286cb5995327d1ab-Abstract.html">link</a>,
 <a href="https://proceedings.neurips.cc/paper/2017/file/c54e7837e0cd0ced286cb5995327d1ab-Paper.pdf">pdf</a>)</p>

<p><span id="klambauer17selfnorm">Klambauer, G., Unterthiner, T., Mayr, A., &amp; Hochreiter, S. (2017). Self-Normalizing Neural Networks. 
Advances in Neural Information Processing Systems, 30, 971–980.</span> 
(<a href="https://papers.nips.cc/paper/2017/hash/5d44ee6f2c3f71b73125876103c8f6c4-Abstract.html">link</a>,
 <a href="https://papers.nips.cc/paper/2017/file/5d44ee6f2c3f71b73125876103c8f6c4-Paper.pdf">pdf</a>)</p>

<p><span id="lecun98efficient">LeCun, Y., Bottou, L., Orr, G. B., &amp; Müller, K.-R. (1998). Efficient BackProp. 
In G. B. Orr &amp; K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade (1st ed., pp. 9–50). Springer. </span> 
(<a href="https://doi.org/10.1007/3-540-49430-8_2">link</a>,
 <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf">pdf</a>)</p>

<p><span id="li19understanding">Li, X., Chen, S., Hu, X., &amp; Yang, J. (2019). Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift. 
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2682–2690. </span> 
(<a href="https://doi.org/10.1109/CVPR.2019.00279">link</a>,
 <a href="https://openaccess.thecvf.com/content_CVPR_2019/papers/Li_Understanding_the_Disharmony_Between_Dropout_and_Batch_Normalization_by_Variance_CVPR_2019_paper.pdf">pdf</a>)</p>

<p><span id="luo19towards">Luo, P., Wang, X., Shao, W., &amp; Peng, Z. (2019). Towards Understanding Regularization in Batch Normalization. 
International Conference on Learning Representations 7.</span>
(<a href="https://openreview.net/forum?id=HJlLKjR9FQ">link</a>,
 <a href="https://openreview.net/pdf?id=HJlLKjR9FQ">pdf</a>)</p>

<p><span id="mishkin16lsuv">Mishkin, D., &amp; Matas, J. (2016). All you need is a good init. 
International Conference on Learning Representations 4.</span> 
(<a href="http://arxiv.org/abs/1511.06422">link</a>,
 <a href="http://arxiv.org/pdf/1511.06422.pdf">pdf</a>)</p>

<p><span id="miyato18spectralnorm">Miyato, T., Kataoka, T., Koyama, M., &amp; Yoshida, Y. (2018). Spectral Normalization for Generative Adversarial Networks. 
International Conference on Learning Representations 6.</span> 
(<a href="https://openreview.net/forum?id=B1QRgziT-">link</a>,
 <a href="https://openreview.net/pdf?id=B1QRgziT-">pdf</a>)</p>

<p><span id="radosavovic20regnet">Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., &amp; Dollár, P. (2020). Designing Network Design Spaces. 
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10425–10433.</span>
(<a href="https://doi.org/10.1109/CVPR42600.2020.01044">link</a>,
 <a href="https://openaccess.thecvf.com/content_CVPR_2020/papers/Radosavovic_Designing_Network_Design_Spaces_CVPR_2020_paper.pdf">pdf</a>)</p>

<p><span id="ripley96pattern">Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press. </span> 
(<a href="https://doi.org/10.1017/CBO9780511812651">link</a>)</p>

<p><span id="salimans16weightnorm">Salimans, T., &amp; Kingma, D. P. (2016). Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. 
Advances in Neural Information Processing Systems, 29, 901–909.</span> 
(<a href="https://proceedings.neurips.cc/paper/2016/hash/ed265bc903a5a097f61d3ec064d96d2e-Abstract.html">link</a>,
 <a href="https://proceedings.neurips.cc/paper/2016/file/ed265bc903a5a097f61d3ec064d96d2e-Paper.pdf">pdf</a>)</p>

<p><span id="santurkar18how">Santurkar, S., Tsipras, D., Ilyas, A., &amp; Madry, A. (2018). How Does Batch Normalization Help Optimization? 
Advances in Neural Information Processing Systems, 31, 2483–2493.</span> 
(<a href="https://proceedings.neurips.cc/paper/2018/hash/905056c1ac1dad141560467e0a99e1cf-Abstract.html">link</a>,
 <a href="https://proceedings.neurips.cc/paper/2018/file/905056c1ac1dad141560467e0a99e1cf-Paper.pdf">pdf</a>)</p>

<p><span id="schraudolph98centering">Schraudolph, N. N. (1998). Centering Neural Network Gradient Factors. 
In G. B. Orr &amp; K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade (1st ed., pp. 207–226). Springer.</span> 
(<a href="https://doi.org/10.1007/3-540-49430-8_11">link</a>,
 <a href="https://n.schraudolph.org/pubs/Schraudolph98.pdf">pdf</a>)</p>

<p><span id="shao20rescalenet">Shao, J., Hu, K., Wang, C., Xue, X., &amp; Raj, B. (2020). Is normalization indispensable for training deep neural network? 
Advances in Neural Information Processing Systems, 33, 13434–13444.</span>
(<a href="https://proceedings.neurips.cc/paper/2020/hash/9b8619251a19057cff70779273e95aa6-Abstract.html">link</a>,
 <a href="https://proceedings.neurips.cc/paper/2020/file/9b8619251a19057cff70779273e95aa6-Paper.pdf">pdf</a>)</p>

<p><span id="srivasta15highway">Srivastava, R. K., Greff, K., &amp; Schmidhuber, J. (2015). Training Very Deep Networks. 
Advances in Neural Information Processing Systems, 28, 2377–2385. </span> 
(<a href="https://papers.nips.cc/paper/2015/hash/215a71a12769b056c3c32e7299f1c5ed-Abstract.html">link</a>, 
 <a href="https://papers.nips.cc/paper/2015/file/215a71a12769b056c3c32e7299f1c5ed-Paper.pdf">pdf</a>)</p>

<p><span id="szegedy16inceptionv4">Szegedy, C., Ioffe, S., Vanhoucke, V., &amp; Alemi, A. (2016). Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning [Preprint].</span>
(<a href="http://arxiv.org/abs/1602.07261">link</a>,
 <a href="http://arxiv.org/pdf/1602.07261.pdf">pdf</a>)</p>

<p><span id="vandersmagt98solving">van der Smagt, P., &amp; Hirzinger, G. (1998). Solving the Ill-Conditioning in Neural Network Learning. 
In G. B. Orr &amp; K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade (1st ed., pp. 193–206). Springer.</span> 
(<a href="https://doi.org/10.1007/3-540-49430-8_10">link</a>)</p>

<p><span id="vaswani17attention">Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., &amp; Polosukhin, I. (2017). Attention Is All You Need. 
Advances in Neural Information Processing Systems, 30, 5998–6008.</span> 
(<a href="https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html">link</a>,
 <a href="https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf">pdf</a>)</p>

<p><span id="wadia21whitening">Wadia, N., Duckworth, D., Schoenholz, S. S., Dyer, E., &amp; Sohl-Dickstein, J. (2021). Whitening and Second Order Optimization Both Make Information in the Dataset Unusable During Training, and Can Reduce or Prevent Generalization.
Proceedings of the 38th International Conference on Machine Learning, 139, 10617–10629.</span> 
(<a href="http://proceedings.mlr.press/v139/wadia21a.html">link</a>,
 <a href="http://proceedings.mlr.press/v139/wadia21a/wadia21a.pdf">pdf</a>)</p>

<p><span id="wu18groupnorm">Wu, Y., &amp; He, K. (2018). Group Normalization. 
Computer Vision – ECCV 2018, 3–19. Springer International Publishing. </span> 
(<a href="https://doi.org/10.1007/978-3-030-01261-8_1">link</a>,
 <a href="https://openaccess.thecvf.com/content_ECCV_2018/papers/Yuxin_Wu_Group_Normalization_ECCV_2018_paper.pdf">pdf</a>)</p>

<p><span id="zhang19fixup">Zhang, H., Dauphin, Y. N., &amp; Ma, T. (2019). Fixup Initialization: Residual Learning Without Normalization. 
International Conference on Learning Representations 7. </span> 
(<a href="https://openreview.net/forum?id=H1gsz30cKX">link</a>,
 <a href="https://openreview.net/pdf?id=H1gsz30cKX">pdf</a>)</p>

</div>

<div id="bibtex-container" class="related">
  For attribution in academic contexts, please cite this work as
  <pre id="bibtex-academic-attribution">

  </pre>

  BibTeX citation
  <pre id="bibtex-box">

  </pre>
</div>
<script>
  let authorsSpan = document.getElementById("iclr-post-authors");
  let authorsText = authorsSpan.textContent;
  let lnameFnameInstitution = authorsText.split(";");
  let lfiList = lnameFnameInstitution.map(lfi => lfi.split(",").map(item => item.trim()));
  let bibtexLFI = lfiList.map(lfi => lfi[0] + ", " + lfi[1]).join(" and ")
  let academicLFI = lfiList.map(lfi => lfi[0]);
  {
    if(academicLFI.length > 2) academicLFI = academicLFI[0] + ", et al.";
    else if(academicLFI.length == 2) academicLFI = academicLFI[0] + " & " + academicLFI[1];
    else academicLFI = academicLFI[0];
  }

  let titleSpan = document.getElementById("iclr-post-title");
  let titleText = titleSpan.textContent.trim();
  let bibtexTitleShorthand = (lfiList[0][1]+
    "2022"+
    titleText.split(" ").slice(0, 3).join("")
  ).replace(/\s+/g, "").replace(/[\p{P}$+<=>^`|~]/gu, '').toLowerCase().trim();

  let bibtexTemplate = `
@inproceedings{${bibtexTitleShorthand},
  author = {${bibtexLFI}},
  title = {${titleText}},
  booktitle = {ICLR Blog Track},
  year = {2022},
  note = {${window.location.href}},
  url  = {${window.location.href}}
}
  `.trim();
  document.getElementById("bibtex-box").innerText = bibtexTemplate;

  let academicTemplate = `
${academicLFI}, "${titleText}", ICLR Blog Track, 2022.
`.trim();
  document.getElementById("bibtex-academic-attribution").innerText = academicTemplate;

</script>






      </div>
    </div>

    <label for="sidebar-checkbox" class="sidebar-toggle"></label>

    <script src='/public/js/script.js'></script>
  </body>
</html>
