<!DOCTYPE html>
<html lang="en-us">

  <head>
  <link href="http://gmpg.org/xfn/11" rel="profile">
  <meta http-equiv="content-type" content="text/html; charset=utf-8">

  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1">

  <title>
    
      Why Are Kronecker Products So Effective? &middot; The ICLR Blog Track
    
  </title>

  
  <link rel="canonical" href="https://iclr.iro.umontreal.ca/7314dc03-8fcd-4efa-a53a-6a955885763f_1641068387/2021/12/01/kronecker-effective/">
  

  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/7314dc03-8fcd-4efa-a53a-6a955885763f_1641068387/public/css/poole.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/7314dc03-8fcd-4efa-a53a-6a955885763f_1641068387/public/css/syntax.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/7314dc03-8fcd-4efa-a53a-6a955885763f_1641068387/public/css/lanyon.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/7314dc03-8fcd-4efa-a53a-6a955885763f_1641068387/public/css/custom.css">
  <link rel="stylesheet" href="https://fonts.googleapis.com/css?family=PT+Serif:400,400italic,700%7CPT+Sans:400">

  <link rel="apple-touch-icon-precomposed" sizes="144x144" href="https://iclr.iro.umontreal.ca/7314dc03-8fcd-4efa-a53a-6a955885763f_1641068387/public/apple-touch-icon-precomposed.png">
  <link rel="shortcut icon" href="https://iclr.iro.umontreal.ca/7314dc03-8fcd-4efa-a53a-6a955885763f_1641068387/public/favicon.ico">

  <link rel="alternate" type="application/rss+xml" title="RSS" href="https://iclr.iro.umontreal.ca/7314dc03-8fcd-4efa-a53a-6a955885763f_1641068387/atom.xml">

  

  <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript" ></script>
 <!-- <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { equationNumbers: { autoNumber: "AMS" } } }); </script> -->
  <script type="text/x-mathjax-config">
      MathJax.Hub.Config({
        tex2jax: { inlineMath: [ ['$','$'], ["\\(","\\)"] ],
         processEscapes: false
        }
      });
</script>
</head>


  <body>

    <!-- Target for toggling the sidebar `.sidebar-checkbox` is for regular
     styles, `#sidebar-checkbox` for behavior. -->
<input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox">
<!-- <input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox" > -->

<!-- Toggleable sidebar -->
<div class="sidebar" id="sidebar">
  <div class="sidebar-item">
    <p>For short-term, peer-sourced tests of time, generalizations, specializations, reproductions, etc.!</p>
  </div>

  <nav class="sidebar-nav">

    

    
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/7314dc03-8fcd-4efa-a53a-6a955885763f_1641068387/">ICLR 2022 Blog Track</a>
        
      
    
      
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/7314dc03-8fcd-4efa-a53a-6a955885763f_1641068387/about/">About</a>
        
      
    
      
    
      
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/7314dc03-8fcd-4efa-a53a-6a955885763f_1641068387/submitting/">Submitting</a>
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/7314dc03-8fcd-4efa-a53a-6a955885763f_1641068387/tags/">Tags</a>
        
      
    

    <a class="sidebar-nav-item" href="https://github.com/iclr-blog-track/iclr-blog-track.github.io">GitHub project</a>
    <span class="sidebar-nav-item">Currently vICLR Spring 2021</span>
  </nav>

  <div class="sidebar-item">
    <p>
      &copy; 2022. All rights reserved.
    </p>
  </div>
</div>


    <!-- Wrap is the content to shift when toggling the sidebar. We wrap the
         content to avoid any CSS collisions with our real content. -->
    <div class="wrap">
      <div class="masthead">
        <div class="container">
          <h3 class="masthead-title">
            <a href="/" title="Home">The ICLR Blog Track</a>
            <small></small>
          </h3>
        </div>
      </div>

      <div class="container content">
        <div class="post">
  <h1 id="iclr-post-title" class="post-title">Why Are Kronecker Products So Effective?</h1>
  <span class="post-date">01 Dec 2021 | 
    <a class="content-tag" href="/tags/#kronecker"> kronecker </a>
  
    <a class="content-tag" href="/tags/#quaternion"> quaternion </a>
  
    <a class="content-tag" href="/tags/#parameter-efficient"> parameter efficient </a>
  
    <a class="content-tag" href="/tags/#tensor-decomposition"> tensor decomposition </a>
  
    <a class="content-tag" href="/tags/#svd"> SVD </a>
  </span>

  <span id="iclr-post-authors" class="post-date"></span>
<p>As soon as the <a href="https://iclr-conf.medium.com/announcing-iclr-2021-outstanding-paper-awards-9ae0514734ab">ICLR 2021 Outstanding Paper Awards</a> were announced, one paper on the list immediately caught my attention: “<a href="https://openreview.net/forum?id=rcQdycl0zyk">Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with \(1/n\) Parameters</a>”. Although the title might seem a bit daunting to those unfamiliar with quaternion algebra, the authors provide enough context to understand the core components of their approach and how they derive a new layer construction that generalizes the inductive biases behind quaternion-based Neural Networks, reducing the parameter count of Fully-Connected layers to roughly \(1/n\) in Natural Language applications.</p>

<p>While it is not mentioned in the paper, it turns out that the effectiveness of the proposed layer scheme can also be explained using a much more widely known concept: the Singular Value Decomposition (SVD). In this blog post, we will review the parametrization proposed by the authors, which links quaternion-based Neural Networks with the Kronecker Product, and then explain how the Kronecker Product connects this work to a parallel line of research into parameter-efficient Neural Networks for Computer Vision based on SVD.</p>

<h3 id="the-phm-layer">The PHM Layer</h3>

<p>The main contribution of the paper is the “parameterized
hypercomplex multiplication (PHM) layer”, a new layer that can replace fully connected (FC) layers with high parameter efficiency.</p>

<p>Instead of having a normal FC layer like this:</p>

\[\mathbf{y} = \mathrm{FC}(\mathbf{x}) = \mathbf{W}\mathbf{x} + \mathbf{b}\]

<p>We would have a PHM layer:</p>

\[\mathbf{y} = \mathrm{PHM}(\mathbf{x}) = \mathbf{H}\mathbf{x} + \mathbf{b}\]

<p>For both layers we are learning a linear mapping (\(\bf{W}\) or \(\bf{H} \in \mathbb{R}^{k \times d}\)) of the input \(\bf{x}\).</p>

<p>To have a clear understanding of what the proposed layer does, the authors introduce the Kronecker Product. For matrices \(\bf{A} \in \mathbb{R}^{m \times n}\) and \(\bf{B} \in \mathbb{R}^{p \times q}\), the Kronecker Product \(\otimes\) is defined as:</p>

\[\begin{align*}
\bf{A} \otimes \bf{B} = \begin{bmatrix}
    a_{11}\bf{B} &amp; \dots  &amp; a_{1n}\bf{B} \\
    \vdots &amp; \ddots &amp; \vdots \\
    a_{m1}\bf{B} &amp; \dots  &amp; a_{mn}\bf{B}
    \end{bmatrix} \in \mathbb{R}^{mp \times nq}
\end{align*}\]
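<p>As a quick illustration of this definition, here is a minimal sketch using NumPy's <code>np.kron</code>. The matrices <code>A</code> and <code>B</code> are arbitrary values picked for this example and do not come from the paper.</p>

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])          # shape (m, n) = (2, 2)
B = np.array([[0, 1, 0],
              [1, 0, 1]])       # shape (p, q) = (2, 3)

K = np.kron(A, B)               # shape (m*p, n*q) = (4, 6)

# Block (i, j) of K equals A[i, j] * B; check the top-right block (i=0, j=1).
p, q = B.shape
top_right_block = K[0:p, q:2 * q]
print(K.shape)
print(np.array_equal(top_right_block, A[0, 1] * B))
```

<p>Each entry of <code>A</code> scales a full copy of <code>B</code>, which is why the result is naturally read as a block matrix.</p>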

<p>The result of applying the Kronecker Product to two matrices is another matrix (a block matrix). Assuming that both dimensions \(k\) and \(d\) are divisible by a user-selected positive integer \(n\), the matrix \(\bf{H}\) from the PHM layer can now be defined:</p>

\[\begin{align}
\bf{H} = \sum_{i=1}^n \bf{A_i} \otimes \bf{S_i}
\end{align}\]

<p>Where \(\bf{A_i} \in \mathbb{R}^{n \times n}\) and \(\bf{S_i} \in \mathbb{R}^{\frac{k}{n} \times \frac{d}{n}}\).</p>

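<p>This construction makes \(\bf{H}\) very efficient in terms of parameter count: the \(n\) matrices \(\bf{A_i}\) contribute \(n^3\) parameters and the \(n\) matrices \(\bf{S_i}\) contribute \(kd/n\), for a total of \(n^3 + kd/n\). This is approximately \(1/n\) of the \(kd\) parameters of an FC layer matrix \(\bf{W}\), assuming that \(kd &gt; n^4\), which is the case for the high-dimensional latent spaces found in practice.</p>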
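<p>To make the parameter count concrete, the following is a minimal NumPy sketch of how \(\bf{H}\) could be assembled. The helper name <code>phm_matrix</code> and the sizes \(k = d = 512\), \(n = 4\) are assumptions chosen for illustration, not the authors' implementation.</p>

```python
import numpy as np

def phm_matrix(A, S):
    """Build H = sum_i kron(A[i], S[i]).

    A has shape (n, n, n) and S has shape (n, k // n, d // n).
    """
    return sum(np.kron(A_i, S_i) for A_i, S_i in zip(A, S))

k, d, n = 512, 512, 4           # hypothetical sizes; n must divide k and d
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n, n))
S = rng.standard_normal((n, k // n, d // n))

H = phm_matrix(A, S)            # same shape as the FC weight matrix W

fc_params = k * d               # parameters of a plain FC weight matrix
phm_params = A.size + S.size    # n**3 + k*d/n
print(H.shape, fc_params, phm_params, phm_params / fc_params)
```

<p>With these sizes the \(n^3\) term is negligible next to \(kd/n\), so the ratio of parameters stays very close to \(1/n\).</p>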

<p>One of the first things I questioned after seeing this equation was the restriction of \(\bf{A_i}\) to be a square matrix. The authors provide an intuitive explanation from the point of view of quaternion multiplication (hence the name “hypercomplex multiplication”, as the quaternion number system is a kind of <a href="https://en.wikipedia.org/wiki/Hypercomplex_number">hypercomplex number system</a>). A nice property of quaternion multiplication (called the Hamilton Product) is that it can be written as the following matrix-vector product:</p>

\[\begin{align}
\begin{bmatrix}
    Q_r &amp; -Q_x &amp; -Q_y &amp; -Q_z \\
    Q_x &amp; Q_r &amp; -Q_z &amp; Q_y \\
    Q_y &amp; Q_z &amp; Q_r &amp; -Q_x \\
    Q_z &amp; -Q_y &amp; Q_x &amp; Q_r \\
\end{bmatrix}
\begin{bmatrix}
    P_r \\
    P_x\\
    P_y\\
    P_z \\
\end{bmatrix},
\end{align}\]

<p>where the subscripts \(r, x, y, z\) denote the components of each quaternion along the unit basis \(1, i, j, k\).</p>

<p>Hamilton Products of this form are the building block of quaternion rotations of a 3-dimensional vector, which makes this matrix very useful as an inductive bias for learning rotations inside Neural Networks (an experiment demonstrated in the paper). However, in this form it is not very useful for other dimensions, so the authors propose to reformulate it as a sum of Kronecker Products:</p>

<div style="overflow-x: scroll">
$$
\begin{align}
\label{eq:ASQ_kron}
\underbrace{
\begin{bmatrix}
    1 &amp; 0 &amp; 0 &amp; 0 \\
    0 &amp; 1 &amp; 0 &amp; 0 \\
    0 &amp; 0 &amp; 1 &amp; 0 \\
    0 &amp; 0 &amp; 0 &amp; 1 \\
\end{bmatrix}
}_{\bf{A_1}}
\otimes
\underbrace{
\begin{bmatrix}
    Q_r \\
\end{bmatrix}
}_{\bf{S_1}}
+
\underbrace{
    \begin{bmatrix}
    0 &amp; -1 &amp; 0 &amp; 0 \\
    1 &amp; 0 &amp; 0 &amp; 0 \\
    0 &amp; 0 &amp; 0 &amp; -1 \\
    0 &amp; 0 &amp; 1 &amp; 0 \\
\end{bmatrix}
}_{\bf{A_2}}
\otimes
\underbrace{
\begin{bmatrix}
    Q_x \\
\end{bmatrix}
}_{\bf{S_2}}
+
\underbrace{
    \begin{bmatrix}
    0 &amp; 0 &amp; -1 &amp; 0 \\
    0 &amp; 0 &amp; 0 &amp; 1 \\
    1 &amp; 0 &amp; 0 &amp; 0 \\
    0 &amp; -1 &amp; 0 &amp; 0 \\
\end{bmatrix}
}_{\bf{A_3}}
\otimes
\underbrace{
\begin{bmatrix}
    Q_y \\
\end{bmatrix}
}_{\bf{S_3}}
+
\underbrace{
    \begin{bmatrix}
    0 &amp; 0 &amp; 0 &amp; -1 \\
    0 &amp; 0 &amp; -1 &amp; 0 \\
    0 &amp; 1 &amp; 0 &amp; 0 \\
    1 &amp; 0 &amp; 0 &amp; 0 \\
\end{bmatrix}
}_{\bf{A_4}}
\otimes
\underbrace{
\begin{bmatrix}
    Q_z \\
\end{bmatrix}
}_{\bf{S_4}}
.
\end{align}
$$
</div>

<p>As can be seen, with \(\bf{A_i} \in \mathbb{R}^{4 \times 4}\) and \(\bf{S_i} \in \mathbb{R}^{\frac{4}{4} \times \frac{4}{4}}\), the sum of Kronecker Products reproduces the Hamilton Product matrix above, which demonstrates that a PHM layer with \(n=4\) can learn quaternion multiplication. Given that the same result holds for \(8D\) (octonions) and \(16D\) (sedenions), and that \(n\) can take more values than just \(\{4, 8, 16\}\), the PHM layer is said to generalize hypercomplex multiplication to \(nD\).</p>
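<p>This equivalence is easy to check numerically. The sketch below rebuilds the Hamilton Product matrix from the four \(\bf{A_i}\) matrices shown above; the helper name <code>hamilton_matrix</code> and the example quaternions <code>q</code> and <code>p</code> are arbitrary choices made for this illustration.</p>

```python
import numpy as np

def hamilton_matrix(q):
    """Left-multiplication matrix of the quaternion q = (r, x, y, z)."""
    r, x, y, z = q
    return np.array([[r, -x, -y, -z],
                     [x,  r, -z,  y],
                     [y,  z,  r, -x],
                     [z, -y,  x,  r]])

# The fixed matrices A_1..A_4 from the sum of Kronecker Products.
A = np.array([
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
    [[0, -1, 0, 0], [1, 0, 0, 0], [0, 0, 0, -1], [0, 0, 1, 0]],
    [[0, 0, -1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, -1, 0, 0]],
    [[0, 0, 0, -1], [0, 0, -1, 0], [0, 1, 0, 0], [1, 0, 0, 0]],
], dtype=float)

q = np.array([0.5, -1.0, 2.0, 3.0])   # arbitrary example quaternion Q
p = np.array([1.0, 0.0, -2.0, 4.0])   # arbitrary example quaternion P

# Each S_i is the 1x1 matrix holding one component of Q.
H = sum(np.kron(A_i, np.array([[q_i]])) for A_i, q_i in zip(A, q))

print(np.allclose(H, hamilton_matrix(q)))   # the two constructions agree
print(H @ p)                                # components of the product QP
```

<p>In a PHM layer the \(\bf{A_i}\) are learned rather than fixed, so the quaternion rule is one point in the space of multiplication rules the layer can represent.</p>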

<p>To close this section about the PHM layer, I want to show the results the authors obtained by applying the layer to machine translation, where it greatly improves parameter efficiency without sacrificing much performance:</p>

<p align="center">
  <img width="60%" src="/public/images/2021-12-01-kronecker-effective/phm_transformer_results.jpeg" />
</p>

<h3 id="why-are-kronecker-products-effective-then">Why are Kronecker Products effective then?</h3>

<p>This paper caught my attention because I had read about the Kronecker Product being used in a similar manner for Convolutional Neural Networks, in particular in a 2015 paper called <a href="https://arxiv.org/abs/1512.09194"><strong>Exploiting Local Structures with the Kronecker Layer in Convolutional Networks</strong></a>.</p>

<p>In this paper two new types of layers are proposed. First the Kronecker Fully-Connected (KFC) layer:</p>

\[\begin{align}
\mathbf{L}_{l+1} = f\left(\left(\sum_{i=1}^{r} \mathbf{A_i} \otimes \mathbf{B_i}\right)\mathbf{L}_{l} + \mathbf{b}_{l}\right),
\end{align}\]

<p>where \(l\) indexes the layer, \(\bf{A_i} \in \mathbb{R}^{m^{(i)}\times n^{(i)}}\) and \(\bf{B_i} \in \mathbb{R}^{\frac{k}{m^{(i)}} \times \frac{d}{n^{(i)}}}\).</p>
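<p>The following is a minimal sketch of such a layer, meant only to show that the factors need not be square and that each term may use different factor shapes. The function name <code>kfc_forward</code>, the sizes, and the choice of <code>tanh</code> as the nonlinearity \(f\) are assumptions for this example, not details taken from the KFC paper.</p>

```python
import numpy as np

def kfc_forward(x, A_list, B_list, b, f=np.tanh):
    """Apply f((sum_i kron(A_i, B_i)) x + b) to an input vector x."""
    W = sum(np.kron(A_i, B_i) for A_i, B_i in zip(A_list, B_list))
    return f(W @ x + b)

k, d = 6, 8                      # hypothetical output and input sizes
rng = np.random.default_rng(0)

# Two terms (r = 2) with different, non-square factor shapes (m_i, n_i);
# each m_i must divide k and each n_i must divide d.
A_shapes = [(2, 4), (3, 2)]
A_list = [rng.standard_normal(s) for s in A_shapes]
B_list = [rng.standard_normal((k // m, d // n)) for m, n in A_shapes]

x = rng.standard_normal(d)
b = np.zeros(k)
y = kfc_forward(x, A_list, B_list, b)
print(y.shape)
```

<p>Every Kronecker Product in the sum has the full \(k \times d\) shape, so terms with different factorizations can be added freely.</p>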

<p>The second is its generalization, the Kronecker Convolutional (KConv) layer, which approximates a convolutional kernel as follows:</p>

\[\begin{align}
\mathcal{W} \approx \sum_{i=1}^{r}\mathcal{A}_i\otimes\mathcal{B}_i,
\end{align}\]

<p>where \(\mathcal{A}_i\) and \(\mathcal{B}_i\) are \(4D\) tensors with shape constraints similar to those of the previous cases. Also note that there is no restriction for either \(\bf{A_i}\) or \(\mathcal{A}_i\) to be square, or for the number of summed products \(r\) to be equal to \(n\).</p>

<p>In contrast to PHM, the authors of the KConv paper arrive at the sum of Kronecker Products not by construction, but by improving on the ideas about low-rank decomposition of Convolutional Neural Networks proposed in <a href="https://arxiv.org/abs/1405.3866"><strong>Speeding up convolutional Neural Networks with low rank expansions</strong></a> and <a href="https://arxiv.org/abs/1404.0736"><strong>Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation</strong></a>, amongst others.</p>

<p>Particular emphasis should be placed on their use of the duality between approximating weight matrices with a sum of Kronecker Products and SVD. This duality is demonstrated in Section 5.5 of the paper <a href="https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.1924&amp;rep=rep1&amp;type=pdf"><strong>Approximation with Kronecker Products</strong></a>, and it is crucial, as it hints at why such parametrizations work in practice:</p>

<p><em>“Next, we consider the situation when the matrix A to be approximated is a sum of Kronecker products:</em></p>

\[A=\sum_{i=1}^{p}\left(G_{i} \otimes F_{i}\right) .\]

<p><em>Assume that each \(G_i \in \mathbb{R}^{m_1 \times n_1}\) and each \(F_i\in \mathbb{R}^{m_2 \times n_2}\). It follows that if \(f_i = \mathrm{vec}(F_i)\) and \(g_i = \mathrm{vec}(G_i)\), then:</em></p>

\[\tilde{A}=\mathcal{R}(A)=\sum_{i=1}^{p} \mathcal{R}\left(G_{i} \otimes F_{i}\right)=\sum_{i=1}^{p} g_{i} f_{i}^{T}\]

<p><em>is a rank-\(p\) matrix.”</em></p>

<p>While explaining the rearrangement operation \(\mathcal{R}(A)\) is beyond the scope of this post (I highly encourage you to read the paper), this result shows that approximating a matrix \(A\) with a sum of \(p\) Kronecker Products is equivalent to computing a rank-\(p\) approximation, via SVD, of a rearranged version of \(A\).</p>
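<p>For readers who want to experiment, the claim can be checked numerically. The sketch below is based on my reading of the rearrangement in Van Loan and Pitsianis, where each row of \(\mathcal{R}(A)\) is the column-major vectorization of one \(m_2 \times n_2\) block of \(A\) and blocks are visited column by column. The helper names <code>rearrange</code> and <code>vec</code> and the sizes are assumptions made for this example.</p>

```python
import numpy as np

def rearrange(A, m1, n1, m2, n2):
    """R(A): one row per (m2 x n2) block of A, blocks taken column by column,
    each block flattened column-major (vec)."""
    blocks = A.reshape(m1, m2, n1, n2)      # blocks[i, :, j, :] is block (i, j)
    return blocks.transpose(2, 0, 3, 1).reshape(m1 * n1, m2 * n2)

def vec(M):
    """Stack the columns of M into a single vector."""
    return M.flatten(order="F")

m1, n1, m2, n2, p = 3, 2, 2, 4, 2           # hypothetical sizes, p terms
rng = np.random.default_rng(0)
G = rng.standard_normal((p, m1, n1))
F = rng.standard_normal((p, m2, n2))
A = sum(np.kron(G_i, F_i) for G_i, F_i in zip(G, F))

R = rearrange(A, m1, n1, m2, n2)
outer_sum = sum(np.outer(vec(G_i), vec(F_i)) for G_i, F_i in zip(G, F))
print(np.allclose(R, outer_sum))            # R(A) = sum_i g_i f_i^T
print(np.linalg.matrix_rank(R))             # equals p for generic factors

# Conversely, the top p singular triplets of R(A) give p Kronecker factors.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
A_rebuilt = sum(
    np.kron((s[i] * U[:, i]).reshape(m1, n1, order="F"),
            Vt[i].reshape(m2, n2, order="F"))
    for i in range(p)
)
print(np.allclose(A, A_rebuilt))
```

<p>Keeping fewer than \(p\) singular triplets in the last step gives a shorter sum of Kronecker Products that approximates \(A\), which is the sense in which the two problems are equivalent.</p>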

<p>As is the case with many signals in the real world, the intrinsic dimensionality of the transformer weights in the PHM paper is likely to be small. As such, low-rank approximations might be able to capture most of the model behavior with few parameters, explaining the efficiency of the Kronecker Product approach.</p>

<h3 id="final-thoughts">Final thoughts</h3>

<p>I first came across the Kronecker Product back in 2018, when I worked on a university course presentation about incorporating large-scale context in Neural Networks. Although I started by looking at what was novel at the time (<a href="https://arxiv.org/abs/1703.06211"><strong>Deformable CNNs</strong></a> and the Atrous Spatial Pyramid Pooling scheme in <a href="https://arxiv.org/abs/1706.05587"><strong>Deeplab</strong></a>), it wasn’t until I found a great blog post by Ferenc Huszár outlining the relationship between <a href="https://www.inference.vc/dilated-convolutions-and-kronecker-factorisation/">Dilated Convolutions and Kronecker Factored Convolutions</a> that I became captivated by the subject.</p>

<p>As it turns out, there has been a long chain of papers on making Neural Networks efficient, all with different takes on which is the best way to do Matrix (or Tensor) Decomposition. From the ideas that influenced the development of the KConv layers to the novel connection with hypercomplex multiplication proposed with the PHM layer, I have become convinced that the Kronecker Product, and the inductive biases it can encode, will be a crucial tool in the path to understanding Neural Networks.</p>

<h2 id="references">References</h2>

<ul>
  <li>Zhang, Aston, et al. “Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with \(1/n\) Parameters.” arXiv preprint <a href="https://openreview.net/forum?id=rcQdycl0zyk">arXiv:2102.08597</a> (2021).</li>
  <li>Zhou, Shuchang, et al. “Exploiting local structures with the kronecker layer in convolutional networks.” arXiv preprint <a href="https://arxiv.org/abs/1512.09194">arXiv:1512.09194</a> (2015).</li>
  <li>Jaderberg, Max, Andrea Vedaldi, and Andrew Zisserman. “Speeding up convolutional Neural Networks with low rank expansions.” arXiv preprint <a href="https://arxiv.org/abs/1405.3866">arXiv:1405.3866</a> (2014).</li>
  <li>Denton, Emily, et al. “Exploiting linear structure within convolutional networks for efficient evaluation.” arXiv preprint <a href="https://arxiv.org/abs/1404.0736">arXiv:1404.0736</a> (2014).</li>
  <li>Van Loan, Charles F., and Nikos Pitsianis. <a href="https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.1924&amp;rep=rep1&amp;type=pdf">“Approximation with Kronecker products.”</a> Linear algebra for large scale and real-time applications. Springer, Dordrecht, 1993. 293-314.</li>
  <li>Dai, Jifeng, et al. <a href="https://arxiv.org/abs/1703.06211">“Deformable convolutional networks.”</a> Proceedings of the IEEE international conference on computer vision. 2017.</li>
  <li>Chen, Liang-Chieh, et al. “Rethinking atrous convolution for semantic image segmentation.” arXiv preprint <a href="https://arxiv.org/abs/1706.05587">arXiv:1706.05587</a> (2017).</li>
  <li>Huszár, Ferenc. “Dilated Convolutions and Kronecker Factored Convolutions.” <a href="https://www.inference.vc/dilated-convolutions-and-kronecker-factorisation/">https://www.inference.vc/dilated-convolutions-and-kronecker-factorisation/</a></li>
</ul>

</div>

<div id="bibtex-container" class="related">
  For attribution in academic contexts, please cite this work as
  <pre id="bibtex-academic-attribution">

  </pre>

  BibTeX citation
  <pre id="bibtex-box">

  </pre>
</div>
<script>
  let authorsSpan = document.getElementById("iclr-post-authors");
  let authorsText = authorsSpan.textContent;
  let lnameFnameInstitution = authorsText.split(";");
  let lfiList = lnameFnameInstitution.map(lfi => lfi.split(",").map(item => item.trim()));
  let bibtexLFI = lfiList.map(lfi => lfi[0] + ", " + lfi[1]).join(" and ")
  let academicLFI = lfiList.map(lfi => lfi[0]);
  {
    if(academicLFI.length > 2) academicLFI = academicLFI[0] + ", et al.";
    else if(academicLFI.length == 2) academicLFI = academicLFI[0] + " & " + academicLFI[1];
    else academicLFI = academicLFI[0];
  }

  let titleSpan = document.getElementById("iclr-post-title");
  let titleText = titleSpan.textContent.trim();
  let bibtexTitleShorthand = (lfiList[0][1]+
    "2022"+
    titleText.split(" ").slice(0, 3).join("")
  ).replace(/\s+/g, "").replace(/[\p{P}$+<=>^`|~]/gu, '').toLowerCase().trim();

  let bibtexTemplate = `
@inproceedings{${bibtexTitleShorthand},
  author = {${bibtexLFI}},
  title = {${titleText}},
  booktitle = {ICLR Blog Track},
  year = {2022},
  note = {${window.location.href}},
  url  = {${window.location.href}}
}
  `.trim();
  document.getElementById("bibtex-box").innerText = bibtexTemplate;

  let academicTemplate = `
${academicLFI}, "${titleText}", ICLR Blog Track, 2022.
`.trim();
  document.getElementById("bibtex-academic-attribution").innerText = academicTemplate;

</script>




<script src="https://utteranc.es/client.js"
        repo="iclr-blog-track/iclr-blog-track.github.io"
        issue-term="pathname"
        label="utterance"
        theme="boxy-light"
        crossorigin="anonymous"
        >
</script>


      </div>
    </div>

    <label for="sidebar-checkbox" class="sidebar-toggle"></label>

    <script src='/public/js/script.js'></script>
  </body>
</html>
