Contrasting LMU with LSTM
01 Sep 2021 | machine-learning

Both the Hidden Markov Model (HMM) and the Recurrent Neural Network (RNN) suffer from vanishing transitions and (vanishing & exploding) gradient problems. The LSTM maintains long time-range dependencies on a sequence task. However, information flow in the network tends to saturate once the number of time steps exceeds a few thousand. The Legendre Memory Unit (LMU) is a revolutionary evolution of the RNN design that can conveniently handle extremely long-range dependencies. In this blog, let’s try to figure out why the LMU exceeds the performance of the LSTM. We proceed by reviewing the paper that introduced the LMU. This article is a summary of the paper titled Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks, published in NeurIPS 2019. Information about the paper is as follows:
- Authors: Aaron R. Voelker, Ivana Kajic, Chris Eliasmith
- Source code: Keras implementation of LMU.
- Video: video link
We will not rehash the description of LSTM in this blog as Chris Olah has already given an excellent explanation in his blog post titled Understanding LSTM Networks. We provide a quick summary of his blog to set the stage for understanding LMU.
What is a Legendre transform, and how does it relate to the popular Fourier transform?
The Legendre transformation is a self-inverse transformation defined on convex functions: it maps a convex function to its conjugate. For example, in classical mechanics it pairs conjugate quantities such as velocity and momentum, relating the Lagrangian to the Hamiltonian. There exists a generalization of the Legendre transformation to non-convex functions. The Fourier transformation converts a quantity from the time domain into the frequency domain; conversely, the inverse Fourier transform converts from the frequency domain back into the time domain. The Legendre transformation models a problem as a convex function described by its supporting hyperplanes. This contrasts with the Fourier transform, where any function is modeled as a weighted combination of sine (cosine) functions. The conjugate quantity in the Legendre transformation can be likened to the orthonormal basis in the Fourier transformation. Both transformations have wide applicability in several applied domains; see the linked reference for more information on the Legendre transformation.
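As a concrete illustration (a standard textbook computation, not taken from the LMU paper), consider the Legendre transform of the convex function $f(x) = \frac{1}{2}x^{2}$. The transform is defined as \begin{equation} f^{*}(p) = \sup_{x}\left[ p x - f(x) \right] \end{equation} The supremum is attained where $p = f'(x) = x$, so \begin{equation} f^{*}(p) = p \cdot p - \frac{1}{2}p^{2} = \frac{1}{2}p^{2} \end{equation} Here the conjugate variable $p$ plays the role of momentum when $f$ is a kinetic-energy-like function of velocity, and applying the transform twice recovers $f$, illustrating the self-inverse property.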
LSTM
The LSTM depends on a mix of gating mechanisms and non-linearities to minimize the vanishing and exploding gradient problems that occur in vanilla RNNs. Some of the vanishing gradient problems are due to saturation. Saturation leads to a loss of performance in the LSTM, and as a result, several variants of the LSTM have been proposed in the literature to minimize its effect. The reliance on squashing functions (such as sigmoid and tanh, chosen for their gating effects) increases the chance of saturation, inadvertently limiting the ability to capture long time-range dependencies in a sequence, as the short sketch below illustrates. The LSTM also has its weights initialized to random values; unfortunately, this choice can have a significant impact on the quality of the optimization.
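To make the saturation point concrete, here is a minimal NumPy sketch (not from the paper; the pre-activation values are arbitrary) showing that the derivatives of the sigmoid and tanh squashing functions collapse toward zero as their inputs grow in magnitude, which is what starves gradient flow through the gates over many time steps:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Derivatives of the two squashing functions used by LSTM gates.
def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2

# Arbitrary pre-activation values: moderate vs. large magnitude.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid'={d_sigmoid(x):.2e}  tanh'={d_tanh(x):.2e}")

# As |x| grows, both derivatives shrink toward zero (e.g. sigmoid'(10) is ~4.5e-05),
# so repeated multiplication of such factors across time steps vanishes the gradient.
```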
LMU
The LMU makes use of fewer computational resources to maintain long time-range dependencies by decomposing the time history into $d$ coupled Ordinary Differential Equations (ODEs), where the sliding window is represented using Legendre polynomials of degree at most $d-1$. The LMU can handle extremely long time-range dependencies using fewer internal states to conveniently capture the dynamics of the sequence over long time intervals.

Let us rehash the mathematical formulation of the LMU and use this knowledge to understand why the LMU is superior in performance to the LSTM. The cell begins with $F(s)$ as shown in the LMU paper, where $\theta$ is a strictly positive scalar value that represents the size of the window.

\begin{equation} F(s) = e^{-\theta s} \end{equation}

Taking the natural log of both sides results in

\begin{equation} \ln F(s) = \ln e^{-\theta s} \end{equation}

\begin{equation} \ln F(s) = -\theta s \end{equation}

The paper represents the window of history with a state vector $m(t) \in R^{d}$, where $\hat{m}(t)$ denotes the updated value of the state vector and $m(t)$ its current value. Equation 1 of the paper gives $\theta \hat{m}(t) = Am(t) + Bu(t)$. This is a common formulation in dynamic system modeling, hence the link to this blog that discusses Kalman filters. In pursuance of the desired long-range dependency, it is desirable to initialize $A, B$ using Equation 2 in the paper, as this routine has sound theoretical underpinnings. The mathematical origins of the routine follow from Padé approximants and Legendre polynomials. This scheme results in an architecture that is less likely to saturate, as the long time-range dependency is preserved even with smaller time steps.

Using the mathematical formulation in the paper alone, let us write simplistic pseudo-code to describe the LMU algorithm in use (a NumPy sketch follows the pseudo-code). Without loss of generality, let us assume a batch size of 1 and the number of iterations set to 1.
- Initialize A, B, m(t)
- Run the code block every epoch
- for $t$ in $\theta \dots n$
  - for $\hat{\theta}$ in $t-\theta \dots t$ # handle boundaries
    - $u(t - \hat{\theta}) \approx \sum_{i=0}^{d-1} P_{i}\left( \frac{\hat{\theta}}{\theta}\right) m_{i}(t)$, using $P_{i}$ as one of the Legendre polynomial basis functions from Equation 3 in the paper.
  - $\hat{m}(t) = \frac{Am(t) + Bu(t)}{\theta}$
  - $m(t) = \hat{m}(t)$ # update the state vector
- update $A, B$ until convergence
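To make the update loop concrete, here is a minimal NumPy sketch based on my reading of Equations 1–3 of the paper: the entries of $A$ and $B$ follow Equation 2 ($a_{ij} = (2i+1)\cdot(-1)$ if $i < j$, else $(2i+1)\cdot(-1)^{i-j+1}$, and $b_{i} = (2i+1)(-1)^{i}$), the state is advanced with a forward-Euler step of Equation 1, and Equation 3 decodes a delayed input using shifted Legendre polynomials. The choices of $d$, $\theta$, the step size, and the toy input signal are illustrative, not from the paper.

```python
import numpy as np
from scipy.special import eval_legendre

d = 8          # size of the state vector m(t); illustrative choice
theta = 100.0  # sliding-window length, in time steps here
dt = 1.0       # forward-Euler step size

# Build A and B from Equation 2 of the paper.
A = np.zeros((d, d))
B = np.zeros((d, 1))
for i in range(d):
    B[i, 0] = (2 * i + 1) * (-1) ** i
    for j in range(d):
        A[i, j] = (2 * i + 1) * (-1 if i < j else (-1) ** (i - j + 1))

m = np.zeros((d, 1))                               # state vector m(t)
u = np.sin(0.05 * np.arange(1000)).reshape(-1, 1)  # toy input signal

for t in range(len(u)):
    # Equation 1: theta * m_hat(t) = A m(t) + B u(t), integrated with forward Euler.
    m_hat = (A @ m + B * u[t]) / theta
    m = m + dt * m_hat

# Equation 3: decode the delayed input u(t - theta') from the memory.
# Shifted Legendre polynomials on [0, 1] are P_i(2r - 1).
r = 0.5  # look back half a window, i.e. theta' = theta / 2
u_delayed = sum(eval_legendre(i, 2 * r - 1) * m[i, 0] for i in range(d))
print(u_delayed)
```

A full implementation (such as the Keras LMU referenced above) discretizes the continuous-time system more carefully and learns the surrounding weights, but this sketch captures the roles of $A$, $B$, $\theta$, and $d$.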
The behavior of the LMU can be controlled by setting the window $\theta$ and the size $d$ of the state vector $m(t)$. We can make a storage-versus-performance trade-off on a sequence task by carefully tuning these parameters.

For example, higher values of $d$ increase the capacity of the cell to retain information that spans long time intervals, while smaller values of $d$ have the opposite effect. The effect of modifying the sliding window $\theta$ is similar to that of the parameter $d$. As a result, $m(t)$ and $\theta$ constitute the memory of the dynamical system.
ODEs are making a resurgence in the neural network world. For example, the LMU cell makes use of an ODE solver. Here is an excellent tutorial on ODEs. One way to view a ResNet is to assume it employs an implicit ODE: it has a structure similar to the Euler method for solving an ODE, beginning with the input and adding residual updates at each layer. The sketch below makes the analogy concrete.
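Here is a small self-contained sketch of that analogy; the residual function, weights, and step size are made up for illustration. A stack of residual layers computes $x \leftarrow x + f(x)$, which is exactly an explicit Euler step $x \leftarrow x + h f(x)$ with $h = 1$ for the ODE $\dot{x} = f(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_branch(x, W):
    # A toy residual branch: one linear map followed by a tanh non-linearity.
    return np.tanh(W @ x)

x0 = rng.normal(size=4)
weights = [rng.normal(scale=0.1, size=(4, 4)) for _ in range(10)]

# ResNet view: each layer adds its residual, x_{l+1} = x_l + f(x_l).
x = x0.copy()
for W in weights:
    x = x + residual_branch(x, W)

# Euler view of dx/dt = f(x): x_{k+1} = x_k + h * f(x_k), with h = 1 per layer.
h = 1.0
y = x0.copy()
for W in weights:
    y = y + h * residual_branch(y, W)

assert np.allclose(x, y)  # the two views coincide for h = 1
```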
Conclusion
The LMU could serve as a better building block for creating encoder-decoder architectures to improve sequence-to-sequence modeling tasks. Other structures that work with the LSTM carry over to the LMU, e.g. a bidirectional LMU.