Sequential Model: Topic Classification

This notebook shows how to specify a model that maps a sequence of tensors $\{\mathbb{R}^{d_1\times d_2 \times \dots}\}^N$ to a discrete value. Specifically, we are going to use an LSTM to perform topic classification, and along the way you will see an example of implementing a more complex module from scratch.

Recurrent Neural Networks and LSTM

Long short-term memory (LSTM) is one of the most popular architectures in the class of recurrent neural networks (RNNs). A typical RNN maintains a hidden state $h$. For a binary classifier, at time step $t$ the state is updated as a linear combination of the previous state $h_{t-1}$ and the current input $x_t$, passed through a $\mathop{sigmoid}$ activation. The activation at the final step can be used to compute the loss and train the classifier.

Simple RNN

Here's an (incomplete) implementation of a two-class classifier in Kokoyi, with learnable parameters $W_x$, $W_h$, and $b$ in a vanilla RNN whose final representation summarizes the sentence. We will use a recursive pattern that should be familiar to you by now:

$ h_t = \tanh (W_{h} \cdot h_{t-1} + W_{x} \cdot x_{t} + b) $

where $t$ starts from 1 and the representation array $h$ is initialized with $h_0$. Let $S$ be the input sentence, where each element is an index into a word dictionary $D$. Note that the indexing of the input sequence usually starts from 0 (unless you pre-processed it), so we right-shift it by padding one <pad> token on the left.
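For reference, a minimal plain-PyTorch sketch of this recurrence and classifier might look as follows. It is only an illustrative sketch: the class name, dimensions, and initialization are our assumptions and are not part of the notebook's Kokoyi code.

import torch

class VanillaRNNClassifier(torch.nn.Module):
    """Sketch of h_t = tanh(W_h h_{t-1} + W_x x_t + b) followed by a classifier head."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes=2):
        super().__init__()
        self.D = torch.nn.Embedding(vocab_size, embed_dim)        # word dictionary D
        self.W_x = torch.nn.Parameter(torch.randn(hidden_dim, embed_dim) * 0.1)
        self.W_h = torch.nn.Parameter(torch.randn(hidden_dim, hidden_dim) * 0.1)
        self.b = torch.nn.Parameter(torch.zeros(hidden_dim))
        self.h_0 = torch.nn.Parameter(torch.zeros(hidden_dim))
        self.out = torch.nn.Linear(hidden_dim, num_classes)       # classifier head

    def forward(self, S):
        # S: LongTensor of token indices into D, shape (L,)
        h = self.h_0
        for x in self.D(S):                                # x_t, one embedding per token
            h = torch.tanh(self.W_h @ h + self.W_x @ x + self.b)
        return self.out(h)                                 # logits summarizing the sentence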

LSTM

A vanilla RNN has difficulty preserving long-term information; LSTM mitigates this problem by adding extra units (gates) to retain memory. At time step $t$, a forget gate $f_t$, an input gate $i_t$, and an output gate $o_t$ are applied to selectively drop old memory and collect useful new state:

$ f_t = \sigma (W_{f, h} \cdot h_{t-1} + W_{f, x} \cdot x_{t} + b_f)$
$ i_t = \sigma (W_{i, h} \cdot h_{t-1} + W_{i, x} \cdot x_{t} + b_i)$
$ o_t = \sigma (W_{o, h} \cdot h_{t-1} + W_{o, x} \cdot x_{t} + b_o)$

A candidate memory cell $\tilde{c}_t$ is maintained similarly, except that it uses $\tanh$ as the activation function:

$ \tilde{c}_t = \tanh (W_{c, h} \cdot h_{t-1} + W_{c, x} \cdot x_{t} + b_c)$

Then, the memory cell is updated from the forget gate, the input gate, the candidate memory cell, and the previous memory cell using the Hadamard product:

$ c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t $

Finally, the output of each step is the hidden state $h_t$, computed from the output gate and the memory cell:

$ h_t = o_t \circ \tanh (c_t)$
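To make the update rules concrete, here is a minimal plain-PyTorch sketch of a single LSTM step that follows the equations above literally. The parameter layout (a dict of (W_h, W_x, b) triples keyed by gate name) is our own choice for illustration.

import torch

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the gate equations above.

    params is a dict with keys 'f', 'i', 'o', 'c'; each value is a tuple
    (W_h, W_x, b) of weight matrices and a bias vector (illustrative layout).
    """
    def T(key, act):
        W_h, W_x, b = params[key]
        return act(W_h @ h_prev + W_x @ x_t + b)

    f_t = T('f', torch.sigmoid)            # forget gate
    i_t = T('i', torch.sigmoid)            # input gate
    o_t = T('o', torch.sigmoid)            # output gate
    c_tilde = T('c', torch.tanh)           # candidate memory cell
    c_t = f_t * c_prev + i_t * c_tilde     # Hadamard products
    h_t = o_t * torch.tanh(c_t)
    return h_t, c_t

# Example shapes: hidden size 8, input size 4.
# params = {k: (torch.randn(8, 8), torch.randn(8, 4), torch.zeros(8)) for k in "fioc"}
# h, c = lstm_step(torch.randn(4), torch.zeros(8), torch.zeros(8), params)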

We will now write the model in Kokoyi straight from the definitions. Let's first define some helper functions and a module. Since the calculations of $f_t$, $i_t$, and $o_t$ have an identical form, we can use a module $T$ to update these gates (cells); we will reuse it for $\tilde{c}_t$ as well. Note the use of ";" to separate inputs from parameters. We also define an inline function $\sigma$:

Then, we can write the main model from the definitions, using the standard cross-entropy as the loss function. Note that $\{W\}^L$ is a sentence of length $L$, where each token is an index into the embedding table $D$. By taking $\{W\}^L$ as an input, $L$ is available. The first statement unpacks the four $T$ modules mentioned earlier; the second statement, $\{D(w); w \in W\}$, maps each token id to its dense representation.
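For comparison, the same structure (embedding lookup over $\{W\}^L$, a recurrence over the tokens, a final linear layer, and a cross-entropy loss) can be sketched with PyTorch built-ins, using torch.nn.LSTMCell in place of the hand-written gates. This is only an illustrative reference, not the Kokoyi-generated code; the class name and dimension arguments are ours.

import torch
import torch.nn.functional as F

class LSTMClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.D = torch.nn.Embedding(vocab_size, embed_dim)    # embedding table D
        self.cell = torch.nn.LSTMCell(embed_dim, hidden_dim)  # replaces the four T modules
        self.linear = torch.nn.Linear(hidden_dim, num_classes)
        self.h_0 = torch.nn.Parameter(torch.zeros(1, hidden_dim))
        self.c_0 = torch.nn.Parameter(torch.zeros(1, hidden_dim))

    def forward(self, W):
        # W: LongTensor of token ids, shape (L,)
        h, c = self.h_0, self.c_0
        for x in self.D(W):                                # {D(w); w in W}
            h, c = self.cell(x.unsqueeze(0), (h, c))
        return self.linear(h).squeeze(0)                   # logits over the topic classes

def loss_fn(logits, label):
    # standard cross-entropy on a single example
    return F.cross_entropy(logits.unsqueeze(0), label.view(1))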


You can let Kokoyi set up the initialization for the LSTM (just copy and paste, then fill in what's needed):

Click here to see the default initialization code generated by Kokoyi for this model (you can use the button above to insert such a cell while at a Kokoyi cell):
class T(torch.nn.Module):
    def __init__(self):
        """ Add your code for parameter initialization here (not necessarily the same names)."""
        super().__init__()
        self.W_x = None
        self.W_h = None
        self.b = None

    def get_parameters(self):
        """ Change the following code to return the parameters as a tuple in the order of (W_x, W_h, b)."""
        return None

    forward = kokoyi.symbol["T"]


class LSTM(torch.nn.Module):
    def __init__(self):
        """ Add your code for parameter initialization here (not necessarily the same names)."""
        super().__init__()
        self.T_s = None
        self.Linear = None
        self.D = None
        self.c_0 = None
        self.h_0 = None

    def get_parameters(self):
        """ Change the following code to return the parameters as a tuple in the order of (T_s, Linear, D, c_0, h_0)."""
        return None

    forward = kokoyi.symbol["LSTM"]
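Before looking at the completed definition, here is one way the blanks could be filled in. This is only a rough sketch: the sizes, the initialization scheme, and the assumption that kokoyi.nn.Linear takes (in_features, out_features) like torch.nn.Linear are ours, not the notebook's exact code.

import torch
import kokoyi
from kokoyi.nn import Linear

# Illustrative sizes; replace with values that match your data.
EMBED_DIM, HIDDEN_DIM, NUM_CLASSES, VOCAB_SIZE = 64, 64, 4, 50000

class T(torch.nn.Module):
    def __init__(self, in_dim=EMBED_DIM, hidden_dim=HIDDEN_DIM):
        super().__init__()
        self.W_x = torch.nn.Parameter(torch.randn(hidden_dim, in_dim) * 0.1)
        self.W_h = torch.nn.Parameter(torch.randn(hidden_dim, hidden_dim) * 0.1)
        self.b = torch.nn.Parameter(torch.zeros(hidden_dim))

    def get_parameters(self):
        return (self.W_x, self.W_h, self.b)

    forward = kokoyi.symbol["T"]


class LSTM(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Four T modules: forget gate, input gate, output gate, candidate cell.
        self.T_s = torch.nn.ModuleList([T() for _ in range(4)])
        self.Linear = Linear(HIDDEN_DIM, NUM_CLASSES)
        self.D = torch.nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.c_0 = torch.nn.Parameter(torch.zeros(HIDDEN_DIM))
        self.h_0 = torch.nn.Parameter(torch.zeros(HIDDEN_DIM))

    def get_parameters(self):
        return (self.T_s, self.Linear, self.D, self.c_0, self.h_0)

    forward = kokoyi.symbol["LSTM"]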

Here's the completed module definition. We import the Linear module from kokoyi.nn. NN modules in Kokoyi are basically the same as NN modules in torch: you can set up a Kokoyi module with the same parameters used in torch, and the forward function behaves almost the same, except for some changes for auto-batching.

Topic Classification using LSTM

Let's first do some setup:

We will use the news articles in the AG_NEWS dataset from torchtext. The dataset consists of label-text pairs, and each text sequence is already tokenized into a sequence of integers.
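The notebook's preprocessing cell is not reproduced here. As a rough sketch under the assumption of a recent torchtext version, the label-text pairs can be turned into integer sequences like this; the basic_english tokenizer, the special tokens, and the label shift are our choices and may differ from the notebook's preprocessing.

import torch
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")

# Build a vocabulary over the training split.
vocab = build_vocab_from_iterator(
    (tokenizer(text) for _, text in AG_NEWS(split="train")),
    specials=["<unk>", "<pad>"],
)
vocab.set_default_index(vocab["<unk>"])

def encode(text):
    # Map raw text to a sequence of integer token ids.
    return torch.tensor(vocab(tokenizer(text)), dtype=torch.long)

# Materialize (label, token-id sequence) pairs; AG_NEWS labels are 1..4, so shift to 0..3.
train_data = [(label - 1, encode(text)) for label, text in AG_NEWS(split="train")]
test_data = [(label - 1, encode(text)) for label, text in AG_NEWS(split="test")]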

Finally, we can set the hyper-parameters and start training!
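The training cell itself is not shown here. A minimal sketch of a per-example training loop follows; the hyper-parameter values and the assumption that the model is called as model(tokens) and returns class logits are ours, not the notebook's.

import torch
import torch.nn.functional as F

def train_epoch(model, data, optimizer):
    """One pass over (label, token-id sequence) pairs, one example at a time."""
    model.train()
    total_loss = 0.0
    for label, tokens in data:
        optimizer.zero_grad()
        logits = model(tokens)                               # assumed call signature
        loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([label]))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(data)

# Hypothetical usage with placeholder hyper-parameters:
# model = LSTM()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# for epoch in range(5):
#     print(epoch, train_epoch(model, train_data, optimizer))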

We can validate the accuracy after training with similar code. We applied straightforward pre-processing to simplify the tutorial; feel free to change it for better performance!
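For the validation step, an accuracy computation along these lines would work (again assuming model(tokens) returns class logits):

import torch

@torch.no_grad()
def evaluate(model, data):
    """Fraction of (label, token-id sequence) pairs classified correctly."""
    model.eval()
    correct = 0
    for label, tokens in data:
        logits = model(tokens)                   # assumed call signature
        correct += int(logits.argmax().item() == label)
    return correct / len(data)

# print("test accuracy:", evaluate(model, test_data))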