Seq2Seq: Part 1

This notebook is an example of specifying a sequence transduction model that transforms an input sequence $x\in\mathbb{R}^{N}$ into an output sequence $y\in\mathbb{R}^{M}$.

This encoder-decoder architecture can be described in Kokoyi as follows:

This notebook will take advantage of teacher forcing, where the ground-truth tokens are fed into the decoder instead of its own predictions.
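For intuition, here is a minimal sketch of teacher forcing with a toy GRU decoder (illustrative names only, not the Kokoyi-generated code): at step $t$ the decoder consumes the true token $y_{t-1}$ rather than its own previous prediction.

```python
# Sketch: teacher forcing in a step-by-step decoder loop (toy sizes, assumed <bos> id 0).
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim = 10, 8, 16
embed = nn.Embedding(vocab_size, emb_dim)
cell = nn.GRUCell(emb_dim, hid_dim)
proj = nn.Linear(hid_dim, vocab_size)

y = torch.randint(0, vocab_size, (5,))   # ground-truth target tokens
h = torch.zeros(hid_dim)                 # decoder state (e.g. from the encoder)

logits = []
prev = y.new_tensor(0)                   # assumed <bos> token id
for t in range(len(y)):
    h = cell(embed(prev).unsqueeze(0), h.unsqueeze(0)).squeeze(0)
    logits.append(proj(h))
    prev = y[t]                          # teacher forcing: feed the true token
loss = nn.functional.cross_entropy(torch.stack(logits), y)
```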

The model can be trivially extended to go deep, such that both encoder and decoder have multiple layers:
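As a rough PyTorch analogue (not the Kokoyi specification), going deep is just a matter of stacking layers in both LSTMs and passing the encoder's per-layer final states to the decoder:

```python
# Sketch: a 2-layer encoder-decoder; the encoder's final (h, c) initialize the decoder.
import torch
import torch.nn as nn

emb_dim, hid_dim, n_layers = 32, 64, 2
encoder = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers)
decoder = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers)

src = torch.randn(7, 1, emb_dim)          # source embeddings: (src_len, batch, emb_dim)
tgt = torch.randn(5, 1, emb_dim)          # target embeddings: (tgt_len, batch, emb_dim)

_, (h, c) = encoder(src)                  # final states of every encoder layer
out, _ = decoder(tgt, (h, c))             # decoder conditioned on the encoder states
```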

Machine Translation Application

The above are warm-ups, though we can adapt the templates to generate real code. We will now develop an LSTM-based translation system with an important addition, namely the attention mechanism. In the example below, $x$ is "They", "are", "watching", ".", "<eos>" and $y$ is "Ils", "regardent", ".", "<eos>".

(Figure: the encoder-decoder translating "They are watching ." into "Ils regardent .")

LSTM encoder

We will reuse the LSTM we developed for document classification, with the same helper function setup:

We assume the input comprises $s$, a list of embeddings, one per word. One modification we add here is $h_0, c_0$, which provide the initial conditions for the LSTM states. This is used, for instance, to condition the decoder LSTM on the encoder's final state.
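A minimal sketch of this interface (assumed names, not the Kokoyi code): run an LSTM cell over the list of word embeddings, starting from optional initial states $(h_0, c_0)$.

```python
# Sketch: stepping an LSTM over a list of embeddings with optional initial states.
import torch
import torch.nn as nn

def run_lstm(cell, s, h0=None, c0=None):
    h = torch.zeros(1, cell.hidden_size) if h0 is None else h0
    c = torch.zeros(1, cell.hidden_size) if c0 is None else c0
    states = []
    for x_t in s:                        # one embedding per word
        h, c = cell(x_t.unsqueeze(0), (h, c))
        states.append(h)
    return torch.cat(states), (h, c)     # all hidden states and the final (h, c)

cell = nn.LSTMCell(input_size=8, hidden_size=16)
s = [torch.randn(8) for _ in range(5)]
hs, (h, c) = run_lstm(cell, s)           # hs: (5, 16)
```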

An optimization: BiLSTM

A BiLSTM runs a second LSTM in the reverse direction and concatenates the states. This is a crude way of saying we don't trust the left-to-right order to capture all the information; as we shall see in Part 2, the Transformer relaxes this even further.
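A minimal sketch of the idea (assumed names): one LSTM reads the sequence left-to-right, a second reads it right-to-left, and their per-position states are concatenated. In plain PyTorch the same effect can be obtained with `nn.LSTM(..., bidirectional=True)`.

```python
# Sketch: a BiLSTM built from two unidirectional LSTMs.
import torch
import torch.nn as nn

emb_dim, hid_dim = 8, 16
fwd = nn.LSTM(emb_dim, hid_dim)
bwd = nn.LSTM(emb_dim, hid_dim)

s = torch.randn(5, 1, emb_dim)            # (seq_len, batch, emb_dim)
h_fwd, _ = fwd(s)                          # left-to-right states
h_bwd, _ = bwd(torch.flip(s, dims=[0]))    # run on the reversed sequence
h_bwd = torch.flip(h_bwd, dims=[0])        # flip back so positions line up
h = torch.cat([h_fwd, h_bwd], dim=-1)      # (5, 1, 2 * hid_dim)
```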

Now we are ready to define a translator that maps a sentence in one language to the other:

Another optimization: attention

One of the most effective techniques in machine translation is to align each output word with some words in the input. Since we cannot know this alignment a priori, we use attention to learn it. To do that, we will define a new LSTM decoder that can attend to the output of the encoder:

The new LSTM decoder has an attention module $Attn$ that injects additional contextual information from the input $h_x$. Here is the definition of $Attn$ and a visualization of how it works:

(Figure: visualization of the attention module $Attn$.)

Attention computes a similarity between the query $q$ and the keys $k$, turns it into a distribution over positions (using softmax), and then takes a weighted sum over the values $v$. In our case, $k$ and $v$ are the encoder outputs $h_x$, $q$ is the decoder state $h$, and that's all the modification needed:
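As a standalone illustration (assumed names, not the Kokoyi definition of $Attn$), the attention step amounts to a few lines of PyTorch: score each encoder output against the decoder state, softmax the scores into weights, and return the weighted sum as the context vector.

```python
# Sketch: dot-product attention over the encoder outputs.
import torch
import torch.nn.functional as F

def attn(q, k, v):
    # q: (d,) decoder state; k, v: (src_len, d) encoder outputs
    scores = k @ q                         # similarity of q with every key, (src_len,)
    weights = F.softmax(scores, dim=0)     # distribution over source positions
    return weights @ v                     # context vector, (d,)

h_x = torch.randn(7, 16)                   # encoder outputs
h = torch.randn(16)                        # current decoder state
context = attn(h, h_x, h_x)                # (16,)
```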

Machine translation using Seq2Seq

Let's first do some setup:

We will use the IWSLT2016 dataset from torchtext and train our model on the German-English subset, which consists of bilingual sentence pairs. Each text sequence is tokenized into a sequence of integers and padded to the same length.
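For illustration only (toy vocabulary, not the actual IWSLT2016 pipeline), the preprocessing described above looks like this: split into tokens, map tokens to integer ids, and pad every sequence to a fixed length.

```python
# Sketch: numericalize and pad sentences with a toy vocabulary.
import torch

PAD, UNK = 0, 1
vocab = {"<pad>": PAD, "<unk>": UNK, "they": 2, "are": 3, "watching": 4, ".": 5, "<eos>": 6}

def numericalize(sentence, max_len=8):
    ids = [vocab.get(tok, UNK) for tok in sentence.lower().split()] + [vocab["<eos>"]]
    return torch.tensor(ids[:max_len] + [PAD] * (max_len - len(ids)))   # pad to max_len

batch = torch.stack([numericalize("They are watching ."), numericalize("They are .")])
# batch has shape (2, 8), with 0s as padding
```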

You can let Kokoyi set up the initialization for the Seq2Seq modules defined above (click the button and then fill in what's needed).

Finally, we can set the hyper-parameters and start training! Note that we use the teacher forcing method, where the ground-truth output sequence is fed into the decoder.
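A rough training-loop sketch (the model and data-loader names are assumptions, not the generated code): the true target prefix is fed to the decoder, the model predicts the shifted target, and padding positions are excluded from the loss.

```python
# Sketch: one training epoch with teacher forcing and padding-aware cross-entropy.
import torch
import torch.nn as nn

PAD = 0
criterion = nn.CrossEntropyLoss(ignore_index=PAD)       # do not penalize padding

def train_epoch(model, loader, optimizer):
    model.train()
    for src, tgt in loader:                              # (batch, src_len), (batch, tgt_len)
        logits = model(src, tgt[:, :-1])                 # teacher forcing: feed true prefix
        loss = criterion(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```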