Recall the encoder-decoder architecture we used in the BiLSTM-LSTM notebook:
%kokoyi
\Module {Seq2Seq} {x, y; Enc, Dec}
h_x \gets Enc(x) \\
\hat{y} \gets Dec(y, h_x) \\
\Return \hat{y} \\
\EndModule
\Module {Enc} {x; EncLayers}
L \gets |EncLayers| \\
h[0 \leq l \leq L-1] \gets \begin{cases}
EncLayers[l](x) & l = 0 \\
EncLayers[l](h[l-1]) & otherwise \\
\end{cases} \\
\Return h[L-1] \\
\EndModule
\Module {Dec} {y, h_x; DecLayers}
L \gets |DecLayers| \\
h[0 \leq l \leq L-1] \gets \begin{cases}
DecLayers[l](y, h_x) & l = 0 \\
DecLayers[l](h[l-1], h_x) & otherwise \\
\end{cases} \\
\Return h[L-1] \\
\EndModule
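For reference, the same wrapper could be hand-written in plain PyTorch roughly as follows (a minimal sketch; the class name `Seq2SeqSketch` and its constructor arguments are ours, not part of the Kokoyi template):
import torch

class Seq2SeqSketch(torch.nn.Module):
    # Hypothetical hand-written analogue of the Kokoyi Seq2Seq module above.
    def __init__(self, enc_layers, dec_layers):
        super().__init__()
        self.enc_layers = torch.nn.ModuleList(enc_layers)
        self.dec_layers = torch.nn.ModuleList(dec_layers)

    def forward(self, x, y):
        h = x
        for layer in self.enc_layers:      # Enc: run the encoder stack on x
            h = layer(h)
        out = y
        for layer in self.dec_layers:      # Dec: every layer also conditions on the encoder output
            out = layer(out, h)
        return out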
And compare this with the top-level Transformer as written below:
%kokoyi
\Module{Transformer}{\{x\}^N, \{y\}^M ; EncLayers, DecLayers, W}
(L_E, L_D, d) \gets (|EncLayers|, |DecLayers|, |x[0]|) \\
\hat{x} \gets \{x[i] + Enc_{pos}(i,d)\}_{i=0}^{N-1} \Comment{Positional Encoding.}\\
h[0 \leq l \leq L_E - 1] \gets \begin{cases}
EncLayers[0](\hat{x}) & l = 0 \\
EncLayers[l](h[l - 1]) & otherwise \\
\end{cases} \\
h_x \gets h[L_E - 1] \\
\hat{y} \gets \{y[i] + Enc_{pos}(i,d)\}_{i=0}^{M-1} \Comment{Positional Encoding.}\\
k[0 \leq l \leq L_D - 1] \gets \begin{cases}
DecLayers[0](\hat{y}, h_x) & l = 0 \\
DecLayers[l](k[l - 1], h_x) & otherwise \\
\end{cases} \\
\Return k[L_D - 1] @ W \\
\EndModule
\Function{Enc_{pos}}{pos, d_{model}}
PE(i) \gets
\begin{cases}
\sin(\frac{pos}{10000 ** (i/d_{model})}) & i \mod 2 = 0 \\
\cos(\frac{pos}{10000 ** ((i-1)/d_{model})}) & otherwise \\
\end{cases} \\
\Return \{ PE(i) \}_{i=0}^{d_{model}-1} \\
\EndFunction
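As a sanity check, $Enc_{pos}$ can be written directly in PyTorch as below (a minimal sketch; the function name `sinusoidal_pe` is ours):
import torch

def sinusoidal_pe(pos, d_model):
    # PE[i] = sin(pos / 10000^(i / d_model))        if i is even
    #         cos(pos / 10000^((i - 1) / d_model))  if i is odd
    i = torch.arange(d_model, dtype=torch.float32)
    angle = pos / torch.pow(10000.0, (i - (i % 2)) / d_model)
    return torch.where(i % 2 == 0, torch.sin(angle), torch.cos(angle))  # shape: (d_model,)

# Example: the encoding added to the token at position 3 of a 16-dimensional embedding.
print(sinusoidal_pe(3, 16))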
You will notice that the high-level encoder-decoder architecture is almost the same. Since a sequence model is nothing but a model of the conditional probability $p(y_t \mid y_{<t}, x)$, the crucial difference of the Transformer is the way it processes all tokens of $y$ in parallel at the input (remember we are doing teacher-forcing training), whereas our earlier model predicts $y_{[t]}$ given $x$ and $y_{[0:t-1]}$ sequentially. Taking all input tokens at once loses their positional information; as a remedy, the Transformer applies a positional encoding $Enc_{pos}$ to the input sequences (both source and target).
One important advantage of the Transformer is that it achieves this parallelism while preserving the sequential-prediction nature of the task, and this is reflected in a couple of places, as we will show shortly.
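Concretely, teacher forcing feeds the ground-truth target shifted by one position so that every output position can be trained in a single parallel pass; the training loop at the end of this notebook does exactly this with `en[:, :-1]` and `en[:, 1:]`. A tiny illustration (the token IDs are made up):
import torch

# A hypothetical target sentence: <bos> w1 w2 w3 <eos>, with 1 = <bos> and 2 = <eos>.
en = torch.tensor([[1, 7, 8, 9, 2]])

tgt_seq = en[:, :-1]   # decoder input:  <bos> w1 w2 w3
gold    = en[:, 1:]    # labels:         w1   w2 w3 <eos>
# Position t of the decoder output is trained to predict gold[:, t],
# while the causal mask introduced below ensures it only sees tgt_seq[:, :t+1].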
Let's first look at the next level of detail, as depicted in the diagram below:
Let's look at the encoder. Recall that in the Bi-LSTM the representation of each position fuses information from its neighbors on both sides. The Transformer pushes this to the extreme and pulls from all positions: for a position $w$, with $S$ the collection of all positions in the sentence, level $l$ computes $h^l[w] \gets f(h^{l-1}[w], g(h^{l-1}[v] \mid v \in S))$ for some functions $f$ and $g$.
This is done with the $\mathrm{MHAttn}$ block below, which reuses the attention module $\mathrm{Attn}$ we developed earlier. $\mathrm{MHAttn}$ adds a few things on top: each of several heads gets its own query/key/value projections, and the concatenated head outputs are fused by an output projection $W^o$:
%kokoyi
\Function{Attn}{\{q\}^N, \{k\}^M, \{v\}^M}
d \gets |q[0]| \\
a \gets \{ \Softmax(\{\frac{\trans{q[i]} @ k[j]}{\sqrt{d}} \}_{j = 0}^{M - 1}) \}_{i=0}^{N-1} \\
\Return a @ v \\
\EndFunction
\Module{MHAttn}{\{x\}^N, \{h\}^M ; W^q, W^k, W^v, W^o}
L \gets |W^q| \\
v \gets \{ Attn(x @ W^q[l], h @ W^k[l], h @ W^v[l]) \}_{l=0}^{L-1} \\
res \gets \{ \concat_{l = 0}^{L - 1}{ v[l, i] } \}_{i = 0}^{N - 1}\\
\Return res @ W^o \\
\EndModule
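If you prefer plain tensor code, here is a minimal unbatched PyTorch sketch of the same two pieces (the variable names are ours; the Kokoyi definitions above remain the reference):
import math
import torch

def attn(q, k, v):
    # q: (N, d), k: (M, d), v: (M, d) -- a single attention head, no batch dimension.
    d = q.shape[-1]
    scores = q @ k.t() / math.sqrt(d)      # (N, M) scaled dot products
    a = torch.softmax(scores, dim=-1)      # each query attends over all M keys
    return a @ v                           # (N, d)

def mh_attn(x, h, Wq, Wk, Wv, Wo):
    # x: (N, e), h: (M, e); Wq, Wk, Wv: (L, e, d); Wo: (L * d, e)
    heads = [attn(x @ Wq[l], h @ Wk[l], h @ Wv[l]) for l in range(Wq.shape[0])]
    return torch.cat(heads, dim=-1) @ Wo   # concatenate the L heads, then project back to e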
Now we are ready to build the encoder layer. Note that it wraps the multi-head self-attention and a position-wise feed-forward network ($FFN$), each followed by a residual connection and layer normalization:
%kokoyi
\Module {FFN} {x ; Linears}
\Return Linears[1](\ReLU(Linears[0](x))) \\
\EndModule
\Module{EncLayer}{\{x\}^N ; MHAttn, LayerNorms, FFN}
x' \gets MHAttn(x, x) \\
u \gets LayerNorms[0](x + x') \\
u' \gets FFN(u) \\
\Return LayerNorms[1](u + u') \\
\EndModule
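For comparison, here is the same residual-plus-LayerNorm pattern written directly in PyTorch. This is only a sketch using the built-in `torch.nn.MultiheadAttention`, which splits `embed_dim` across heads rather than using a per-head `hidden_dim` as the $\mathrm{MHAttn}$ above does:
import torch

class EncLayerSketch(torch.nn.Module):
    def __init__(self, embed_dim, hidden_dim, num_heads):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(embed_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, embed_dim),
        )
        self.norm1 = torch.nn.LayerNorm(embed_dim)
        self.norm2 = torch.nn.LayerNorm(embed_dim)

    def forward(self, x):                   # x: (batch, N, embed_dim)
        x2, _ = self.attn(x, x, x)          # self-attention: queries, keys and values are all x
        u = self.norm1(x + x2)              # residual connection + LayerNorm
        return self.norm2(u + self.ffn(u))  # second residual + LayerNorm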
The decoder is very similar to the encoder, except that it also needs to attend to the encoder states -- this part is no different from the BiLSTM-LSTM translator we saw earlier. As in the encoder, the decoder performs self-attention internally, with one crucial difference: to reflect the sequential-prediction nature, a position never attends to future positions (i.e. the prediction of $y_{[t]}$ only attends to $y_{[0:t-1]}$). To do so, we need to define a different attention module:
%kokoyi
\Function{MaskedAttn}{\{q\}^N, \{k\}^N, \{v\}^N}
d \gets |q[0]| \\
a \gets \{ \Softmax(\{\frac{\trans{q[i]} @ k[j]}{\sqrt{d}} \}_{j = 0}^{i}) \}_{i=0}^{N-1} \\
\Return a @ v \\
\EndFunction
\Module{MaskedMHAttn}{\{x\}^N, \{h\}^M ; W^q, W^k, W^v, W^o}
L \gets |W^q| \\
v \gets \{ MaskedAttn(x @ W^q[l], h @ W^k[l], h @ W^v[l]) \}_{l=0}^{L-1} \\
res \gets \{ \concat_{l = 0}^{L - 1}{ v[l, i] } \}_{i = 0}^{N - 1}\\
\Return res @ W^o \\
\EndModule
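The only change relative to $\mathrm{Attn}$ is that the softmax for query position $i$ runs only over key positions $j \le i$. In batched tensor code this is usually expressed with a causal mask of $-\infty$ scores rather than a ragged loop; a minimal unbatched sketch (names are ours):
import math
import torch

def masked_attn(q, k, v):
    # q, k, v: (N, d) -- decoder self-attention, so all three share the same length N.
    n, d = q.shape
    scores = q @ k.t() / math.sqrt(d)                     # (N, N)
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float('-inf'))   # positions j > i get zero attention weight
    a = torch.softmax(scores, dim=-1)
    return a @ v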
Now we are ready to write out the Transformer decoder layer. Note that we first perform masked (partial) self-attention over the decoder states, followed by full attention over the encoder states, so that the prediction is conditioned on $x$.
%kokoyi
\Module{DecLayer}{\{y\}^N, \{h\}^M ; MaskedMHAttn, MHAttn, FFN, LayerNorms}
y' \gets MaskedMHAttn(y, y) \\
u \gets LayerNorms[0](y + y') \\
u' \gets MHAttn(u, h) \\
v \gets LayerNorms[1](u + u') \\
v' \gets FFN(v) \\
\Return LayerNorms[2](v + v') \\
\EndModule
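Written out in the same plain-PyTorch style as the encoder-layer sketch above (again using the built-in `torch.nn.MultiheadAttention`, so the head parameterization differs from $\mathrm{MHAttn}$/$\mathrm{MaskedMHAttn}$):
import torch

class DecLayerSketch(torch.nn.Module):
    def __init__(self, embed_dim, hidden_dim, num_heads):
        super().__init__()
        self.self_attn = torch.nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cross_attn = torch.nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(embed_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, embed_dim),
        )
        self.norms = torch.nn.ModuleList(torch.nn.LayerNorm(embed_dim) for _ in range(3))

    def forward(self, y, h_x):              # y: (batch, N, e), h_x: (batch, M, e)
        n = y.shape[1]
        # True above the diagonal = "may not attend", i.e. no peeking at future positions.
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool, device=y.device), diagonal=1)
        y2, _ = self.self_attn(y, y, y, attn_mask=causal)  # masked self-attention
        u = self.norms[0](y + y2)
        u2, _ = self.cross_attn(u, h_x, h_x)               # full attention over the encoder states
        v = self.norms[1](u + u2)
        return self.norms[2](v + self.ffn(v))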
We now complete the rest of the NN module definitions in Python, using the templates generated by Kokoyi.
import torch
import math
import kokoyi

class Transformer(torch.nn.Module):
    def __init__(self, L, num_heads, embed_dim, hidden_dim, out_dim):
        super().__init__()
        # Change the codes below to initialize module members.
        self.EncLayers = torch.nn.ModuleList(
            [EncLayer(num_heads, embed_dim, hidden_dim) for _ in range(L)])
        self.DecLayers = torch.nn.ModuleList(
            [DecLayer(num_heads, embed_dim, hidden_dim) for _ in range(L)])
        self.W = torch.nn.Parameter(torch.Tensor(embed_dim, out_dim))
        gain = torch.nn.init.calculate_gain('relu')
        torch.nn.init.xavier_uniform_(self.W, gain=gain)

    def get_parameters(self):
        # Return module members in its declaration order.
        return self.EncLayers, self.DecLayers, self.W

    forward = kokoyi.symbol[r"Transformer"]
class MHAttn(torch.nn.Module):
    def __init__(self, num_heads, embed_dim, hidden_dim):
        super().__init__()
        factor = math.sqrt(embed_dim + hidden_dim) / math.sqrt(embed_dim + num_heads * hidden_dim)
        self.Wq = torch.nn.Parameter(torch.Tensor(num_heads, embed_dim, hidden_dim))
        self.Wk = torch.nn.Parameter(torch.Tensor(num_heads, embed_dim, hidden_dim))
        self.Wv = torch.nn.Parameter(torch.Tensor(num_heads, embed_dim, hidden_dim))
        self.Wo = torch.nn.Parameter(torch.Tensor(num_heads * hidden_dim, embed_dim))
        torch.nn.init.xavier_uniform_(self.Wq, gain=factor)
        torch.nn.init.xavier_uniform_(self.Wk, gain=factor)
        torch.nn.init.xavier_uniform_(self.Wv, gain=factor)
        torch.nn.init.xavier_uniform_(self.Wo, gain=factor)

    def get_parameters(self):
        # Return module members in its declaration order.
        return self.Wq, self.Wk, self.Wv, self.Wo

    forward = kokoyi.symbol[r"MHAttn"]
class FFN(torch.nn.Module):
    def __init__(self, embed_dim, hidden_dim):
        super().__init__()
        # Change the codes below to initialize module members.
        self.Linears = torch.nn.ModuleList([
            kokoyi.nn.Linear(embed_dim, hidden_dim),
            kokoyi.nn.Linear(hidden_dim, embed_dim),
        ])

    def get_parameters(self):
        # Return module members in its declaration order.
        return self.Linears

    forward = kokoyi.symbol[r"FFN"]
class EncLayer(torch.nn.Module):
    def __init__(self, num_heads, embed_dim, hidden_dim):
        super().__init__()
        self.attn = MHAttn(num_heads, embed_dim, hidden_dim)
        self.norm1 = kokoyi.nn.LayerNorm(embed_dim)
        self.norm2 = kokoyi.nn.LayerNorm(embed_dim)
        self.ffn = FFN(embed_dim, hidden_dim)

    def get_parameters(self):
        # Return module members in its declaration order.
        return self.attn, [self.norm1, self.norm2], self.ffn

    forward = kokoyi.symbol[r"EncLayer"]
class MaskedMHAttn(torch.nn.Module):
    def __init__(self, num_heads, embed_dim, hidden_dim):
        super().__init__()
        factor = math.sqrt(embed_dim + hidden_dim) / math.sqrt(embed_dim + num_heads * hidden_dim)
        self.Wq = torch.nn.Parameter(torch.Tensor(num_heads, embed_dim, hidden_dim))
        self.Wk = torch.nn.Parameter(torch.Tensor(num_heads, embed_dim, hidden_dim))
        self.Wv = torch.nn.Parameter(torch.Tensor(num_heads, embed_dim, hidden_dim))
        self.Wo = torch.nn.Parameter(torch.Tensor(num_heads * hidden_dim, embed_dim))
        torch.nn.init.xavier_uniform_(self.Wq, gain=factor)
        torch.nn.init.xavier_uniform_(self.Wk, gain=factor)
        torch.nn.init.xavier_uniform_(self.Wv, gain=factor)
        torch.nn.init.xavier_uniform_(self.Wo, gain=factor)

    def get_parameters(self):
        # Return module members in its declaration order.
        return self.Wq, self.Wk, self.Wv, self.Wo

    forward = kokoyi.symbol[r"MaskedMHAttn"]
class DecLayer(torch.nn.Module):
    def __init__(self, num_heads, embed_dim, hidden_dim):
        super().__init__()
        self.mask_attn = MaskedMHAttn(num_heads, embed_dim, hidden_dim)
        self.attn = MHAttn(num_heads, embed_dim, hidden_dim)
        self.ffn = FFN(embed_dim, hidden_dim)
        self.norm1 = kokoyi.nn.LayerNorm(embed_dim)
        self.norm2 = kokoyi.nn.LayerNorm(embed_dim)
        self.norm3 = kokoyi.nn.LayerNorm(embed_dim)

    def get_parameters(self):
        # Return module members in its declaration order.
        return self.mask_attn, self.attn, self.ffn, [self.norm1, self.norm2, self.norm3]

    forward = kokoyi.symbol[r"DecLayer"]
We use the same setup as in the Seq2Seq_LSTM tutorial to train the Transformer model on the machine translation task.
import os
import torch
import torchtext
from collections import Counter
from torchtext.datasets import IWSLT2016
from torch.utils.data import DataLoader
from torchtext.vocab import vocab, build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer
from torch.nn.utils.rnn import pad_sequence
import torch.nn.functional as F
import kokoyi
We will use the IWSLT2016 dataset from torchtext and train our model on its German-English subset, which consists of bilingual sentence pairs. Each text sequence is tokenized into a sequence of integer IDs, and sequences within a batch are padded to the same length.
if not os.path.exists('data'):
    os.mkdir('data')

BATCH_SIZE = 16
MAX_LEN = 100

print("Creating dataset ...")
train_iter = IWSLT2016(root='data', split='train', language_pair=('de', 'en'))
test_iter = IWSLT2016(root='data', split='test', language_pair=('de', 'en'))
train_dataset = list(train_iter)
test_dataset = list(test_iter)

print("Building vocab ...")
# tokenizers
de_tokenizer = get_tokenizer('spacy', language='de')
en_tokenizer = get_tokenizer('spacy', language='en')

# build vocab
def yield_tokens(dataset, idx, tokenizer):
    for sentence in dataset:
        yield tokenizer(sentence[idx])

de_vocab = build_vocab_from_iterator(yield_tokens(train_dataset, 0, de_tokenizer), specials=['<pad>', '<bos>', '<eos>', '<unk>'])
en_vocab = build_vocab_from_iterator(yield_tokens(train_dataset, 1, en_tokenizer), specials=['<pad>', '<bos>', '<eos>', '<unk>'])
de_vocab.set_default_index(de_vocab['<unk>'])
en_vocab.set_default_index(en_vocab['<unk>'])
print('src_vocab_size', len(de_vocab))
print('tgt_vocab_size', len(en_vocab))

print("Text to vocab IDs ...")
# convert to tensor
text_pipeline = lambda x, vocab, tokenizer: [vocab['<bos>']] + [vocab[token] for token in tokenizer(x)] + [vocab['<eos>']]

def _convert_data(dataset):
    new_dataset = []
    for _de, _en in dataset:
        new_de = torch.tensor(text_pipeline(_de, de_vocab, de_tokenizer))
        new_en = torch.tensor(text_pipeline(_en, en_vocab, en_tokenizer))
        if len(new_de) <= MAX_LEN and len(new_en) <= MAX_LEN:
            new_dataset.append((new_de, new_en))
    return new_dataset

train_dataset = _convert_data(train_dataset)
test_dataset = _convert_data(test_dataset)
print("Train set: %d" % len(train_dataset))
print("Test set: %d" % len(test_dataset))

def collate_batch(batch):
    de_batch, en_batch = [], []
    for (_de, _en) in batch:
        de_batch.append(_de)
        en_batch.append(_en)
    de_batch = pad_sequence(de_batch, padding_value=de_vocab['<pad>'], batch_first=True)
    en_batch = pad_sequence(en_batch, padding_value=en_vocab['<pad>'], batch_first=True)
    return de_batch, en_batch

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
kokoyi.set_rt_device(device)
print('Device', device)
num_epochs = 3
embed_size = 16
hidden_size = 64
num_layers = 2
num_heads = 2
src_vocab_size = len(de_vocab)
tgt_vocab_size = len(en_vocab)
src_embedding = torch.nn.Parameter(torch.empty((src_vocab_size, embed_size), device=device))
tgt_embedding = torch.nn.Parameter(torch.empty((tgt_vocab_size, embed_size), device=device))
gain = torch.nn.init.calculate_gain('relu')
torch.nn.init.xavier_uniform_(src_embedding, gain=gain)
torch.nn.init.xavier_uniform_(tgt_embedding, gain=gain)
model = Transformer(num_layers, num_heads, embed_size, hidden_size, tgt_vocab_size).to(device)
print(model)
parameters = list(model.parameters()) + [src_embedding, tgt_embedding]
optimizer = torch.optim.Adam(parameters)
for epoch in range(num_epochs):
    total_loss, n_word_total, n_word_correct = 0, 0, 0
    for i, (de, en) in enumerate(train_dataloader):
        # prepare data
        de, en = de.to(device), en.to(device)
        src_seq = de
        tgt_seq, gold = en[:, :-1], en[:, 1:]
        # Look up the embedding table
        src_emb = F.embedding(src_seq, src_embedding, padding_idx=de_vocab['<pad>'])
        tgt_emb = F.embedding(tgt_seq, tgt_embedding, padding_idx=en_vocab['<pad>'])
        # forward
        optimizer.zero_grad()
        pred = model(src_emb, tgt_emb, batch_level=[1, 1])
        # backward and update parameters
        pred = pred.reshape(-1, pred.size(2))
        gold = gold.contiguous().view(-1)
        loss = F.cross_entropy(pred, gold, ignore_index=en_vocab['<pad>'], reduction='mean')
        pred = pred.max(1)[1]
        non_pad_mask = gold.ne(en_vocab['<pad>'])
        n_correct = pred.eq(gold).masked_select(non_pad_mask).sum().item()
        n_word = non_pad_mask.sum().item()
        loss.backward()
        optimizer.step()
        # Bookkeeping
        n_word_total += n_word
        n_word_correct += n_correct
        total_loss += loss.item()
        if i % 5 == 0:
            print(f'Epoch {epoch:04d} | Iter {i:04d} | Loss {loss.item():.4f} | Acc {(n_correct / n_word):.4f}')
    print(f'Epoch {epoch:04d} | Avg Loss {(total_loss / (i + 1)):.4f} | Avg Acc {(n_word_correct / n_word_total):.4f}')