1. Introduction
In this blog, you will learn how to use PyTorch to build sequence to sequence models with attention mechanisms for machine translation tasks.
Sequence to sequence models are a type of neural network model that can map a variable-length input sequence to a variable-length output sequence. They are widely used for natural language processing (NLP) tasks such as machine translation, text summarization, speech recognition, and chatbots.
Attention mechanisms are a technique that allows the model to focus on the most relevant parts of the input sequence when generating the output sequence. They can improve the performance and accuracy of sequence to sequence models, especially for long sequences.
Machine translation is the task of automatically translating a text from one language to another. It is one of the most challenging and popular applications of sequence to sequence models and attention mechanisms.
PyTorch is an open-source framework that provides a flexible and easy-to-use platform for building and training deep learning models. It offers a rich set of tools and libraries for NLP, such as torchtext, transformers, and ignite.
By the end of this blog, you will be able to:
- Understand the basic concepts and components of sequence to sequence models and attention mechanisms.
- Implement a sequence to sequence model with dot-product attention and multi-head attention using PyTorch.
- Train and evaluate the model on a machine translation dataset.
- Generate translations for new sentences using the model.
Are you ready to dive into the world of sequence to sequence models and attention mechanisms with PyTorch? Let’s get started!
2. Sequence to Sequence Models
In this section, you will learn about the basic concepts and components of sequence to sequence models, which are the foundation of many NLP tasks such as machine translation.
A sequence to sequence model is a type of neural network model that can map a variable-length input sequence to a variable-length output sequence. For example, given a sentence in English as the input sequence, the model can generate a sentence in French as the output sequence.
A sequence to sequence model consists of two main parts: an encoder and a decoder. The encoder takes the input sequence and encodes it into a fixed-length vector, which is called the context vector. The context vector is supposed to capture the meaning and information of the input sequence. The decoder takes the context vector and generates the output sequence, one token at a time.
However, there are some challenges and limitations of using a simple sequence to sequence model. For instance, how can the model handle long sequences? How can the model deal with rare or unknown words? How can the model generate diverse and fluent outputs? To address these issues, researchers have proposed various techniques and extensions, such as teacher forcing, scheduled sampling, attention mechanisms, and multi-head attention.
In the next subsections, you will learn more about these techniques and how they can improve the performance and accuracy of sequence to sequence models.
2.1. Encoder-Decoder Architecture
In this subsection, you will learn more about the encoder-decoder architecture, which is the basic structure of sequence to sequence models. You will also see how to implement it using PyTorch.
The encoder-decoder architecture consists of two main components: an encoder and a decoder. The encoder is a neural network that takes the input sequence as a series of tokens and encodes it into a fixed-length vector, which is called the context vector. The context vector is supposed to capture the meaning and information of the input sequence. The decoder is another neural network that takes the context vector as input and generates the output sequence as a series of tokens, one at a time.
There are different types of neural networks that can be used as encoders and decoders, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformers. In this tutorial, we will use RNNs as our encoder and decoder, since they are well-suited for sequential data. RNNs are neural networks that have a hidden state that can store information from previous inputs. They can process variable-length sequences and learn long-term dependencies.
To implement the encoder-decoder architecture using PyTorch, we need to define two classes: Encoder and Decoder. The Encoder class will inherit from torch.nn.Module, which is the base class for all neural network modules in PyTorch. The Encoder class will have the following attributes and methods:
- input_dim: the dimension of the input vocabulary, which is the number of unique tokens in the input language.
- hidden_dim: the dimension of the hidden state of the RNN, which is also the dimension of the context vector.
- embedding: an embedding layer that maps the input tokens to dense vectors of size hidden_dim.
- rnn: an RNN layer that takes the embedded input sequence and outputs the hidden state and the context vector.
- forward: a method that takes the input sequence and returns the context vector.
The code for the Encoder class is as follows:
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(Encoder, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(input_dim, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim)

    def forward(self, input_seq):
        # input_seq: [seq_len, batch_size]
        embedded = self.embedding(input_seq)
        # embedded: [seq_len, batch_size, hidden_dim]
        output, hidden = self.rnn(embedded)
        # output: [seq_len, batch_size, hidden_dim]
        # hidden: [1, batch_size, hidden_dim]
        return hidden  # context vector
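To check that the encoder produces a context vector of the expected shape, you can push a small batch of random token indices through it. The sizes below (a vocabulary of 10 tokens, a hidden dimension of 8) are arbitrary values chosen only for this sanity check:

encoder = Encoder(input_dim=10, hidden_dim=8)
dummy_input = torch.randint(0, 10, (5, 2))   # [seq_len=5, batch_size=2]
context = encoder(dummy_input)
print(context.shape)                         # torch.Size([1, 2, 8])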
2.2. Teacher Forcing and Scheduled Sampling
In this subsection, you will learn about two techniques that can improve the training and generation of sequence to sequence models: teacher forcing and scheduled sampling. You will also see how to implement them using PyTorch.
Teacher forcing is a technique that feeds the ground truth output tokens to the decoder during training, instead of the tokens generated by the decoder in the previous time step. This can speed up convergence and stabilize training, since the decoder always conditions on correct tokens. However, it also introduces exposure bias, which is the discrepancy between training, where the decoder sees the ground truth, and inference, where it must rely on its own (possibly wrong) predictions.
Scheduled sampling is a technique that gradually reduces the use of teacher forcing and increases the use of the decoder’s own predictions during training. This can help the model to learn from its own mistakes and generate more diverse and robust outputs. However, scheduled sampling can also introduce inconsistency and noise, especially when the model is not confident or accurate enough.
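One common way to schedule the sampling is to decay the teacher forcing ratio as training progresses. The helper below is a minimal sketch; the linear, epoch-based decay is just one illustrative choice (exponential or inverse-sigmoid schedules are also used):

def teacher_forcing_ratio_at(epoch, n_epochs, start=1.0, end=0.0):
    # Linearly decay the ratio from `start` at epoch 0 to `end` at the last epoch
    if n_epochs <= 1:
        return end
    progress = epoch / (n_epochs - 1)
    return start + (end - start) * progress

# Example: over 10 epochs the ratio goes 1.0, 0.89, ..., 0.0
ratios = [round(teacher_forcing_ratio_at(e, 10), 2) for e in range(10)]
print(ratios)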
To implement teacher forcing and scheduled sampling using PyTorch, we need to define a Decoder class that complements the Encoder from the previous subsection. The Decoder class will inherit from torch.nn.Module, which is the base class for all neural network modules in PyTorch. The Decoder class will have the following attributes and methods:
- output_dim: the dimension of the output vocabulary, which is the number of unique tokens in the output language.
- hidden_dim: the dimension of the hidden state of the RNN, which is the same as the dimension of the context vector.
- embedding: an embedding layer that maps the output tokens to dense vectors of size hidden_dim.
- rnn: an RNN layer that takes the embedded output token and the context vector as inputs and outputs the hidden state and the output vector.
- fc: a linear layer that takes the output vector and maps it to the output vocabulary.
- forward: a method that takes the current output token, the context vector, and a teacher forcing ratio as inputs and returns the output logits, the new hidden state, and the token to feed at the next time step.
The code for the Decoder class is as follows:
import torch
import torch.nn as nn
import random

class Decoder(nn.Module):
    def __init__(self, output_dim, hidden_dim):
        super(Decoder, self).__init__()
        self.output_dim = output_dim
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(output_dim, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, output_token, context_vector, teacher_forcing_ratio):
        # output_token: [batch_size] -- the ground-truth token for the current step
        # context_vector: [1, batch_size, hidden_dim]
        # teacher_forcing_ratio: a float between 0 and 1
        output_token = output_token.unsqueeze(0)
        # output_token: [1, batch_size]
        embedded = self.embedding(output_token)
        # embedded: [1, batch_size, hidden_dim]
        output, hidden = self.rnn(embedded, context_vector)
        # output: [1, batch_size, hidden_dim]
        # hidden: [1, batch_size, hidden_dim]
        output = self.fc(output.squeeze(0))
        # output: [batch_size, output_dim]
        if random.random() < teacher_forcing_ratio:
            # teacher forcing: keep a ground-truth token as the next input
            # (the training loop supplies it from the target sequence)
            next_token = output_token.squeeze(0)
        else:
            # otherwise feed the model's own prediction at the next time step
            next_token = output.argmax(1)
        return output, hidden, next_token
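As a quick sanity check, you can run a single decoding step, reusing the context vector from the encoder example above (the output vocabulary size of 12 is again an arbitrary value):

decoder = Decoder(output_dim=12, hidden_dim=8)
start_tokens = torch.zeros(2, dtype=torch.long)       # [batch_size=2], e.g. <sos> indices
logits, hidden, next_token = decoder(start_tokens, context, teacher_forcing_ratio=0.5)
print(logits.shape, hidden.shape, next_token.shape)   # [2, 12], [1, 2, 8], [2]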
3. Attention Mechanisms
In this section, you will learn about attention mechanisms, a technique that can enhance the performance and accuracy of sequence to sequence models for machine translation and other NLP tasks. You will also see how to implement two types of attention mechanisms using PyTorch: dot-product attention and multi-head attention.
Attention mechanisms are a technique that allows the decoder to focus on the most relevant parts of the input sequence when generating the output sequence. They can overcome the limitations of using a fixed-length context vector to encode the entire input sequence, which can cause information loss and degradation for long sequences. They can also improve the alignment and fluency of the output sequence, especially for languages that have different word orders or structures.
The basic idea of attention mechanisms is to compute a score or weight for each input token, based on its similarity or relevance to the current output token. The scores are then used to calculate a weighted sum of the encoder's representations of the input tokens, which is called the attention vector. The attention vector is then concatenated with the output token representation and fed to the decoder to generate the next output token.
There are different ways to compute the score or weight for each input token, such as additive attention, multiplicative attention, and scaled dot-product attention. In this tutorial, we will use dot-product attention, which is a simple and efficient way to calculate the score by taking the dot product of the input token and the output token. We will also use multi-head attention, which is a technique that allows the model to attend to different aspects or features of the input sequence by using multiple attention heads.
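Before wrapping this logic in a module, it helps to see the three steps (score, softmax, weighted sum) on plain tensors. This is a minimal sketch with made-up shapes, using a single decoder state attending over a short input sequence:

import torch

seq_len, batch_size, hidden_dim = 4, 1, 8
encoder_states = torch.randn(seq_len, batch_size, hidden_dim)   # one vector per input token
decoder_state = torch.randn(1, batch_size, hidden_dim)          # current output-side state

# 1. Score every input position with a dot product
scores = torch.bmm(decoder_state.permute(1, 0, 2), encoder_states.permute(1, 2, 0))
# scores: [batch_size, 1, seq_len]
# 2. Turn the scores into weights that sum to 1
weights = torch.softmax(scores, dim=-1)
# weights: [batch_size, 1, seq_len]
# 3. Take the weighted sum of the encoder states (the attention vector)
attention = torch.bmm(weights, encoder_states.permute(1, 0, 2))
# attention: [batch_size, 1, hidden_dim]
print(weights.sum(dim=-1), attention.shape)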
In the next subsections, you will learn more about these two types of attention mechanisms and how to implement them using PyTorch.
3.1. Dot-Product Attention
In this subsection, you will learn more about dot-product attention, which is a simple and efficient way to compute the score or weight for each input token based on its similarity or relevance to the output token. You will also see how to implement it using PyTorch.
Dot-product attention is a type of attention mechanism that calculates the score by taking the dot product of the input token representation and the output token representation. The dot product of two vectors is large when they point in similar directions and small or negative when they do not, so it can be used as a measure of similarity or relevance between the input token and the output token.
To implement dot-product attention using PyTorch, we need to define a class called DotProductAttention. The DotProductAttention class will inherit from torch.nn.Module, which is the base class for all neural network modules in PyTorch. The DotProductAttention class will have the following attributes and methods:
- hidden_dim: the dimension of the hidden state of the RNN, which is also the dimension of the input token and the output token.
- scale: a scaling factor that divides the dot product by the square root of the hidden dimension. This is to prevent the dot product from becoming too large or too small, which can cause numerical instability or vanishing gradients.
- forward: a method that takes the input sequence, the output token, and the context vector as inputs and returns the attention vector and the attention weights.
The code for the DotProductAttention class is as follows:
import torch
import torch.nn as nn
import math

class DotProductAttention(nn.Module):
    def __init__(self, hidden_dim):
        super(DotProductAttention, self).__init__()
        self.hidden_dim = hidden_dim
        self.scale = math.sqrt(hidden_dim)

    def forward(self, input_seq, output_token, context_vector):
        # input_seq: [seq_len, batch_size, hidden_dim]
        # output_token: [1, batch_size, hidden_dim]
        # context_vector: [1, batch_size, hidden_dim] (kept for interface consistency)
        score = torch.bmm(output_token.permute(1, 0, 2), input_seq.permute(1, 2, 0)) / self.scale
        # score: [batch_size, 1, seq_len]
        weight = torch.softmax(score, dim=-1)
        # weight: [batch_size, 1, seq_len]
        attention = torch.bmm(weight, input_seq.permute(1, 0, 2))
        # attention: [batch_size, 1, hidden_dim]
        attention = attention.permute(1, 0, 2)
        # attention: [1, batch_size, hidden_dim]
        attention = torch.cat((attention, output_token), dim=2)
        # attention: [1, batch_size, 2 * hidden_dim]
        return attention, weight
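You can exercise the class on random tensors to confirm that the shapes line up; the sizes are arbitrary, and a random tensor stands in for the context vector, which this class accepts but does not use:

attention_layer = DotProductAttention(hidden_dim=8)
input_seq = torch.randn(5, 2, 8)       # [seq_len, batch_size, hidden_dim]
output_token = torch.randn(1, 2, 8)    # current decoder state
context_vector = torch.randn(1, 2, 8)
attention, weight = attention_layer(input_seq, output_token, context_vector)
print(attention.shape, weight.shape)   # [1, 2, 16], [2, 1, 5]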
3.2. Multi-Head Attention
In this subsection, you will learn more about multi-head attention, which is a technique that allows the model to attend to different aspects or features of the input sequence by using multiple attention heads. You will also see how to implement it using PyTorch.
Multi-head attention is a type of attention mechanism that splits the input token and the output token into multiple sub-vectors, each of which is processed by a separate attention head. The attention heads can learn different patterns or relationships between the input and output tokens, such as syntactic, semantic, or positional information. The outputs of the attention heads are then concatenated and projected to form the final attention vector.
Multi-head attention can improve the performance and accuracy of sequence to sequence models, especially for complex and diverse tasks such as machine translation. It can also increase the parallelizability and efficiency of the model, since the attention heads can be computed in parallel.
To implement multi-head attention using PyTorch, we need to define a class called MultiHeadAttention. The MultiHeadAttention class will inherit from torch.nn.Module, which is the base class for all neural network modules in PyTorch. The MultiHeadAttention class will have the following attributes and methods:
- hidden_dim: the dimension of the hidden state of the RNN, which is also the dimension of the input token and the output token.
- num_heads: the number of attention heads to use.
- head_dim: the dimension of each sub-vector, which is equal to hidden_dim / num_heads.
- dot_product_attention: an instance of the DotProductAttention class that we defined in the previous subsection.
- fc: a linear layer that projects the concatenated output of the attention heads to the hidden dimension.
- forward: a method that takes the input sequence, the output token, and the context vector as inputs and returns the attention vector and the attention weights.
The code for the MultiHeadAttention class is as follows:
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, hidden_dim, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.hidden_dim = hidden_dim
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        self.dot_product_attention = DotProductAttention(self.head_dim)
        # each head returns a vector of size 2 * head_dim, so the concatenation has size 2 * hidden_dim
        self.fc = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, input_seq, output_token, context_vector):
        # input_seq: [seq_len, batch_size, hidden_dim]
        # output_token: [1, batch_size, hidden_dim]
        # context_vector: [1, batch_size, hidden_dim]
        batch_size = input_seq.shape[1]
        # split the last dimension into num_heads sub-vectors of size head_dim
        input_seq = input_seq.view(-1, batch_size, self.num_heads, self.head_dim)
        # input_seq: [seq_len, batch_size, num_heads, head_dim]
        input_seq = input_seq.permute(1, 2, 0, 3)
        # input_seq: [batch_size, num_heads, seq_len, head_dim]
        output_token = output_token.view(-1, batch_size, self.num_heads, self.head_dim)
        # output_token: [1, batch_size, num_heads, head_dim]
        output_token = output_token.permute(1, 2, 0, 3)
        # output_token: [batch_size, num_heads, 1, head_dim]
        context_vector = context_vector.view(-1, batch_size, self.num_heads, self.head_dim)
        # context_vector: [1, batch_size, num_heads, head_dim]
        context_vector = context_vector.permute(1, 2, 0, 3)
        # context_vector: [batch_size, num_heads, 1, head_dim]
        attention = []
        weights = []
        for i in range(self.num_heads):
            # process each attention head separately; DotProductAttention expects
            # sequence-first tensors, so permute each slice back to [len, batch, head_dim]
            input_seq_i = input_seq[:, i, :, :].permute(1, 0, 2)
            # input_seq_i: [seq_len, batch_size, head_dim]
            output_token_i = output_token[:, i, :, :].permute(1, 0, 2)
            # output_token_i: [1, batch_size, head_dim]
            context_vector_i = context_vector[:, i, :, :].permute(1, 0, 2)
            # context_vector_i: [1, batch_size, head_dim]
            attention_i, weight_i = self.dot_product_attention(input_seq_i, output_token_i, context_vector_i)
            # attention_i: [1, batch_size, 2 * head_dim]
            # weight_i: [batch_size, 1, seq_len]
            attention.append(attention_i.permute(1, 0, 2))
            # appended attention_i: [batch_size, 1, 2 * head_dim]
            weights.append(weight_i)
        attention = torch.cat(attention, dim=1)
        # attention: [batch_size, num_heads, 2 * head_dim]
        attention = attention.reshape(batch_size, 1, -1).permute(1, 0, 2)
        # attention: [1, batch_size, 2 * hidden_dim]
        attention = self.fc(attention)
        # attention: [1, batch_size, hidden_dim]
        weights = torch.cat(weights, dim=1)
        # weights: [batch_size, num_heads, seq_len]
        return attention, weights
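A quick shape check with arbitrary sizes (hidden_dim must be divisible by num_heads) looks like this:

multi_head = MultiHeadAttention(hidden_dim=8, num_heads=2)
input_seq = torch.randn(5, 2, 8)       # [seq_len, batch_size, hidden_dim]
output_token = torch.randn(1, 2, 8)
context_vector = torch.randn(1, 2, 8)
attention, weights = multi_head(input_seq, output_token, context_vector)
print(attention.shape, weights.shape)  # [1, 2, 8], [2, 2, 5]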
4. Machine Translation with PyTorch
In this section, you will learn how to use PyTorch to build a sequence to sequence model with dot-product attention and multi-head attention for machine translation tasks. You will also see how to train and evaluate the model on a machine translation dataset and generate translations for new sentences using the model.
Machine translation is the task of automatically translating a text from one language to another. It is one of the most challenging and popular applications of sequence to sequence models and attention mechanisms. In this tutorial, we will use the Europarl dataset, which contains parallel sentences from the proceedings of the European Parliament in 21 languages. We will focus on the English-French pair, but you can easily adapt the code to other language pairs.
To perform machine translation with PyTorch, we need to follow these steps:
- Data preparation and preprocessing: We need to load the dataset, split it into train, validation, and test sets, tokenize the sentences, build the vocabularies, and convert the tokens to indices.
- Model implementation and training: We need to define the encoder, decoder, and attention classes, instantiate the model, define the loss function and the optimizer, and train the model on the train set.
- Evaluation and inference: We need to evaluate the model on the validation and test sets, calculate the metrics such as BLEU score, and generate translations for new sentences using the model.
In the next subsections, you will learn more about each step and see the code examples using PyTorch.
4.1. Data Preparation and Preprocessing
In this subsection, you will learn how to load the dataset, split it into train, validation, and test sets, tokenize the sentences, build the vocabularies, and convert the tokens to indices. These steps are essential for preparing and preprocessing the data for machine translation with PyTorch.
The dataset we will use is the Europarl dataset, which contains parallel sentences from the proceedings of the European Parliament in 21 languages. We will focus on the English-French pair, but you can easily adapt the code to other language pairs. The dataset is available on the Europarl website at statmt.org/europarl. You can download and extract the French-English files using the following commands:
!wget https://www.statmt.org/europarl/v7/fr-en.tgz
!tar -xvzf fr-en.tgz
After extracting the files, you will have two text files: europarl-v7.fr-en.en and europarl-v7.fr-en.fr, which contain the English and French sentences, respectively. Each line in the files corresponds to a sentence, and the sentences are aligned across the files. For example, the first line in the English file is:
"Resumption of the session"
And the first line in the French file is:
"Reprise de la session"
These are the translations of each other. The dataset contains about 2 million sentences, which is quite large for our tutorial. Therefore, we will use only a small fraction of the dataset, say 10,000 sentences, for faster training and evaluation. You can use the following commands to create smaller files with 10,000 sentences each:
!head -n 10000 europarl-v7.fr-en.en > small.en
!head -n 10000 europarl-v7.fr-en.fr > small.fr
Now, we have two smaller files: small.en and small.fr, which contain the English and French sentences, respectively. We can use the torchtext library to load and process the data. Torchtext is a PyTorch library that provides tools and datasets for NLP. We will use the Field class to define how to tokenize and numericalize the sentences. We will also use the TranslationDataset class to load the parallel sentences and create the vocabularies. Finally, we will use the BucketIterator class to create iterators that can batch and pad the sentences. Note that Field, TranslationDataset, and BucketIterator belong to the legacy torchtext API: in torchtext 0.9-0.11 they live under torchtext.legacy, and they were removed in later releases, so you may need an older torchtext version (such as 0.8) to run the code as written. You will also need the spacy models for English and French (en_core_web_sm and fr_core_news_sm), which you can install with python -m spacy download.
The code for data preparation and preprocessing is as follows:
import torch
import torchtext
from torchtext.data import Field, BucketIterator
from torchtext.datasets import TranslationDataset

# define the source and target fields
SRC = Field(tokenize = "spacy",                      # use the spacy tokenizer
            tokenizer_language = "en_core_web_sm",   # use the English tokenizer
            init_token = "<sos>",                    # add a start-of-sentence token
            eos_token = "<eos>",                     # add an end-of-sentence token
            lower = True)                            # lowercase the sentences
TRG = Field(tokenize = "spacy",                      # use the spacy tokenizer
            tokenizer_language = "fr_core_news_sm",  # use the French tokenizer
            init_token = "<sos>",                    # add a start-of-sentence token
            eos_token = "<eos>",                     # add an end-of-sentence token
            lower = True)                            # lowercase the sentences

# load the dataset (path is the common prefix of the two files: small.en and small.fr)
dataset = TranslationDataset(path = "small",
                             exts = (".en", ".fr"),  # the extensions of the files
                             fields = (SRC, TRG))    # the fields to use

# split the dataset into train, validation, and test sets
# use 80% for train, 10% for validation, and 10% for test
train_data, valid_data, test_data = dataset.split(split_ratio = [0.8, 0.1, 0.1])

# build the vocabularies, keeping only words that appear at least twice in the train set
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

# create the iterators
BATCH_SIZE = 64
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")   # use GPU if available
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),   # the datasets to use
    batch_size = BATCH_SIZE,
    device = device,
    sort_within_batch = True,              # sort the sentences within each batch by length
    sort_key = lambda x: len(x.src))       # use the length of the source sentence as the sorting key
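Before moving on, it is worth peeking at what the preprocessing produced. Assuming the fields and iterators defined above, the following prints the vocabulary sizes, a couple of special-token indices, and the shape of one batch:

print(f"English vocabulary size: {len(SRC.vocab)}")
print(f"French vocabulary size: {len(TRG.vocab)}")
print(f"<sos> index: {TRG.vocab.stoi[TRG.init_token]}, <pad> index: {TRG.vocab.stoi[TRG.pad_token]}")

batch = next(iter(train_iterator))
print(batch.src.shape, batch.trg.shape)   # [src_len, batch_size], [trg_len, batch_size]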
Now, we have the data ready for machine translation with PyTorch. In the next subsection, you will learn how to implement and train the model.
4.2. Model Implementation and Training
In this subsection, you will learn how to implement a sequence to sequence model with dot-product attention and multi-head attention using PyTorch. You will also learn how to train the model on a machine translation dataset and monitor the training progress.
The first step is to import the necessary modules and libraries. You will need PyTorch for building and training the model, torchtext for loading and processing the data, ignite for managing the training and validation loops, and matplotlib for plotting the results. You can use the following code to import them:
import torch
import torch.nn as nn
import torch.optim as optim
import torchtext
import ignite
import matplotlib.pyplot as plt
The next step is to define the model architecture. You will use a standard encoder-decoder architecture with a bidirectional LSTM encoder and a unidirectional LSTM decoder. You will also use an embedding layer to convert the input and output tokens into dense vectors, and a linear layer to project the decoder outputs into logits. You can use the following code to define the model class:
class Seq2Seq(nn.Module):
    def __init__(self, input_dim, output_dim, emb_dim, hid_dim, n_layers, dropout, attention):
        super().__init__()
        # Define the encoder (bidirectional LSTM)
        self.encoder = nn.LSTM(emb_dim, hid_dim, n_layers, bidirectional=True, dropout=dropout)
        # Define the decoder (unidirectional LSTM over the concatenated directions)
        self.decoder = nn.LSTM(emb_dim, hid_dim * 2, n_layers, dropout=dropout)
        # Define the embedding layer for the input
        self.input_embedding = nn.Embedding(input_dim, emb_dim)
        # Define the embedding layer for the output
        self.output_embedding = nn.Embedding(output_dim, emb_dim)
        # Define the linear layer for the output logits
        self.output_linear = nn.Linear(hid_dim * 2, output_dim)
        # Define the attention mechanism
        self.attention = attention
        # Define the dropout layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, output, teacher_forcing_ratio=0.5):
        # input: [input_len, batch_size]
        # output: [output_len, batch_size]
        # Get the input and output lengths
        input_len, batch_size = input.shape
        output_len, _ = output.shape
        # Get the output dimension
        output_dim = self.output_linear.out_features
        # Initialize the encoder hidden and cell states
        hidden, cell = self.init_hidden_cell(batch_size)
        # Embed the input tokens
        embedded_input = self.dropout(self.input_embedding(input))
        # embedded_input: [input_len, batch_size, emb_dim]
        # Encode the input sequence
        encoder_outputs, (hidden, cell) = self.encoder(embedded_input, (hidden, cell))
        # encoder_outputs: [input_len, batch_size, hid_dim * 2]
        # hidden, cell: [n_layers * 2, batch_size, hid_dim]
        # Concatenate the forward and backward states of each layer so they match
        # the decoder, which expects [n_layers, batch_size, hid_dim * 2]
        hidden = hidden.view(self.encoder.num_layers, 2, batch_size, -1)
        hidden = torch.cat((hidden[:, 0], hidden[:, 1]), dim=-1)
        cell = cell.view(self.encoder.num_layers, 2, batch_size, -1)
        cell = torch.cat((cell[:, 0], cell[:, 1]), dim=-1)
        # Initialize the decoder output tensor
        decoder_outputs = torch.zeros(output_len, batch_size, output_dim).to(self.device)
        # Get the first decoder input (start-of-sequence token)
        decoder_input = output[0]
        # decoder_input: [batch_size]
        # Loop over the output sequence
        for t in range(1, output_len):
            # Embed the decoder input
            embedded_output = self.dropout(self.output_embedding(decoder_input.unsqueeze(0)))
            # embedded_output: [1, batch_size, emb_dim]
            # Decode the output token
            decoder_output, (hidden, cell) = self.decoder(embedded_output, (hidden, cell))
            # decoder_output: [1, batch_size, hid_dim * 2]
            # hidden, cell: [n_layers, batch_size, hid_dim * 2]
            # Apply the attention mechanism
            attention_output, attention_weights = self.attention(decoder_output, encoder_outputs)
            # attention_output: [1, batch_size, hid_dim * 2]
            # attention_weights: [batch_size, 1, input_len]
            # Project the attention output into logits
            decoder_logits = self.output_linear(attention_output)
            # decoder_logits: [1, batch_size, output_dim]
            # Store the decoder output
            decoder_outputs[t] = decoder_logits.squeeze(0)
            # Decide whether to use teacher forcing or not
            teacher_force = torch.rand(1).item() < teacher_forcing_ratio
            # Get the next decoder input
            decoder_input = output[t] if teacher_force else decoder_logits.squeeze(0).argmax(-1)
            # decoder_input: [batch_size]
        # Return the decoder outputs
        return decoder_outputs

    def init_hidden_cell(self, batch_size):
        # Initialize the hidden and cell states with zeros
        hidden = torch.zeros(self.encoder.num_layers * 2, batch_size, self.encoder.hidden_size).to(self.device)
        cell = torch.zeros(self.encoder.num_layers * 2, batch_size, self.encoder.hidden_size).to(self.device)
        return hidden, cell

    @property
    def device(self):
        # Return the device of the model
        return next(self.parameters()).device
The third step is to define the attention mechanism. You will use two types of attention: dot-product attention and multi-head attention. Dot-product attention computes the similarity between the decoder output and the encoder outputs using a dot product, and then applies a softmax to obtain the attention weights. Multi-head attention splits the decoder output and the encoder outputs into multiple heads, applies dot-product attention to each head, and then concatenates the results. You can use the following code to define the attention classes:
class DotProductAttention(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, decoder_output, encoder_outputs):
        # decoder_output: [1, batch_size, hid_dim * 2]
        # encoder_outputs: [input_len, batch_size, hid_dim * 2]
        # Compute the dot product between the decoder output and the encoder outputs
        dot_product = torch.matmul(decoder_output.permute(1, 0, 2), encoder_outputs.permute(1, 2, 0))
        # dot_product: [batch_size, 1, input_len]
        # Apply a softmax to get the attention weights
        attention_weights = nn.functional.softmax(dot_product, dim=-1)
        # attention_weights: [batch_size, 1, input_len]
        # Multiply the attention weights with the encoder outputs to get the weighted sum
        weighted_sum = torch.matmul(attention_weights, encoder_outputs.permute(1, 0, 2))
        # weighted_sum: [batch_size, 1, hid_dim * 2]
        # Permute the weighted sum to match the decoder output shape
        attention_output = weighted_sum.permute(1, 0, 2)
        # attention_output: [1, batch_size, hid_dim * 2]
        # Return the attention output and the attention weights
        return attention_output, attention_weights


class MultiHeadAttention(nn.Module):
    def __init__(self, hid_dim, n_heads):
        super().__init__()
        # Check that the hidden dimension is divisible by the number of heads
        assert hid_dim % n_heads == 0
        # Define the number of dimensions per head
        self.d_head = hid_dim // n_heads
        # Define the number of heads
        self.n_heads = n_heads
        # Define the linear layers for projecting the decoder output and the encoder outputs
        # (both have size hid_dim * 2 because the encoder is bidirectional)
        self.decoder_linear = nn.Linear(hid_dim * 2, hid_dim * 2)
        self.encoder_linear = nn.Linear(hid_dim * 2, hid_dim * 2)
        # Define the final linear layer
        self.final_linear = nn.Linear(hid_dim * 2, hid_dim * 2)

    def split_heads(self, x):
        # x: [len, batch_size, hid_dim * 2]
        length, batch_size, _ = x.shape
        x = x.view(length, batch_size, self.n_heads, self.d_head * 2)
        # x: [len, batch_size, n_heads, d_head * 2]
        return x.permute(2, 1, 0, 3)
        # returned x: [n_heads, batch_size, len, d_head * 2]

    def forward(self, decoder_output, encoder_outputs):
        # decoder_output: [1, batch_size, hid_dim * 2]
        # encoder_outputs: [input_len, batch_size, hid_dim * 2]
        # Project the decoder output and the encoder outputs
        decoder_output = self.decoder_linear(decoder_output)
        encoder_outputs = self.encoder_linear(encoder_outputs)
        # decoder_output: [1, batch_size, hid_dim * 2]
        # encoder_outputs: [input_len, batch_size, hid_dim * 2]
        # Split the decoder output and the encoder outputs into multiple heads
        decoder_output = self.split_heads(decoder_output)
        encoder_outputs = self.split_heads(encoder_outputs)
        # decoder_output: [n_heads, batch_size, 1, d_head * 2]
        # encoder_outputs: [n_heads, batch_size, input_len, d_head * 2]
        # Compute the dot product between the decoder output and the encoder outputs
        dot_product = torch.matmul(decoder_output, encoder_outputs.permute(0, 1, 3, 2))
        # dot_product: [n_heads, batch_size, 1, input_len]
        # Apply a softmax to get the attention weights
        attention_weights = nn.functional.softmax(dot_product, dim=-1)
        # attention_weights: [n_heads, batch_size, 1, input_len]
        # Multiply the attention weights with the encoder outputs to get the weighted sum per head
        weighted_sum = torch.matmul(attention_weights, encoder_outputs)
        # weighted_sum: [n_heads, batch_size, 1, d_head * 2]
        # Concatenate the heads and apply the final linear layer
        attention_output = weighted_sum.permute(2, 1, 0, 3).reshape(1, -1, self.n_heads * self.d_head * 2)
        # attention_output: [1, batch_size, hid_dim * 2]
        attention_output = self.final_linear(attention_output)
        # Return the attention output and the attention weights
        return attention_output, attention_weights
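Before training on real data, it can be reassuring to push a dummy batch through the model and check that the output has the expected shape. The numbers below are small arbitrary values used only for this check, and dot-product attention is used for simplicity:

dummy_model = Seq2Seq(input_dim=30, output_dim=40, emb_dim=16, hid_dim=32,
                      n_layers=2, dropout=0.5, attention=DotProductAttention())
src = torch.randint(0, 30, (7, 4))   # [input_len=7, batch_size=4]
trg = torch.randint(0, 40, (9, 4))   # [output_len=9, batch_size=4]
out = dummy_model(src, trg)
print(out.shape)                     # torch.Size([9, 4, 40])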
4.3. Evaluation and Inference
In this subsection, you will learn how to evaluate and test the sequence to sequence model with dot-product attention and multi-head attention using PyTorch. You will also learn how to generate translations for new sentences using the model.
The first step is to define the evaluation metrics and the inference method. You will use two metrics to measure the quality of the translations: BLEU score and loss. BLEU score is a widely used metric that compares the generated translations with the reference translations based on the n-gram matches. Loss is the cross-entropy loss that measures how well the model predicts the correct tokens. You will use the torchtext.data.metrics.bleu_score function and the torch.nn.CrossEntropyLoss class to compute these metrics. You will also use a simple greedy search method to generate the translations. This method selects the token with the highest probability at each time step and feeds it to the next time step. You can use the following code to define the evaluation and inference functions:
def evaluate(model, data_iterator, criterion, target_field, device):
    # Set the model to evaluation mode
    model.eval()
    # Initialize the running loss and the batch counter
    total_loss = 0
    n_batches = 0
    # Initialize the lists of predictions and references
    predictions = []
    references = []
    # Loop over the validation data without computing gradients
    with torch.no_grad():
        for batch in data_iterator:
            # Get the source and target sequences from the batch
            source = batch.src.to(device)
            target = batch.trg.to(device)
            # source: [source_len, batch_size]
            # target: [target_len, batch_size]
            # Get the output from the model (no teacher forcing during evaluation)
            output = model(source, target, teacher_forcing_ratio=0)
            # output: [target_len, batch_size, output_dim]
            # Keep the predicted and reference token indices before flattening
            predicted_indices = output[1:].argmax(-1)
            reference_indices = target[1:]
            # predicted_indices, reference_indices: [target_len - 1, batch_size]
            # Calculate the loss on the flattened output and target
            flat_output = output[1:].reshape(-1, output.shape[-1])
            flat_target = target[1:].reshape(-1)
            total_loss += criterion(flat_output, flat_target).item()
            n_batches += 1
            # Convert the indices to token strings for the BLEU score
            for i in range(predicted_indices.shape[1]):
                prediction = [target_field.vocab.itos[idx] for idx in predicted_indices[:, i].tolist()]
                reference = [target_field.vocab.itos[idx] for idx in reference_indices[:, i].tolist()]
                predictions.append(prediction)
                # bleu_score expects a list of reference translations per sentence
                references.append([reference])
    # Calculate the average loss per batch
    loss = total_loss / max(n_batches, 1)
    # Calculate the BLEU score
    bleu = torchtext.data.metrics.bleu_score(predictions, references)
    # Return the loss and the BLEU score
    return loss, bleu


def infer(model, source, source_field, target_field, device, max_len=50):
    # Set the model to evaluation mode
    model.eval()
    # Tokenize the source sequence
    tokens = source_field.tokenize(source)
    # Add the start-of-sequence and end-of-sequence tokens
    tokens = [source_field.init_token] + tokens + [source_field.eos_token]
    # Convert the tokens to indices
    indices = [source_field.vocab.stoi[token] for token in tokens]
    # Convert the indices to a tensor
    source = torch.LongTensor(indices).unsqueeze(1).to(device)
    # source: [source_len, 1]
    # Build a placeholder target that starts with the start-of-sequence token; with
    # teacher_forcing_ratio=0 the model only reads its first token and its length,
    # and feeds its own greedy prediction to each following time step
    sos_index = target_field.vocab.stoi[target_field.init_token]
    eos_index = target_field.vocab.stoi[target_field.eos_token]
    target = torch.full((max_len, 1), sos_index, dtype=torch.long, device=device)
    # target: [max_len, 1]
    with torch.no_grad():
        output = model(source, target, teacher_forcing_ratio=0)
        # output: [max_len, 1, output_dim]
    # Greedy search: take the most likely token at each time step, stop at the end-of-sequence token
    output_tokens = []
    for t in range(1, max_len):
        token = output[t].argmax(-1).item()
        if token == eos_index:
            break
        output_tokens.append(token)
    # Convert the output tokens to strings
    output_strings = [target_field.vocab.itos[token] for token in output_tokens]
    # Join the output strings into a sentence
    output_sentence = ' '.join(output_strings)
    # Return the output sentence
    return output_sentence
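Once the model has been trained (the training code follows in the next step), a call to infer would look like this, reusing the SRC and TRG fields from the data preparation step; the quality of the printed translation naturally depends on how well the model was trained:

sentence = "Resumption of the session"
translation = infer(model, sentence, SRC, TRG, device)
print(f"EN: {sentence}")
print(f"FR: {translation}")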
The final step is to train and test the model. You will use the ignite library to manage the training and validation loops, and to log the metrics and the results. You will also use the ignite.contrib.handlers.TensorboardLogger class to visualize the metrics and the results in TensorBoard. You can use the following code to train and test the model:
# Make sure the ignite submodules used below are imported
import ignite.engine
import ignite.metrics
import ignite.contrib.handlers

# Define the hyperparameters
source_field, target_field = SRC, TRG   # the fields defined in the data preparation step
input_dim = len(source_field.vocab)
output_dim = len(target_field.vocab)
emb_dim = 256
hid_dim = 512
n_layers = 2
dropout = 0.5
n_heads = 8
batch_size = 128
n_epochs = 10
learning_rate = 0.001
clip = 1

# Create the model
model = Seq2Seq(input_dim, output_dim, emb_dim, hid_dim, n_layers, dropout,
                MultiHeadAttention(hid_dim, n_heads)).to(device)

# Create the optimizer
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Create the loss function (ignore the padding token)
criterion = nn.CrossEntropyLoss(ignore_index=target_field.vocab.stoi[target_field.pad_token])

# Create the data iterators
train_iterator, valid_iterator, test_iterator = torchtext.data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=batch_size,
    device=device,
    sort_within_batch=True,
    sort_key=lambda x: len(x.src))

# Define the training step
def train_step(engine, batch):
    # Set the model to training mode
    model.train()
    # Get the source and target sequences from the batch
    source = batch.src.to(device)
    target = batch.trg.to(device)
    # source: [source_len, batch_size]
    # target: [target_len, batch_size]
    # Get the output from the model
    output = model(source, target)
    # output: [target_len, batch_size, output_dim]
    # Reshape the output and the target
    output = output[1:].reshape(-1, output.shape[-1])
    target = target[1:].reshape(-1)
    # Calculate the loss
    loss = criterion(output, target)
    # Clear the gradients
    optimizer.zero_grad()
    # Compute the gradients
    loss.backward()
    # Clip the gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
    # Update the parameters
    optimizer.step()
    # Return the loss
    return loss.item()

# Define the validation step
def validation_step(engine, batch):
    # Set the model to evaluation mode
    model.eval()
    with torch.no_grad():
        # Get the source and target sequences from the batch
        source = batch.src.to(device)
        target = batch.trg.to(device)
        # Get the output from the model (no teacher forcing during validation)
        output = model(source, target, teacher_forcing_ratio=0)
        # Reshape the output and the target for the metrics
        output = output[1:].reshape(-1, output.shape[-1])
        target = target[1:].reshape(-1)
    # Return the predictions and the targets so that attached metrics can use them
    return output, target

# Create the trainer and the evaluator from the step functions
trainer = ignite.engine.Engine(train_step)
evaluator = ignite.engine.Engine(validation_step)

# Attach a loss metric to the evaluator
ignite.metrics.Loss(criterion).attach(evaluator, "loss")

# Create the TensorBoard logger
tb_logger = ignite.contrib.handlers.TensorboardLogger(log_dir="logs")

# Attach the logger to the trainer to log the batch loss
tb_logger.attach_output_handler(trainer,
                                event_name=ignite.engine.Events.ITERATION_COMPLETED,
                                tag="training",
                                output_transform=lambda loss: {"batch_loss": loss})

# Attach the logger to the evaluator to log the validation loss
tb_logger.attach_output_handler(evaluator,
                                event_name=ignite.engine.Events.COMPLETED,
                                tag="validation",
                                metric_names=["loss"])

# Run the evaluator on the validation set at the end of every epoch
@trainer.on(ignite.engine.Events.EPOCH_COMPLETED)
def run_validation(engine):
    evaluator.run(valid_iterator)
    print(f"Epoch {engine.state.epoch}: validation loss = {evaluator.state.metrics['loss']:.3f}")

# Train the model
trainer.run(train_iterator, max_epochs=n_epochs)
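Once training finishes, you can reuse the evaluate helper defined above to report the loss and the BLEU score on the held-out test set; the exact numbers will depend on how much data and how many epochs you used:

test_loss, test_bleu = evaluate(model, test_iterator, criterion, TRG, device)
print(f"Test loss: {test_loss:.3f}")
print(f"Test BLEU: {test_bleu * 100:.2f}")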
5. Conclusion
In this blog, you have learned how to use PyTorch to build a sequence to sequence model with attention mechanisms for machine translation tasks. You have covered the following topics:
- The basic concepts and components of sequence to sequence models and attention mechanisms.
- The encoder-decoder architecture and the different types of attention mechanisms, such as dot-product attention and multi-head attention.
- The data preparation and preprocessing steps, such as tokenization, vocabulary building, and batching.
- The model implementation and training steps, such as defining the model class, the optimizer, the loss function, and the training loop.
- The evaluation and inference steps, such as defining the metrics, the inference method, and the testing loop.
You have also seen how to use the ignite library to manage the training and validation loops, to log the metrics and the results, and to visualize them in TensorBoard with the ignite.contrib.handlers.TensorboardLogger class.
By the end of this blog, you have built a sequence to sequence model with attention mechanisms that can translate sentences from English to French. You have also tested the model on some example sentences and seen the generated translations.
We hope you have enjoyed this blog and learned something new and useful. If you have any questions or feedback, please feel free to leave a comment below. Thank you for reading!