PyTorch for NLP: Sequence to Sequence Models and Attention

This blog teaches you how to use PyTorch to build sequence to sequence models with attention mechanisms for machine translation tasks.

1. Introduction

In this blog, you will learn how to use PyTorch to build sequence to sequence models with attention mechanisms for machine translation tasks.

Sequence to sequence models are neural networks that map a variable-length input sequence to a variable-length output sequence. They are widely used for natural language processing (NLP) tasks such as machine translation, text summarization, speech recognition, and dialogue generation.

Attention mechanisms allow the model to focus on the most relevant parts of the input sequence when generating each output token. They can improve the performance and accuracy of sequence to sequence models, especially for long sequences.

Machine translation is the task of automatically translating a text from one language to another. It is one of the most challenging and popular applications of sequence to sequence models and attention mechanisms.

PyTorch is an open-source framework that provides a flexible and easy-to-use platform for building and training deep learning models. Its ecosystem includes a rich set of tools and libraries for NLP, such as torchtext, ignite, and the Hugging Face transformers library.

By the end of this blog, you will be able to:

  • Understand the basic concepts and components of sequence to sequence models and attention mechanisms.
  • Implement a sequence to sequence model with dot-product attention and multi-head attention using PyTorch.
  • Train and evaluate the model on a machine translation dataset.
  • Generate translations for new sentences using the model.

Are you ready to dive into the world of sequence to sequence models and attention mechanisms with PyTorch? Let’s get started!

2. Sequence to Sequence Models

In this section, you will learn about the basic concepts and components of sequence to sequence models, which are the foundation of many NLP tasks such as machine translation.

A sequence to sequence model is a type of neural network model that can map a variable-length input sequence to a variable-length output sequence. For example, given a sentence in English as the input sequence, the model can generate a sentence in French as the output sequence.

A sequence to sequence model consists of two main parts: an encoder and a decoder. The encoder takes the input sequence and encodes it into a fixed-length vector, which is called the context vector. The context vector is supposed to capture the meaning and information of the input sequence. The decoder takes the context vector and generates the output sequence, one token at a time.

However, there are some challenges and limitations of using a simple sequence to sequence model. For instance, how can the model handle long sequences? How can the model deal with rare or unknown words? How can the model generate diverse and fluent outputs? To address these issues, researchers have proposed various techniques and extensions, such as teacher forcing, scheduled sampling, attention mechanisms, and multi-head attention.

In the next subsections, you will learn more about these techniques and how they can improve the performance and accuracy of sequence to sequence models.

2.1. Encoder-Decoder Architecture

In this subsection, you will learn more about the encoder-decoder architecture, which is the basic structure of sequence to sequence models. You will also see how to implement it using PyTorch.

The encoder-decoder architecture consists of two main components: an encoder and a decoder. The encoder is a neural network that takes the input sequence as a series of tokens and encodes it into a fixed-length vector, which is called the context vector. The context vector is supposed to capture the meaning and information of the input sequence. The decoder is another neural network that takes the context vector as input and generates the output sequence as a series of tokens, one at a time.

There are different types of neural networks that can be used as encoders and decoders, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformers. In this tutorial, we will use RNNs as our encoder and decoder, since they are well-suited for sequential data. RNNs are neural networks that have a hidden state that can store information from previous inputs. They can process variable-length sequences and learn long-term dependencies.

To implement the encoder-decoder architecture using PyTorch, we need to define two classes: Encoder and Decoder. The Encoder class will inherit from torch.nn.Module, which is the base class for all neural network modules in PyTorch. The Encoder class will have the following attributes and methods:

  • input_dim: the dimension of the input vocabulary, which is the number of unique tokens in the input language.
  • hidden_dim: the dimension of the hidden state of the RNN, which is also the dimension of the context vector.
  • embedding: an embedding layer that maps the input tokens to dense vectors of size hidden_dim.
  • rnn: an RNN (GRU) layer that processes the embedded input sequence and returns the outputs at every time step together with the final hidden state, which serves as the context vector.
  • forward: a method that takes the input sequence and returns the context vector.

The code for the Encoder class is as follows:

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(Encoder, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(input_dim, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim)

    def forward(self, input_seq):
        # input_seq: [seq_len, batch_size]
        embedded = self.embedding(input_seq)
        # embedded: [seq_len, batch_size, hidden_dim]
        output, hidden = self.rnn(embedded)
        # output: [seq_len, batch_size, hidden_dim]
        # hidden: [1, batch_size, hidden_dim]
        return hidden # context vector
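
As a quick sanity check, you can run the encoder on a batch of random token indices and inspect the shape of the context vector. The vocabulary size, hidden size, sequence length, and batch size below are illustrative values, not part of the tutorial's final configuration:

INPUT_DIM = 1000   # toy vocabulary size (illustrative)
HIDDEN_DIM = 256

encoder = Encoder(INPUT_DIM, HIDDEN_DIM)
dummy_input = torch.randint(0, INPUT_DIM, (12, 4)) # [seq_len=12, batch_size=4]
context = encoder(dummy_input)
print(context.shape) # torch.Size([1, 4, 256]) -- [1, batch_size, hidden_dim]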

2.2. Teacher Forcing and Scheduled Sampling

In this subsection, you will learn about two techniques that can improve the training and generation of sequence to sequence models: teacher forcing and scheduled sampling. You will also see how to implement them using PyTorch.

Teacher forcing is a technique that feeds the ground-truth output tokens to the decoder during training, instead of the tokens the decoder generated at the previous time step. This usually speeds up convergence and stabilizes training. However, it introduces exposure bias: a discrepancy between training, where the decoder always sees correct previous tokens, and inference, where it must rely on its own (possibly wrong) predictions.

Scheduled sampling is a technique that gradually reduces the use of teacher forcing and increases the use of the decoder’s own predictions during training. This helps the model learn to recover from its own mistakes and reduces exposure bias. However, scheduled sampling can also make optimization noisier, especially early in training when the model’s predictions are still poor.

To implement teacher forcing and scheduled sampling using PyTorch, we first need to define the Decoder class, the counterpart of the Encoder class from the previous subsection. The Decoder class will inherit from torch.nn.Module, which is the base class for all neural network modules in PyTorch. The Decoder class will have the following attributes and methods:

  • output_dim: the dimension of the output vocabulary, which is the number of unique tokens in the output language.
  • hidden_dim: the dimension of the hidden state of the RNN, which is the same as the dimension of the context vector.
  • embedding: an embedding layer that maps the output tokens to dense vectors of size hidden_dim.
  • rnn: an RNN (GRU) layer that takes the embedded target token and the previous hidden state (which is the context vector at the first step) and returns the output and the new hidden state.
  • fc: a linear layer that takes the output vector and maps it to the output vocabulary.
  • forward: a method that takes the current target token and the previous hidden state as inputs and returns the prediction over the output vocabulary and the new hidden state. The choice between feeding the ground-truth token or the model’s own prediction at the next step (teacher forcing versus sampling) is made in the decoding loop, which is sketched after the code.

The code for the Decoder class is as follows:

import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, output_dim, hidden_dim):
        super(Decoder, self).__init__()
        self.output_dim = output_dim
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(output_dim, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, input_token, hidden):
        # input_token: [batch_size] -- the current target token (ground truth or model prediction)
        # hidden: [1, batch_size, hidden_dim] -- the previous hidden state (the context vector at the first step)
        input_token = input_token.unsqueeze(0)
        # input_token: [1, batch_size]
        embedded = self.embedding(input_token)
        # embedded: [1, batch_size, hidden_dim]
        output, hidden = self.rnn(embedded, hidden)
        # output: [1, batch_size, hidden_dim]
        # hidden: [1, batch_size, hidden_dim]
        prediction = self.fc(output.squeeze(0))
        # prediction: [batch_size, output_dim]
        return prediction, hidden
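
To see how teacher forcing and scheduled sampling fit together, here is a minimal sketch of a decoding loop that combines the Encoder and Decoder defined above. The helper names decode_sequence and scheduled_ratio, and the linearly decaying schedule, are illustrative choices rather than a fixed API:

import random
import torch

def decode_sequence(encoder, decoder, input_seq, target_seq, teacher_forcing_ratio):
    # input_seq: [src_len, batch_size], target_seq: [trg_len, batch_size]
    trg_len, batch_size = target_seq.shape
    # encode the input sequence into the context vector
    hidden = encoder(input_seq)
    # the first decoder input is the start-of-sentence token (first row of the target)
    input_token = target_seq[0]
    outputs = torch.zeros(trg_len, batch_size, decoder.output_dim, device=input_seq.device)
    for t in range(1, trg_len):
        prediction, hidden = decoder(input_token, hidden)
        outputs[t] = prediction
        # teacher forcing: with probability teacher_forcing_ratio, feed the ground-truth
        # token at the next step; otherwise feed the model's own prediction
        if random.random() < teacher_forcing_ratio:
            input_token = target_seq[t]
        else:
            input_token = prediction.argmax(1)
    return outputs

# scheduled sampling: decay the teacher forcing ratio over the training epochs,
# for example linearly from 1.0 down to a floor of 0.25 (an illustrative schedule)
def scheduled_ratio(epoch, n_epochs, floor=0.25):
    return max(floor, 1.0 - epoch / n_epochs)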

3. Attention Mechanisms

In this section, you will learn about attention mechanisms, a technique that can enhance the performance and accuracy of sequence to sequence models for machine translation and other NLP tasks. You will also see how to implement two types of attention mechanisms using PyTorch: dot-product attention and multi-head attention.

Attention mechanisms allow the decoder to focus on the most relevant parts of the input sequence when generating each output token. They overcome the limitation of encoding the entire input sequence into a single fixed-length context vector, which causes information loss for long sequences. They can also improve the alignment and fluency of the output sequence, especially for language pairs with different word orders or structures.

The basic idea of attention mechanisms is to compute a score or a weight for each input token, based on its similarity or relevance to the output token. The score or weight is then used to calculate a weighted sum of the input tokens, which is called the attention vector. The attention vector is then concatenated with the output token and fed to the decoder to generate the next output token.

There are different ways to compute the score or weight for each input token, such as additive attention, multiplicative attention, and scaled dot-product attention. In this tutorial, we will use dot-product attention, which is a simple and efficient way to calculate the score by taking the dot product of the input token and the output token. We will also use multi-head attention, which is a technique that allows the model to attend to different aspects or features of the input sequence by using multiple attention heads.

In the next subsections, you will learn more about these two types of attention mechanisms and how to implement them using PyTorch.
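
Before diving into the class-based implementations, here is the core computation in isolation: a minimal sketch with a single query vector, where the sizes and variable names are illustrative:

import math
import torch

d = 4 # hidden size (illustrative)
query = torch.randn(d) # representation of the current output token
keys = torch.randn(6, d) # representations of the 6 input tokens
values = keys # in this simple setup the values are the input representations themselves
scores = keys @ query / math.sqrt(d) # one score per input token (scaled dot product)
weights = torch.softmax(scores, dim=0) # attention weights sum to 1
attention_vector = weights @ values # weighted sum of the input representations
print(weights.sum()) # tensor(1.) up to floating point error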

3.1. Dot-Product Attention

In this subsection, you will learn more about dot-product attention, a simple and efficient way to compute a score or weight for each input token based on its similarity or relevance to the current output token. You will also see how to implement it using PyTorch.

Dot-product attention is a type of attention mechanism that calculates the score by taking the dot product of the input token representation and the output token representation. The dot product is large when the two vectors point in similar directions (and have large magnitudes), so it can serve as a measure of similarity or relevance between the input token and the output token.

To implement dot-product attention using PyTorch, we need to define a class called DotProductAttention. The DotProductAttention class will inherit from torch.nn.Module, which is the base class for all neural network modules in PyTorch. The DotProductAttention class will have the following attributes and methods:

  • hidden_dim: the dimension of the hidden state of the RNN, which is also the dimension of the input token and the output token.
  • scale: a scaling factor that divides the dot product by the square root of the hidden dimension. This is to prevent the dot product from becoming too large or too small, which can cause numerical instability or gradient vanishing.
  • forward: a method that takes the input sequence, the output token, and the context vector as inputs and returns the attention vector and the attention weights.

The code for the DotProductAttention class is as follows:

import torch
import torch.nn as nn
import math

class DotProductAttention(nn.Module):
    def __init__(self, hidden_dim):
        super(DotProductAttention, self).__init__()
        self.hidden_dim = hidden_dim
        self.scale = math.sqrt(hidden_dim)

    def forward(self, input_seq, output_token, context_vector):
        # input_seq: [seq_len, batch_size, hidden_dim]
        # output_token: [1, batch_size, hidden_dim]
        # context_vector: [1, batch_size, hidden_dim] (not used here; kept so the interface matches multi-head attention)
        score = torch.bmm(output_token.permute(1, 0, 2), input_seq.permute(1, 2, 0)) / self.scale
        # score: [batch_size, 1, seq_len]
        weight = torch.softmax(score, dim=-1)
        # weight: [batch_size, 1, seq_len]
        attention = torch.bmm(weight, input_seq.permute(1, 0, 2))
        # attention: [batch_size, 1, hidden_dim]
        attention = attention.permute(1, 0, 2)
        # attention: [1, batch_size, hidden_dim]
        attention = torch.cat((attention, output_token), dim=2)
        # attention: [1, batch_size, 2 * hidden_dim]
        return attention, weight
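
A quick sanity check with random tensors (the sizes are illustrative) confirms the expected shapes:

seq_len, batch_size, hidden_dim = 10, 4, 256
attn = DotProductAttention(hidden_dim)
enc_outputs = torch.randn(seq_len, batch_size, hidden_dim)
dec_token = torch.randn(1, batch_size, hidden_dim)
context = torch.randn(1, batch_size, hidden_dim)
attention, weights = attn(enc_outputs, dec_token, context)
print(attention.shape) # torch.Size([1, 4, 512]) -- [1, batch_size, 2 * hidden_dim]
print(weights.shape)   # torch.Size([4, 1, 10])  -- [batch_size, 1, seq_len]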

3.2. Multi-Head Attention

In this subsection, you will learn more about multi-head attention, a technique that lets the model attend to different aspects or features of the input sequence by using multiple attention heads. You will also see how to implement it using PyTorch.

Multi-head attention is a type of attention mechanism that splits the input token and the output token into multiple sub-vectors, each of which is processed by a separate attention head. The attention heads can learn different patterns or relationships between the input and output tokens, such as syntactic, semantic, or positional information. The outputs of the attention heads are then concatenated and projected to form the final attention vector.

Multi-head attention can improve the performance and accuracy of sequence to sequence models, especially for complex and diverse tasks such as machine translation. It can also increase the parallelizability and efficiency of the model, since the attention heads can be computed in parallel.

To implement multi-head attention using PyTorch, we need to define a class called MultiHeadAttention. The MultiHeadAttention class will inherit from torch.nn.Module, which is the base class for all neural network modules in PyTorch. The MultiHeadAttention class will have the following attributes and methods:

  • hidden_dim: the dimension of the hidden state of the RNN, which is also the dimension of the input token and the output token.
  • num_heads: the number of attention heads to use.
  • head_dim: the dimension of each sub-vector, which is equal to hidden_dim / num_heads.
  • dot_product_attention: an instance of the DotProductAttention class that we defined in the previous subsection.
  • fc: a linear layer that projects the concatenated output of the attention heads to the hidden dimension.
  • forward: a method that takes the input sequence, the output token, and the context vector as inputs and returns the attention vector and the attention weights.

The code for the MultiHeadAttention class is as follows:

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, hidden_dim, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.hidden_dim = hidden_dim
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        self.dot_product_attention = DotProductAttention(self.head_dim)
        self.fc = nn.Linear(2 * hidden_dim, hidden_dim) # each head returns 2 * head_dim values, so the concatenation has size 2 * hidden_dim

    def forward(self, input_seq, output_token, context_vector):
        # input_seq: [seq_len, batch_size, hidden_dim]
        # output_token: [1, batch_size, hidden_dim]
        # context_vector: [1, batch_size, hidden_dim]
        batch_size = input_seq.shape[1]
        input_seq = input_seq.view(-1, batch_size, self.num_heads, self.head_dim)
        # input_seq: [seq_len, batch_size, num_heads, head_dim]
        input_seq = input_seq.permute(1, 2, 0, 3)
        # input_seq: [batch_size, num_heads, seq_len, head_dim]
        output_token = output_token.view(-1, batch_size, self.num_heads, self.head_dim)
        # output_token: [1, batch_size, num_heads, head_dim]
        output_token = output_token.permute(1, 2, 0, 3)
        # output_token: [batch_size, num_heads, 1, head_dim]
        context_vector = context_vector.view(-1, batch_size, self.num_heads, self.head_dim)
        # context_vector: [1, batch_size, num_heads, head_dim]
        context_vector = context_vector.permute(1, 2, 0, 3)
        # context_vector: [batch_size, num_heads, 1, head_dim]
        attention = []
        weights = []
        for i in range(self.num_heads):
            # process each attention head separately, restoring the
            # [seq_len, batch_size, head_dim] layout expected by DotProductAttention
            input_seq_i = input_seq[:, i].permute(1, 0, 2)
            # input_seq_i: [seq_len, batch_size, head_dim]
            output_token_i = output_token[:, i].permute(1, 0, 2)
            # output_token_i: [1, batch_size, head_dim]
            context_vector_i = context_vector[:, i].permute(1, 0, 2)
            # context_vector_i: [1, batch_size, head_dim]
            attention_i, weight_i = self.dot_product_attention(input_seq_i, output_token_i, context_vector_i)
            # attention_i: [1, batch_size, 2 * head_dim]
            # weight_i: [batch_size, 1, seq_len]
            attention.append(attention_i)
            weights.append(weight_i)
        attention = torch.cat(attention, dim=2)
        # attention: [1, batch_size, 2 * hidden_dim]
        attention = self.fc(attention)
        # attention: [1, batch_size, hidden_dim]
        weights = torch.cat(weights, dim=1)
        # weights: [batch_size, num_heads, seq_len]
        return attention, weights
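
As before, a quick sanity check with random tensors (illustrative sizes; hidden_dim must be divisible by num_heads) confirms the expected shapes. It relies on the DotProductAttention class from the previous subsection:

seq_len, batch_size, hidden_dim, num_heads = 10, 4, 256, 8
mha = MultiHeadAttention(hidden_dim, num_heads)
enc_outputs = torch.randn(seq_len, batch_size, hidden_dim)
dec_token = torch.randn(1, batch_size, hidden_dim)
context = torch.randn(1, batch_size, hidden_dim)
attention, weights = mha(enc_outputs, dec_token, context)
print(attention.shape) # torch.Size([1, 4, 256]) -- [1, batch_size, hidden_dim]
print(weights.shape)   # torch.Size([4, 8, 10])  -- [batch_size, num_heads, seq_len]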

4. Machine Translation with PyTorch

In this section, you will learn how to use PyTorch to build a sequence to sequence model with dot-product attention and multi-head attention for machine translation tasks. You will also see how to train and evaluate the model on a machine translation dataset and generate translations for new sentences using the model.

Machine translation is the task of automatically translating a text from one language to another. It is one of the most challenging and popular applications of sequence to sequence models and attention mechanisms. In this tutorial, we will use the Europarl dataset, which contains parallel sentences from the proceedings of the European Parliament in 21 languages. We will focus on the English-French pair, but you can easily adapt the code to other language pairs.

To perform machine translation with PyTorch, we need to follow these steps:

  1. Data preparation and preprocessing: We need to load the dataset, split it into train, validation, and test sets, tokenize the sentences, build the vocabularies, and convert the tokens to indices.
  2. Model implementation and training: We need to define the encoder, decoder, and attention classes, instantiate the model, define the loss function and the optimizer, and train the model on the train set.
  3. Evaluation and inference: We need to evaluate the model on the validation and test sets, calculate the metrics such as BLEU score, and generate translations for new sentences using the model.

In the next subsections, you will learn more about each step and see the code examples using PyTorch.

4.1. Data Preparation and Preprocessing

In this subsection, you will learn how to load the dataset, split it into train, validation, and test sets, tokenize the sentences, build the vocabularies, and convert the tokens to indices. These steps are essential for preparing and preprocessing the data for machine translation with PyTorch.

The dataset we will use is the Europarl dataset, which contains parallel sentences from the proceedings of the European Parliament in 21 languages. We will focus on the English-French pair, but you can easily adapt the code to other language pairs. The dataset is available at https://www.statmt.org/europarl/. You can download and extract the English-French files using the following commands:

!wget https://www.statmt.org/europarl/v7/fr-en.tgz
!tar -xvzf fr-en.tgz

After extracting the files, you will have two text files: europarl-v7.fr-en.en and europarl-v7.fr-en.fr, which contain the English and French sentences, respectively. Each line in the files corresponds to a sentence, and the sentences are aligned across the files. For example, the first line in the English file is:

"Resumption of the session"

And the first line in the French file is:

"Reprise de la session"

These are the translations of each other. The dataset contains about 2 million sentences, which is quite large for our tutorial. Therefore, we will use only a small fraction of the dataset, say 10,000 sentences, for faster training and evaluation. You can use the following commands to create smaller files with 10,000 sentences each:

!head -n 10000 europarl-v7.fr-en.en > small.en
!head -n 10000 europarl-v7.fr-en.fr > small.fr

Now, we have two smaller files: small.en and small.fr, which contain the English and French sentences, respectively. We can use the torchtext library to load and process the data. Torchtext is a PyTorch library that provides tools and datasets for NLP. We will use the Field class to define how to tokenize and numericalize the sentences, the TranslationDataset class to load the parallel sentences, and the build_vocab method of the fields to create the vocabularies. Finally, we will use the BucketIterator class to create iterators that batch and pad the sentences.

The code for data preparation and preprocessing is as follows:

import torch
import torchtext
from torchtext.data import Field, BucketIterator # note: in torchtext 0.9+ these classes live under torchtext.legacy
from torchtext.datasets import TranslationDataset

# define the source and target fields
SRC = Field(tokenize = "spacy", # use spacy tokenizer
            tokenizer_language = "en_core_web_sm", # use the English tokenizer
            init_token = "<sos>", # add a start-of-sentence token
            eos_token = "<eos>", # add an end-of-sentence token
            lower = True) # lowercase the sentences

TRG = Field(tokenize = "spacy", # use spacy tokenizer
            tokenizer_language = "fr_core_news_sm", # use the French tokenizer
            init_token = "<sos>", # add a start-of-sentence token
            eos_token = "<eos>", # add an end-of-sentence token
            lower = True) # lowercase the sentences

# load the dataset
dataset = TranslationDataset(path = "small", # the common prefix of the data files (small.en and small.fr)
                             exts = (".en", ".fr"), # the extensions of the files
                             fields = (SRC, TRG)) # the fields to use

# split the dataset into train, validation, and test sets
train_data, valid_data, test_data = dataset.split(split_ratio = [0.8, 0.1, 0.1]) # use 80% for train, 10% for validation, and 10% for test

# build the vocabularies
SRC.build_vocab(train_data, min_freq = 2) # use only the words that appear at least twice in the train set
TRG.build_vocab(train_data, min_freq = 2) # use only the words that appear at least twice in the train set

# create the iterators
BATCH_SIZE = 64 # the batch size
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # use GPU if available
train_iterator, valid_iterator, test_iterator = BucketIterator.splits((train_data, valid_data, test_data), # the datasets to use
                                                                     batch_size = BATCH_SIZE, # the batch size
                                                                     device = device, # the device to use
                                                                     sort_within_batch = True, # sort the sentences within each batch by length
                                                                     sort_key = lambda x: len(x.src)) # use the length of the source sentence as the sorting key
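
You can quickly verify that everything is wired up correctly by inspecting the vocabulary sizes and the shape of one batch. This is an optional sanity check, and the exact numbers will depend on the sentences in your split:

print(f"Source vocabulary size: {len(SRC.vocab)}")
print(f"Target vocabulary size: {len(TRG.vocab)}")
batch = next(iter(train_iterator))
print(batch.src.shape) # [src_len, batch_size]
print(batch.trg.shape) # [trg_len, batch_size]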

Now, we have the data ready for machine translation with PyTorch. In the next subsection, you will learn how to implement and train the model.

4.2. Model Implementation and Training

In this subsection, you will learn how to implement a sequence to sequence model with dot-product attention and multi-head attention using PyTorch. You will also learn how to train the model on a machine translation dataset and monitor the training progress.

The first step is to import the necessary modules and libraries. You will need PyTorch for building and training the model, torchtext for loading and processing the data, ignite for managing the training and validation loops, and matplotlib for plotting the results. You can use the following code to import them:

import torch
import torch.nn as nn
import torch.optim as optim
import torchtext
import ignite
from ignite.engine import Engine, Events
from ignite.contrib.handlers import TensorboardLogger
import matplotlib.pyplot as plt

The next step is to define the model architecture. You will use a standard encoder-decoder architecture with a bidirectional LSTM encoder and a unidirectional LSTM decoder. Because the encoder is bidirectional, its forward and backward hidden states are concatenated, so the decoder's hidden size is twice the encoder's. You will also use embedding layers to convert the input and output tokens into dense vectors, and a linear layer to project the decoder outputs into logits. You can use the following code to define the model class:

class Seq2Seq(nn.Module):
    def __init__(self, input_dim, output_dim, emb_dim, hid_dim, n_layers, dropout, attention):
        super().__init__()
        # Define the encoder
        self.encoder = nn.LSTM(emb_dim, hid_dim, n_layers, bidirectional=True, dropout=dropout)
        # Define the decoder
        self.decoder = nn.LSTM(emb_dim, hid_dim * 2, n_layers, dropout=dropout)
        # Define the embedding layer for the input
        self.input_embedding = nn.Embedding(input_dim, emb_dim)
        # Define the embedding layer for the output
        self.output_embedding = nn.Embedding(output_dim, emb_dim)
        # Define the linear layer for the output logits
        self.output_linear = nn.Linear(hid_dim * 2, output_dim)
        # Define the attention mechanism
        self.attention = attention
        # Define the dropout layer
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, input, output, teacher_forcing_ratio=0.5):
        # input: [input_len, batch_size]
        # output: [output_len, batch_size]
        # Get the input and output lengths
        input_len, batch_size = input.shape
        output_len, _ = output.shape
        # Get the output dimension
        output_dim = self.output_linear.out_features
        # Initialize the encoder hidden and cell states
        hidden, cell = self.init_hidden_cell(batch_size)
        # Embed the input tokens
        embedded_input = self.dropout(self.input_embedding(input))
        # embedded_input: [input_len, batch_size, emb_dim]
        # Encode the input sequence
        encoder_outputs, (hidden, cell) = self.encoder(embedded_input, (hidden, cell))
        # encoder_outputs: [input_len, batch_size, hid_dim * 2]
        # hidden: [n_layers * 2, batch_size, hid_dim]
        # cell: [n_layers * 2, batch_size, hid_dim]
        # Concatenate the forward and backward states of each layer so that
        # they match the decoder's hidden size (hid_dim * 2)
        hidden = hidden.view(self.encoder.num_layers, 2, batch_size, self.encoder.hidden_size)
        hidden = torch.cat((hidden[:, 0], hidden[:, 1]), dim=2)
        cell = cell.view(self.encoder.num_layers, 2, batch_size, self.encoder.hidden_size)
        cell = torch.cat((cell[:, 0], cell[:, 1]), dim=2)
        # hidden: [n_layers, batch_size, hid_dim * 2]
        # cell: [n_layers, batch_size, hid_dim * 2]
        # Initialize the decoder output tensor
        decoder_outputs = torch.zeros(output_len, batch_size, output_dim).to(self.device)
        # Get the first decoder input (start-of-sequence token)
        decoder_input = output[0]
        # decoder_input: [batch_size]
        # Loop over the output sequence
        for t in range(1, output_len):
            # Embed the decoder input
            embedded_output = self.dropout(self.output_embedding(decoder_input.unsqueeze(0)))
            # embedded_output: [1, batch_size, emb_dim]
            # Decode the output token
            decoder_output, (hidden, cell) = self.decoder(embedded_output, (hidden, cell))
            # decoder_output: [1, batch_size, hid_dim * 2]
            # hidden: [n_layers, batch_size, hid_dim * 2]
            # cell: [n_layers, batch_size, hid_dim * 2]
            # Apply the attention mechanism
            attention_output, attention_weights = self.attention(decoder_output, encoder_outputs)
            # attention_output: [1, batch_size, hid_dim * 2]
            # attention_weights: [batch_size, 1, input_len]
            # Project the attention output into logits
            decoder_output = self.output_linear(attention_output).squeeze(0)
            # decoder_output: [batch_size, output_dim]
            # Store the decoder output
            decoder_outputs[t] = decoder_output
            # Decide whether to use teacher forcing or not
            teacher_force = torch.rand(1).item() < teacher_forcing_ratio
            # Get the next decoder input
            decoder_input = output[t] if teacher_force else decoder_output.argmax(-1)
            # decoder_input: [batch_size]
        # Return the decoder outputs
        return decoder_outputs
    
    def init_hidden_cell(self, batch_size):
        # Initialize the hidden and cell states with zeros
        hidden = torch.zeros(self.encoder.num_layers * 2, batch_size, self.encoder.hidden_size).to(self.device)
        cell = torch.zeros(self.encoder.num_layers * 2, batch_size, self.encoder.hidden_size).to(self.device)
        return hidden, cell
    
    @property
    def device(self):
        # Return the device of the model
        return next(self.parameters()).device

The third step is to define the attention mechanism. You will use two types of attention: dot-product attention and multi-head attention. Dot-product attention computes the similarity between the decoder output and the encoder outputs using a dot product, and then applies a softmax to obtain the attention weights. Multi-head attention splits the decoder output and the encoder outputs into multiple heads, applies dot-product attention to each head, and then concatenates the results. You can use the following code to define the attention classes:

class DotProductAttention(nn.Module):
    def __init__(self):
        super().__init__()
    
    def forward(self, decoder_output, encoder_outputs):
        # decoder_output: [1, batch_size, hid_dim * 2]
        # encoder_outputs: [input_len, batch_size, hid_dim * 2]
        # Compute the dot product between the decoder output and the encoder outputs
        dot_product = torch.matmul(decoder_output.permute(1, 0, 2), encoder_outputs.permute(1, 2, 0))
        # dot_product: [batch_size, 1, input_len]
        # Apply a softmax to get the attention weights
        attention_weights = nn.functional.softmax(dot_product, dim=-1)
        # attention_weights: [batch_size, 1, input_len]
        # Multiply the attention weights with the encoder outputs to get the weighted sum
        weighted_sum = torch.matmul(attention_weights, encoder_outputs.permute(1, 0, 2))
        # weighted_sum: [batch_size, 1, hid_dim * 2]
        # Permute the weighted sum to match the decoder output shape
        attention_output = weighted_sum.permute(1, 0, 2)
        # attention_output: [1, batch_size, hid_dim * 2]
        # Return the attention output and the attention weights
        return attention_output, attention_weights

class MultiHeadAttention(nn.Module):
    def __init__(self, hid_dim, n_heads):
        super().__init__()
        # Check that the hidden dimension is divisible by the number of heads
        assert hid_dim % n_heads == 0
        # Define the number of dimensions per head
        self.d_head = hid_dim // n_heads
        # Define the number of heads
        self.n_heads = n_heads
        # Define the linear layers for projecting the decoder output and the encoder outputs
        self.decoder_linear = nn.Linear(hid_dim, hid_dim)
        self.encoder_linear = nn.Linear(hid_dim, hid_dim)
        # Define the final linear layer
        self.final_linear = nn.Linear(hid_dim, hid_dim)

    def split_heads(self, x):
        # x: [len, batch_size, n_heads * d_head]
        seq_len, batch_size, _ = x.shape
        x = x.view(seq_len, batch_size, self.n_heads, self.d_head)
        # x: [len, batch_size, n_heads, d_head]
        return x.permute(2, 1, 0, 3)
        # [n_heads, batch_size, len, d_head]

    def combine_heads(self, x):
        # x: [n_heads, batch_size, len, d_head]
        n_heads, batch_size, seq_len, d_head = x.shape
        x = x.permute(2, 1, 0, 3).reshape(seq_len, batch_size, n_heads * d_head)
        # x: [len, batch_size, n_heads * d_head]
        return x
    
    def forward(self, decoder_output, encoder_outputs):
        # decoder_output: [1, batch_size, hid_dim * 2]
        # encoder_outputs: [input_len, batch_size, hid_dim * 2]
        # Project the decoder output and the encoder outputs
        decoder_output = self.decoder_linear(decoder_output)
        encoder_outputs = self.encoder_linear(encoder_outputs)
        # decoder_output: [1, batch_size, hid_dim * 2]
        # encoder_outputs: [input_len, batch_size, hid_dim * 2]
        # Split the decoder output and the encoder outputs into multiple heads
        decoder_output = self.split_heads(decoder_output)
        encoder_outputs = self.split_heads(encoder_outputs)
        # decoder_output: [n_heads, batch_size, 1, d_head * 2]
        # encoder_outputs: [n_heads, batch_size, input_len, d_head * 2]
        # Compute the dot product between the decoder output and the encoder outputs
        dot_product = torch.matmul(decoder_output, encoder_outputs.permute(0, 1, 3, 2))
        # dot_product: [n_heads, batch_size, 1, input_len]
        # Apply a softmax to get the attention weights
        attention_weights = nn.functional.softmax(dot_product, dim=-1)
        # attention_weights: [n_heads, batch_size, 1, input_len]
        # Multiply the attention weights with the encoder outputs to get the weighted sum
        weighted_sum = torch.matmul(attention_weights, encoder_outputs)
        # weighted_sum: [n_heads, batch_size, 1, d_head * 2]
        # Combine the heads back into a single tensor
        attention_output = self.combine_heads(weighted_sum)
        # attention_output: [1, batch_size, hid_dim * 2]
        # Apply the final linear layer
        attention_output = self.final_linear(attention_output)
        # attention_output: [1, batch_size, hid_dim * 2]
        # Average the attention weights over the heads so they can be inspected per source token
        attention_weights = attention_weights.mean(dim=0)
        # attention_weights: [batch_size, 1, input_len]
        # Return the attention output and the attention weights
        return attention_output, attention_weights
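
At this point the full model can be sanity-checked with random data. The vocabulary sizes and tensor shapes below are illustrative, and the example uses the DotProductAttention class defined just above (the version without constructor arguments):

sanity_model = Seq2Seq(1000, 1200, emb_dim=64, hid_dim=128, n_layers=2, dropout=0.5, attention=DotProductAttention())
src = torch.randint(0, 1000, (15, 4))  # [input_len=15, batch_size=4]
trg = torch.randint(0, 1200, (12, 4))  # [output_len=12, batch_size=4]
out = sanity_model(src, trg)
print(out.shape) # torch.Size([12, 4, 1200]) -- [output_len, batch_size, output_dim]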

4.3. Evaluation and Inference

In this subsection, you will learn how to evaluate and test the sequence to sequence model with dot-product attention and multi-head attention using PyTorch. You will also learn how to generate translations for new sentences using the model.

The first step is to define the evaluation metrics and the inference method. You will use two metrics to measure the quality of the translations: BLEU score and loss. BLEU score is a widely used metric that compares the generated translations with the reference translations based on the n-gram matches. Loss is the cross-entropy loss that measures how well the model predicts the correct tokens. You will use the torchtext.data.metrics.bleu_score function and the torch.nn.CrossEntropyLoss class to compute these metrics. You will also use a simple greedy search method to generate the translations. This method selects the token with the highest probability at each time step and feeds it to the next time step. You can use the following code to define the evaluation and inference functions:

def evaluate(model, data_iterator, criterion, device):
    # Set the model to evaluation mode
    model.eval()
    # Initialize the loss and the number of tokens
    loss = 0
    n_tokens = 0
    # Initialize the lists of predictions and references
    predictions = []
    references = []
    # Disable gradient computation during evaluation
    with torch.no_grad():
        # Loop over the validation data
        for batch in data_iterator:
            # Get the source and target sequences from the batch
            source = batch.src.to(device)
            target = batch.trg.to(device)
            # source: [source_len, batch_size]
            # target: [target_len, batch_size]
            # Get the output from the model (no teacher forcing during evaluation)
            output = model(source, target, teacher_forcing_ratio=0)
            # output: [target_len, batch_size, output_dim]
            # Convert the output and the target to token indices, skipping the first position
            output_tokens = output[1:].argmax(-1).permute(1, 0).cpu().tolist()
            # output_tokens: [batch_size, target_len - 1]
            target_tokens = target[1:].permute(1, 0).cpu().tolist()
            # target_tokens: [batch_size, target_len - 1]
            # Flatten the output and the target for the loss
            output_flat = output[1:].reshape(-1, output.shape[-1])
            target_flat = target[1:].reshape(-1)
            # output_flat: [(target_len - 1) * batch_size, output_dim]
            # target_flat: [(target_len - 1) * batch_size]
            # Calculate the loss
            batch_loss = criterion(output_flat, target_flat)
            # Update the running loss (weighted by the number of target positions) and the token count
            loss += batch_loss.item() * target_flat.shape[0]
            n_tokens += target_flat.shape[0]
            # Extend the lists of predictions and references
            # (BLEU is computed over token indices here, which is only a rough estimate
            # because padding and special tokens are included in the comparison)
            predictions.extend([[str(i) for i in seq] for seq in output_tokens])
            references.extend([[[str(i) for i in seq]] for seq in target_tokens])
    # Calculate the average loss per target position
    loss = loss / n_tokens
    # Calculate the BLEU score
    bleu = torchtext.data.metrics.bleu_score(predictions, references)
    # Return the loss and the BLEU score
    return loss, bleu

def infer(model, source, source_field, target_field, device, max_len=50):
    # Set the model to evaluation mode
    model.eval()
    # Tokenize the source sequence
    tokens = source_field.tokenize(source)
    # Add the start-of-sequence and end-of-sequence tokens
    tokens = [source_field.init_token] + tokens + [source_field.eos_token]
    # Convert the tokens to indices
    indices = [source_field.vocab.stoi[token] for token in tokens]
    # Convert the indices to a tensor
    source = torch.LongTensor(indices).unsqueeze(1).to(device)
    # source: [source_len, 1]
    # Create a placeholder target sequence that starts with the start-of-sequence token;
    # with teacher_forcing_ratio=0 only its first token is actually fed to the decoder
    sos_index = target_field.vocab.stoi[target_field.init_token]
    eos_index = target_field.vocab.stoi[target_field.eos_token]
    target = torch.LongTensor([sos_index] * max_len).unsqueeze(1).to(device)
    # target: [max_len, 1]
    # Run the model without teacher forcing, i.e. greedy decoding
    with torch.no_grad():
        output = model(source, target, teacher_forcing_ratio=0)
    # output: [max_len, 1, output_dim]
    # Take the most likely token at each time step (skip the unused first position)
    output_indices = output[1:].argmax(-1).squeeze(1).tolist()
    # Cut the sequence at the end-of-sequence token
    output_tokens = []
    for index in output_indices:
        if index == eos_index:
            break
        output_tokens.append(index)
    # Convert the output indices to strings
    output_strings = [target_field.vocab.itos[index] for index in output_tokens]
    # Join the output strings into a sentence
    output_sentence = ' '.join(output_strings)
    # Return the output sentence
    return output_sentence

The final step is to train and test the model. You will use the ignite library to manage the training and validation loops, and to log the metrics and the results. You will also use the ignite.contrib.handlers.TensorboardLogger class to visualize the metrics and the results in TensorBoard. You can use the following code to train and test the model:

# Use the fields defined in section 4.1
source_field, target_field = SRC, TRG
# Define the hyperparameters
input_dim = len(source_field.vocab)
output_dim = len(target_field.vocab)
emb_dim = 256
hid_dim = 512
n_layers = 2
dropout = 0.5
n_heads = 8
batch_size = 128
n_epochs = 10
learning_rate = 0.001
clip = 1
# Create the model
model = Seq2Seq(input_dim, output_dim, emb_dim, hid_dim, n_layers, dropout, MultiHeadAttention(hid_dim * 2, n_heads)).to(device) # the attention operates on the decoder output dimension, which is hid_dim * 2
# Create the optimizer
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
# Create the loss function
criterion = nn.CrossEntropyLoss(ignore_index=target_field.vocab.stoi[target_field.pad_token])
# Create the data iterators
train_iterator, valid_iterator, test_iterator = torchtext.data.BucketIterator.splits((train_data, valid_data, test_data), batch_size=batch_size, device=device)
# The trainer and the TensorBoard logger are created below, after the training step has been defined
# Define the training step
def train_step(engine, batch):
    # Set the model to training mode
    model.train()
    # Get the source and target sequences from the batch
    source = batch.src.to(device)
    target = batch.trg.to(device)
    # source: [source_len, batch_size]
    # target: [target_len, batch_size]
    # Get the output from the model
    output = model(source, target)
    # output: [target_len, batch_size, output_dim]
    # Reshape the output and the target
    output = output[1:].reshape(-1, output.shape[-1])
    target = target[1:].reshape(-1)
    # output: [(target_len - 1) * batch_size, output_dim]
    # target: [(target_len - 1) * batch_size]
    # Calculate the loss
    loss = criterion(output, target)
    # Clear the gradients
    optimizer.zero_grad()
    # Compute the gradients
    loss.backward()
    # Clip the gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
    # Update the parameters
    optimizer.step()
    # Return the loss
    return loss.item()
# Create the trainer from the training step
trainer = Engine(train_step)
# Create the TensorBoard logger
tb_logger = TensorboardLogger(log_dir="logs")
# Log the training batch loss after every iteration
tb_logger.attach_output_handler(trainer,
                                event_name=Events.ITERATION_COMPLETED,
                                tag="training",
                                output_transform=lambda loss: {"batch_loss": loss})
# Run validation at the end of every epoch, re-using the evaluate function defined above
@trainer.on(Events.EPOCH_COMPLETED)
def run_validation(engine):
    valid_loss, valid_bleu = evaluate(model, valid_iterator, criterion, device)
    # Log the validation metrics to TensorBoard
    tb_logger.writer.add_scalar("validation/loss", valid_loss, engine.state.epoch)
    tb_logger.writer.add_scalar("validation/bleu", valid_bleu, engine.state.epoch)
    print(f"Epoch {engine.state.epoch}: validation loss = {valid_loss:.3f}, BLEU = {valid_bleu:.3f}")
# Train the model
trainer.run(train_iterator, max_epochs=n_epochs)
# Evaluate the model on the test set
test_loss, test_bleu = evaluate(model, test_iterator, criterion, device)
print(f"Test loss = {test_loss:.3f}, BLEU = {test_bleu:.3f}")
# Close the TensorBoard logger
tb_logger.close()
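
Once training has finished, you can translate new sentences with the infer function defined earlier. The example sentence below is arbitrary, and the quality of the output will depend on how long you trained and how much data you used:

sentence = "I would like to resume the session."
translation = infer(model, sentence, SRC, TRG, device)
print(translation)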

5. Conclusion

In this blog, you have learned how to use PyTorch to build a sequence to sequence model with attention mechanisms for machine translation tasks. You have covered the following topics:

  • The basic concepts and components of sequence to sequence models and attention mechanisms.
  • The encoder-decoder architecture and the different types of attention mechanisms, such as dot-product attention and multi-head attention.
  • The data preparation and preprocessing steps, such as tokenization, vocabulary building, and batching.
  • The model implementation and training steps, such as defining the model class, the optimizer, the loss function, and the training loop.
  • The evaluation and inference steps, such as defining the metrics, the inference method, and the testing loop.

You have also seen how to use the ignite library to manage the training loop and log the metrics, and how to use the ignite.contrib.handlers.TensorboardLogger class to visualize them in TensorBoard.

By the end of this blog, you have built a sequence to sequence model with attention mechanisms that can translate sentences from English to French. You have also tested the model on some example sentences and seen the generated translations.

We hope you have enjoyed this blog and learned something new and useful. If you have any questions or feedback, please feel free to leave a comment below. Thank you for reading!
