PyTorch for NLP: Building a Text Classifier

This blog teaches you how to build, train, and evaluate a text classifier using PyTorch and a dataset of movie reviews. You will learn how to use PyTorch’s modules and functions to create a neural network that can classify movie reviews as positive or negative.

1. Introduction

A text classifier is a natural language processing (NLP) application that automatically assigns a label to a given text based on its content. For example, a text classifier can determine whether a movie review is positive or negative, or whether an email is spam or not.

Text classification is a common and useful task in NLP, and PyTorch is a popular and powerful deep learning framework that can help you create and train your own text classifier. PyTorch provides various modules and functions that can simplify the process of building and training a neural network, as well as enable you to customize your model according to your needs and preferences.

In this tutorial, you will use PyTorch to create a text classifier that can classify movie reviews as positive or negative. You will use a dataset of 50,000 movie reviews from the Internet Movie Database (IMDb), which is a widely used benchmark for sentiment analysis. You will learn how to:

Prepare the dataset for text classification
Define the model architecture using PyTorch’s modules and functions
Train the model using PyTorch’s loss function and optimizer
Evaluate the model using various metrics and visualizations
Test the model on new and unseen movie reviews

By the end of this tutorial, you will have a working text classifier that can accurately predict the sentiment of movie reviews, as well as a solid understanding of how to use PyTorch for NLP tasks.

Are you ready to get started? Let’s dive in!

2. PyTorch Basics

Before you start building your text classifier, you need to have some basic knowledge of PyTorch and how it works. PyTorch is a Python-based library that provides a flexible and expressive way of creating and training deep neural networks. PyTorch has three main components that you will use in this tutorial: tensors, autograd, and modules and functions.

Tensors are the fundamental data structures of PyTorch. They are similar to NumPy arrays, but they can also be used on GPUs and other devices for faster computation. Tensors can have different shapes and dimensions, and they can store various types of data, such as scalars, vectors, matrices, and images. You will use tensors to store and manipulate your input and output data, as well as the parameters of your model.

Autograd is the automatic differentiation engine of PyTorch. It allows you to compute the gradients of a tensor with respect to the other tensors in your computation graph. This is very useful for optimizing your model, as you can use the gradients to update the parameters of your model using gradient descent or other algorithms. You will use autograd to compute the gradients of the loss with respect to the parameters of your model during training.

Modules and functions are the building blocks of your model. Modules are classes that inherit from the torch.nn.Module base class and define the layers and operations of your model. Functions are standalone operations that can be applied to tensors, such as activation functions and loss functions. Optimizers, which update the parameters of your model, are provided as classes in the torch.optim package. You will use modules and functions to define the architecture and the behavior of your model, as well as to train and evaluate your model.

In the next sections, you will learn more about each of these components and how to use them in PyTorch. You will also see some code examples that illustrate how to create and use tensors, autograd, and modules and functions in PyTorch.

2.1. Tensors

To create a tensor in PyTorch, you can use the torch.tensor() function and pass a list or an array of values as an argument. You can also specify the data type and the device of the tensor using the dtype and device arguments. For example, the following code creates a tensor of shape (2, 3) with integer values on the CPU:

import torch
x = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.int32, device='cpu')
print(x)
print(x.shape)
print(x.dtype)
print(x.device)

The output is:

tensor([[1, 2, 3],
        [4, 5, 6]], dtype=torch.int32)
torch.Size([2, 3])
torch.int32
cpu

You can also create tensors with random values using the torch.rand() or torch.randn() functions, which generate tensors with values from a uniform or a normal distribution, respectively. For example, the following code creates a tensor of shape (3, 4) with values from a standard normal distribution on the GPU (this requires a CUDA-capable GPU; on a machine without one, use device='cpu' instead):

y = torch.randn(3, 4, device='cuda')
print(y)
print(y.shape)
print(y.dtype)
print(y.device)

The output is:

tensor([[-0.4979, -0.2238, -0.4359, -0.2159],
        [-0.5164,  0.5710, -0.0879, -0.0847],
        [ 0.1569, -0.2940, -0.1612,  0.4579]], device='cuda:0')
torch.Size([3, 4])
torch.float32
cuda:0

You can perform various operations on tensors, such as arithmetic operations, indexing, slicing, reshaping, and concatenating. PyTorch supports broadcasting, which means that you can operate on tensors of different shapes as long as they are compatible. Concatenation is stricter: the tensors must live on the same device and match in every dimension except the one being concatenated. For example, the following code adds a scalar value to a tensor, and then concatenates two tensors along the first dimension after moving x to the device and data type of y and keeping only the first three columns of y:

z = x + 10 # broadcasting
print(z)

w = torch.cat([x.to(device=y.device, dtype=y.dtype), y[:, :3]], dim=0) # concatenating a (2, 3) and a (3, 3) tensor
print(w)

The output is:

tensor([[11, 12, 13],
        [14, 15, 16]], dtype=torch.int32)
tensor([[ 1.0000,  2.0000,  3.0000],
        [ 4.0000,  5.0000,  6.0000],
        [-0.4979, -0.2238, -0.4359],
        [-0.5164,  0.5710, -0.0879],
        [ 0.1569, -0.2940, -0.1612]], device='cuda:0')

Tensors are the basic building blocks of PyTorch, and you will use them extensively throughout this tutorial. You can learn more about tensors and their operations in the PyTorch documentation.
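As a small illustration of the indexing, slicing, reshaping, and broadcasting operations mentioned above, here is a minimal sketch that runs on the CPU only. The tensor name t and its values are made up for this example.

```python
import torch

# A small 2 x 3 integer tensor (values chosen only for illustration)
t = torch.tensor([[1, 2, 3], [4, 5, 6]])

first_row = t[0]            # indexing: first row, shape (3,)
last_col = t[:, -1]         # slicing: last column, shape (2,)
flat = t.reshape(6)         # reshaping: 2 x 3 -> 6 elements
tall = t.reshape(3, 2)      # reshaping: 2 x 3 -> 3 x 2

# Broadcasting: the (3,) row is added to each of the 2 rows of t
shifted = t + torch.tensor([10, 20, 30])

print(first_row.tolist())   # [1, 2, 3]
print(last_col.tolist())    # [3, 6]
print(tall.shape)           # torch.Size([3, 2])
print(shifted.tolist())     # [[11, 22, 33], [14, 25, 36]]
```

Reshaping only changes how the same six values are arranged, which is why the number of elements has to stay the same.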

2.2. Autograd

To use autograd, you need to set the requires_grad attribute of your tensors to True. This tells PyTorch that you want to track the operations on these tensors and compute their gradients later. For example, the following code creates two tensors with requires_grad=True and performs a simple addition operation on them:

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = torch.tensor([3.0, 4.0], requires_grad=True)
c = a + b
print(c)
print(c.requires_grad)

The output is:

tensor([4., 6.], grad_fn=<AddBackward0>)
True

You can see that the resulting tensor c also has requires_grad=True and has a grad_fn attribute that indicates the function that created it. PyTorch records the operations that create tensors in a computation graph, which is a directed acyclic graph (DAG). The nodes of the graph are the recorded functions, and the edges are the tensors that flow between them. The leaves of the graph are the input tensors, and the roots are the output tensors.

To compute the gradients of the output tensor with respect to the input tensors, you can use the backward() method on the output tensor. This will traverse the computation graph in reverse order, from the root node to the leaf nodes, and apply the chain rule to calculate the gradients. The gradients will be stored in the grad attribute of the input tensors. For example, the following code computes the gradients of c with respect to a and b:

c.backward(torch.ones(2)) # pass a tensor of the same shape as c
print(a.grad)
print(b.grad)

The output is:

tensor([1., 1.])
tensor([1., 1.])

You can see that the gradients of c with respect to a and b are both tensors of ones, which is expected since the addition operation has a constant gradient of one. You can use these gradients to update the values of a and b using gradient descent or other optimization algorithms.

Autograd is a powerful and convenient feature of PyTorch that simplifies the process of computing gradients and optimizing your model. You can learn more about autograd and its operations in the PyTorch documentation.
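To make the link between gradients and parameter updates concrete, here is a minimal sketch of gradient descent written by hand with autograd. The toy data, the learning rate, and the number of steps are assumptions chosen for the example; they do not come from the movie review task.

```python
import torch

# Toy data following y = 2 * x (hypothetical values for the example)
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 6.0])

w = torch.tensor(0.0, requires_grad=True)  # the single parameter to learn
lr = 0.05                                   # learning rate (assumed value)

for step in range(100):
    loss = ((w * x - y) ** 2).mean()  # scalar loss, so backward() needs no argument
    loss.backward()                   # autograd stores d(loss)/dw in w.grad
    with torch.no_grad():             # update the parameter without recording the update
        w -= lr * w.grad
    w.grad.zero_()                    # gradients accumulate, so reset them each step

print(round(w.item(), 3))             # close to 2.0
```

Later in the tutorial an optimizer takes over the update and the reset of the gradients, but the steps are the same: forward pass, backward pass, parameter update.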

2.3. Modules and Functions

To create a module in PyTorch, you need to subclass the torch.nn.Module class and implement two methods: __init__() and forward(). The __init__() method is where you define the parameters and submodules of your module, such as linear layers, convolutional layers, recurrent layers, etc. The forward() method is where you define the logic of your module, such as how the input is transformed into the output using the parameters and submodules. For example, the following code defines a simple module that performs a linear transformation followed by a sigmoid activation:

import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self, input_size, output_size):
        super(MyModule, self).__init__() # call the parent class constructor
        self.linear = nn.Linear(input_size, output_size) # define a linear layer
        self.sigmoid = nn.Sigmoid() # define a sigmoid activation function
    
    def forward(self, x):
        x = self.linear(x) # apply the linear layer
        x = self.sigmoid(x) # apply the sigmoid activation
        return x

To create a function in PyTorch, you can use the predefined functions in the torch.nn.functional module (many of which also have module counterparts in torch.nn), or you can define your own custom function using the torch.autograd.Function class. The predefined functions include common operations such as activation functions and loss functions, while optimizers live in the separate torch.optim package. For example, the following code creates an instance of MyModule (the sizes 10 and 1 are arbitrary example values) and defines a loss function and an optimizer for it:

import torch.nn as nn
import torch.optim as optim

model = MyModule(input_size=10, output_size=1) # instantiate the module defined above
criterion = nn.BCELoss() # define a binary cross entropy loss function
optimizer = optim.Adam(model.parameters(), lr=0.01) # define an Adam optimizer

Modules and functions are the essential components of your model, and you will use them to create and train your text classifier. You can learn more about modules and functions and their usage in the PyTorch documentation.
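The pieces above fit together in a training loop. The following is a minimal sketch, assuming a made-up batch of 8 examples with 4 features each. The model is a stand-in built with nn.Sequential that mirrors the linear-plus-sigmoid structure of MyModule, and the sizes, seed, and number of steps are arbitrary.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)  # make the random initialization repeatable

# Stand-in model: a linear layer followed by a sigmoid, mirroring MyModule above
model = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())

criterion = nn.BCELoss()                             # expects probabilities in [0, 1]
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Hypothetical batch: 8 examples with 4 features each, and float labels of 0 or 1
inputs = torch.randn(8, 4)
targets = torch.tensor([[1.0], [0.0], [1.0], [0.0], [1.0], [0.0], [1.0], [0.0]])

losses = []
for _ in range(50):
    optimizer.zero_grad()                # clear gradients from the previous step
    outputs = model(inputs)              # forward pass, shape (8, 1)
    loss = criterion(outputs, targets)   # compare predictions with labels
    loss.backward()                      # backpropagate through the model
    optimizer.step()                     # update the parameters
    losses.append(loss.item())

print(losses[0], losses[-1])             # the loss should go down over the steps
```

Note that nn.BCELoss needs float targets with the same shape as the model output, which is why the labels are written as a column of floats.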

3. Dataset Preparation

To build and train your text classifier, you need a dataset of movie reviews and their corresponding labels. A label is a binary value that indicates whether the review is positive or negative. For example, a review that says “I loved this movie, it was amazing and hilarious” would have a label of 1 (positive), while a review that says “I hated this movie, it was boring and stupid” would have a label of 0 (negative).

There are many datasets of movie reviews available online, but one of the most popular and widely used ones is the IMDb dataset, which contains 50,000 movie reviews from the Internet Movie Database. The dataset is split into 25,000 reviews for training and 25,000 reviews for testing, and each set has an equal number of positive and negative reviews. The reviews are stored as raw text, so capitalization and punctuation are preserved, as the sample rows below show. You can download the dataset from this link.

Once you have downloaded the dataset, you need to load it into your Python environment and store it in a data structure that is suitable for text classification. One of the most common and convenient data structures for text classification is a Pandas DataFrame, which is a two-dimensional tabular data structure that can store various types of data, such as strings, numbers, booleans, etc. A DataFrame has rows and columns, and each row represents a data point (a movie review and its label), and each column represents a feature (the review text or the label).

To load the dataset into a DataFrame, you can use the pandas.read_csv() function and pass the path of the dataset file as an argument. You can also specify the names of the columns using the names argument, and the delimiter of the data using the sep argument. For example, the following code loads the training set of the IMDb dataset into a DataFrame, assuming a tab-separated file whose two columns hold the label and the review text and which has no header row:

import pandas as pd

train_df = pd.read_csv('aclImdb/train/labeledTrainData.tsv', names=['label', 'review'], sep='\t')
print(train_df.head()) # print the first five rows of the DataFrame

The output is:

   label                                             review
0      1  With all this stuff going down at the moment w...
1      1  \The Classic War of the Worlds\" by Timothy Hi...
2      0  The film starts with a manager (Nicholas Bell)...
3      0  It must be assumed that those who praised this...
4      1  Superbly trashy and wondrously unpretentious 8...

You can see that the DataFrame has two columns: label and review. The label column contains the binary values of 1 (positive) or 0 (negative), and the review column contains the text of the movie reviews. The head() method only shows the first five rows; the full DataFrame has 25,000 rows, one for each review in the training set, which you can confirm with len(train_df).

Dataset preparation is an essential step in any machine learning project, because every later stage depends on being able to access and manipulate your data in a convenient and efficient way. You can learn more about Pandas and its functions in the Pandas documentation.
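Before moving on, it is worth checking that the data loaded the way you expect. The sketch below uses a hypothetical three-row sample held in memory instead of the real file, so the text and labels are made up, but the same checks apply to train_df.

```python
import io
import pandas as pd

# Hypothetical three-row sample in a two-column, tab-separated layout
sample_tsv = (
    "1\tI loved this movie, it was amazing and hilarious\n"
    "0\tI hated this movie, it was boring and stupid\n"
    "1\tA wonderful film\n"
)

sample_df = pd.read_csv(io.StringIO(sample_tsv), names=['label', 'review'], sep='\t')

print(len(sample_df))                        # number of rows
print(sample_df['label'].value_counts())     # how many reviews carry each label
print(sample_df['review'].str.len().max())   # length of the longest review in characters
```

On the real training set, the row count should be 25,000 and the two labels should appear equally often.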

3.1. Loading the Dataset

The read_csv() call shown at the start of this section loads the training reviews into train_df, with one row per review and the two columns label and review. The following subsections work with this DataFrame.

3.2. Tokenization and Encoding

After loading the dataset into a DataFrame, you need to perform some preprocessing steps on the text of the movie reviews. One of the most important steps is tokenization and encoding, which are the processes of converting the text into numerical representations that can be fed into your model.

Tokenization is the process of splitting the text into smaller units, such as words, characters, or subwords. The choice of unit determines the size of the vocabulary and how much of the structure of the text is preserved. There are different ways to perform tokenization, such as using whitespace, punctuation, or predefined rules or models. For example, the following code uses the nltk library to perform word tokenization on a sample movie review:

import nltk
nltk.download('punkt') # download the tokenizer model (recent NLTK releases may ask for 'punkt_tab' instead)
sample_review = "This movie is awesome, I really enjoyed it."
tokens = nltk.word_tokenize(sample_review) # tokenize the review
print(tokens)

The output is:

['This', 'movie', 'is', 'awesome', ',', 'I', 'really', 'enjoyed', 'it', '.']

You can see that the tokenizer splits the review into words, and also keeps the punctuation marks as separate tokens.

Encoding is the process of mapping the tokens to numerical values, such as integers or vectors. Encoding helps to create a common and consistent representation of the text, as well as to facilitate the computation and learning of the model. There are different ways to perform encoding, such as using one-hot encoding, frequency-based encoding, or pretrained embeddings. For example, the following code uses the torchtext library to perform frequency-based encoding on the tokens of the movie reviews:

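import torchtext
tokenized_reviews = train_df['review'].apply(nltk.word_tokenize) # tokenize every review in the training set
vocab = torchtext.vocab.build_vocab_from_iterator(tokenized_reviews, specials=['<pad>', '<unk>']) # build a vocabulary, reserving index 0 for padding and 1 for unknown tokens
vocab.set_default_index(vocab['<unk>']) # map tokens that are not in the vocabulary to <unk>
print(vocab(['This', 'movie', 'is', 'awesome'])) # encode a list of tokens

The output looks something like this (the exact indices depend on the vocabulary):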

[9, 17, 6, 1380]

You can see that the encoder maps each token to an integer index. The vocabulary orders tokens by descending frequency, so frequent tokens receive small indices and rare tokens receive large ones.

Tokenization and encoding are essential steps for preparing your text data for your model. You can learn more about tokenization and encoding and their methods in the Stanford NLP book and the torchtext documentation.
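If you want to see what a frequency-based vocabulary does under the hood, the idea can be sketched in plain Python. This is an illustration only, not the torchtext implementation. The sample reviews, the special token names, and the helper encode() are assumptions made for the example.

```python
from collections import Counter

# Hypothetical tokenized reviews (already lowercased for brevity)
tokenized_reviews = [
    ['this', 'movie', 'is', 'awesome'],
    ['this', 'movie', 'is', 'boring'],
    ['this', 'movie', 'is', 'amazing', 'and', 'hilarious'],
]

PAD, UNK = '<pad>', '<unk>'  # special tokens (a common naming convention, assumed here)

# Count how often each token occurs across all reviews
counts = Counter(token for review in tokenized_reviews for token in review)

# Index 0 is reserved for padding and 1 for unknown tokens;
# the remaining tokens are numbered from most to least frequent
itos = [PAD, UNK] + [token for token, _ in counts.most_common()]
stoi = {token: index for index, token in enumerate(itos)}

def encode(tokens):
    """Map tokens to integer IDs, falling back to the <unk> index."""
    return [stoi.get(token, stoi[UNK]) for token in tokens]

print(encode(['this', 'movie', 'is', 'awesome']))   # [2, 3, 4, 5]
print(encode(['this', 'movie', 'is', 'terrible']))  # [2, 3, 4, 1]
```

Reserving index 0 for padding matters in the next step, where shorter reviews are filled up with zeros.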

3.3. Padding and Batching

After tokenizing and encoding the text of the movie reviews, you need to perform another preprocessing step: padding and batching. Padding makes the encoded reviews in a batch the same length by adding zeros at the end, and batching groups the padded reviews so that they can be stacked into a single tensor. Together, they let your model process multiple reviews at once instead of one variable-length review at a time.

Padding is the process of adding zeros at the end of the encoded reviews until they reach a maximum length. The maximum length can be either a fixed value or the length of the longest review in the batch. Padding ensures that all the reviews in the batch have the same length and can be stacked into a single tensor. For example, the following code uses the torch.nn.utils.rnn.pad_sequence() function to pad a list of encoded reviews to the length of the longest review:

import torch
import torch.nn as nn

encoded_reviews = [torch.tensor([9, 17, 6, 1380]), # "This movie is awesome"
                   torch.tensor([9, 17, 6, 1381]), # "This movie is boring"
                   torch.tensor([9, 17, 6, 1382, 1383, 1384]), # "This movie is amazing and hilarious"
                   torch.tensor([9, 17, 6, 1385, 1383, 1386])] # "This movie is terrible and stupid"

padded_reviews = nn.utils.rnn.pad_sequence(encoded_reviews, batch_first=True) # pad the reviews to the length of the longest review
print(padded_reviews)

The output is:

tensor([[   9,   17,    6, 1380,    0,    0],
        [   9,   17,    6, 1381,    0,    0],
        [   9,   17,    6, 1382, 1383, 1384],
        [   9,   17,    6, 1385, 1383, 1386]])

You can see that the padded reviews have the same length of 6, which is the length of the longest review, and that zeros are added at the end of the shorter reviews. The token indices in this example are illustrative.

Batching is the process of grouping the padded reviews into batches of a fixed size. Batching allows you to process multiple reviews at once and parallelize the computation and learning of your model. For example, the following code uses the torch.utils.data.DataLoader class to create batches of size 2 from the four padded reviews and their labels:

import torch
import torch.utils.data as data

batch_size = 2 # define the batch size
labels = torch.tensor([1, 0, 1, 0]) # labels of the four example reviews (1 = positive, 0 = negative)
dataset = data.TensorDataset(padded_reviews, labels) # create a dataset from the padded reviews and the labels
dataloader = data.DataLoader(dataset, batch_size=batch_size, shuffle=True) # create a dataloader that generates batches of size 2 and shuffles the data

for batch in dataloader: # iterate over the batches
    print(batch)

The output looks like this (the order varies between runs because the data is shuffled):

[tensor([[   9,   17,    6, 1382, 1383, 1384],
        [   9,   17,    6, 1385, 1383, 1386]]), tensor([1, 0])]
[tensor([[   9,   17,    6, 1380,    0,    0],
        [   9,   17,    6, 1381,    0,    0]]), tensor([1, 0])]

You can see that the dataloader generates two batches of size 2, each containing a tensor of padded reviews and a tensor of labels. The dataloader also shuffles the data, which means that the order of the reviews is randomized.

Padding and batching are important steps for preparing your text data for your model. You can learn more about padding and batching and their functions in the PyTorch documentation.
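Padding every review to the longest review in the whole dataset can waste a lot of computation. A common alternative is to pad each batch separately inside the DataLoader. Here is a minimal sketch of that idea. The example reviews, their labels, and the helper collate() are hypothetical.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical encoded reviews of different lengths with their labels
examples = [
    (torch.tensor([9, 17, 6, 1380]), 1),
    (torch.tensor([9, 17, 6, 1381]), 0),
    (torch.tensor([9, 17, 6, 1382, 1383, 1384]), 1),
    (torch.tensor([9, 17]), 0),
]

def collate(batch):
    """Pad the reviews in one batch to the longest review in that batch."""
    reviews, labels = zip(*batch)
    lengths = torch.tensor([len(r) for r in reviews])   # true lengths before padding
    padded = pad_sequence(list(reviews), batch_first=True, padding_value=0)
    return padded, torch.tensor(labels), lengths

# shuffle=False keeps the order fixed so the printed shapes are predictable
loader = DataLoader(examples, batch_size=2, shuffle=False, collate_fn=collate)

for padded, labels, lengths in loader:
    print(padded.shape, labels.tolist(), lengths.tolist())
```

The first batch is padded to length 4 and the second to length 6, so no batch is longer than it needs to be. Returning the true lengths is optional, but it is useful if you later want the model to ignore the padded positions.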

4. Model Architecture

Now that you have prepared your dataset for text classification, you need to define the model architecture that will process the input data and produce the output predictions. The model architecture is the structure and design of your neural network, which consists of various layers and operations that perform different functions on the data. The model architecture determines how your model learns from the data and how it performs on the task.

There are many possible model architectures for text classification, but one of the most common and effective ones is the Long Short-Term Memory (LSTM) network. An LSTM network is a type of recurrent neural network (RNN) that can process sequential data, such as text, and capture the long-term dependencies and context of the data. An LSTM network consists of multiple LSTM cells, each of which has a hidden state and a memory cell that store and update the information of the sequence. An LSTM network can learn to encode the meaning and sentiment of a text into a fixed-length vector, which can then be used for classification.

To build an LSTM network for text classification in PyTorch, you need to use two main modules: the torch.nn.Embedding module and the torch.nn.LSTM module. The torch.nn.Embedding module is a layer that maps the integer-encoded tokens into dense vectors of a fixed size, called embeddings. The embeddings can capture the semantic and syntactic information of the tokens, and can be learned from scratch or initialized with pretrained embeddings. The torch.nn.LSTM module is a layer that implements the LSTM network, and takes the embeddings as input and returns the output and the hidden state of the network.

The following code shows how to create a simple LSTM network for text classification in PyTorch:

import torch
import torch.nn as nn

class LSTMClassifier(nn.Module): # define a class that inherits from nn.Module
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim): # define the constructor
        super().__init__() # call the parent constructor
        self.embedding = nn.Embedding(vocab_size, embedding_dim) # create an embedding layer
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True) # create an LSTM layer that expects inputs of shape (batch size, sequence length, embedding dimension)
        self.fc = nn.Linear(hidden_dim, output_dim) # create a linear layer
        self.sigmoid = nn.Sigmoid() # create a sigmoid activation function
    
    def forward(self, x): # define the forward pass
        x = self.embedding(x) # pass the input through the embedding layer
        x, (hidden, cell) = self.lstm(x) # pass the embeddings through the LSTM layer
        x = self.fc(hidden[-1]) # pass the last hidden state through the linear layer
        x = self.sigmoid(x) # apply the sigmoid activation function
        return x # return the output

The code defines a class called LSTMClassifier that inherits from the nn.Module base class. The class has a constructor that takes four arguments: the vocabulary size, the embedding dimension, the hidden dimension, and the output dimension. The constructor creates four attributes: an embedding layer, an LSTM layer, a linear layer, and a sigmoid activation function. The class also has a forward method that takes the input tensor as an argument and returns the output tensor. The forward method passes the input through the embedding layer, the LSTM layer, the linear layer, and the sigmoid activation function, in that order.

The LSTM network is a simple and powerful model architecture for text classification, and PyTorch provides various modules and functions that can help you create and train your own LSTM network. You can learn more about LSTM networks and their modules and functions in the PyTorch documentation.
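A quick way to check an architecture like this is to push a small dummy batch through it and look at the output shape. The sketch below defines a compact stand-in classifier for that purpose. The class name TinyLSTMClassifier, the layer sizes, and the token IDs are assumptions for the example, and the stand-in uses batch_first=True and padding_idx=0 so that it accepts batch-first padded input.

```python
import torch
import torch.nn as nn

class TinyLSTMClassifier(nn.Module):
    """Small stand-in classifier that takes batch-first token IDs."""
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        embedded = self.embedding(x)            # (batch, seq_len, embedding_dim)
        _, (hidden, _) = self.lstm(embedded)    # hidden: (num_layers, batch, hidden_dim)
        logits = self.fc(hidden[-1])            # (batch, output_dim)
        return torch.sigmoid(logits)            # probabilities between 0 and 1

model = TinyLSTMClassifier(vocab_size=2000, embedding_dim=16, hidden_dim=8, output_dim=1)

# Dummy batch of two padded reviews (token IDs are illustrative)
batch = torch.tensor([[9, 17, 6, 1380, 0, 0],
                      [9, 17, 6, 1382, 1383, 1384]])
probs = model(batch)

print(probs.shape)                                   # torch.Size([2, 1])
print(sum(p.numel() for p in model.parameters()))    # total number of parameters
```

If the output shape is not (batch size, output dimension), the most likely cause is a mismatch between the layout of the input and the batch_first setting of the LSTM layer.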

4.1. Embedding Layer

The embedding layer is the first layer of your model architecture, and it is responsible for mapping the integer-encoded tokens into dense vectors of a fixed size, called embeddings. The embeddings can capture the semantic and syntactic information of the tokens, and can be learned from scratch or initialized with pretrained embeddings.

To create an embedding layer in PyTorch, you need to use the torch.nn.Embedding module, which takes two arguments: the vocabulary size and the embedding dimension. The vocabulary size is the number of unique tokens in your vocabulary, and the embedding dimension is the size of the vector that represents each token. For example, the following code creates an embedding layer with a vocabulary size of 10,000 and an embedding dimension of 100:

import torch
import torch.nn as nn

embedding = nn.Embedding(10000, 100) # create an embedding layer

The embedding layer has a weight matrix of shape (vocabulary size, embedding dimension), which stores the embeddings of all the tokens in your vocabulary. You can access the weight matrix by using the weight attribute of the embedding layer. For example, the following code prints the shape and the first row of the weight matrix:

print(embedding.weight.shape) # print the shape of the weight matrix
print(embedding.weight[0]) # print the first row of the weight matrix

The output looks like this (most of the 100 values are elided with ... for brevity; they are random and differ on every run):

torch.Size([10000, 100])
tensor([-0.0008, -0.0012, -0.0019,  ..., -0.0009, -0.0010, -0.0010], grad_fn=<SelectBackward0>)

You can see that the weight matrix has a shape of (10000, 100), and that the first row of the weight matrix is a vector of 100 random values, which represent the embedding of the first token in your vocabulary. The weight matrix is initialized randomly, but it can be updated during training using backpropagation and gradient descent.

To use the embedding layer, you need to pass an input tensor of shape (batch size, sequence length), which contains the integer-encoded tokens of the reviews in the batch. The embedding layer will then return an output tensor of shape (batch size, sequence length, embedding dimension), which contains the embeddings of the tokens in the batch. For example, the following code passes a sample input tensor of shape (2, 4) through the embedding layer and prints the output tensor:

input = torch.tensor([[1, 2, 3, 4], [5, 6, 7, 8]]) # create a sample input tensor
output = embedding(input) # pass the input through the embedding layer
print(output.shape) # print the shape of the output tensor
print(output) # print the output tensor

The output is (the values are elided because they are random; each of the eight token IDs is mapped to its own row of the weight matrix, so the eight vectors differ from one another):

torch.Size([2, 4, 100])
tensor([[[ ... ]]], grad_fn=<EmbeddingBackward0>)

You can see that the output tensor has a shape of (2, 4, 100), and that each token in the input tensor is mapped to a vector of 100 values, which represent its embedding.
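Because shorter reviews are padded with zeros, it is common to tell the embedding layer which token ID is padding. The following sketch shows the effect of the padding_idx argument on a deliberately tiny layer. The sizes and token IDs are made up for the example.

```python
import torch
import torch.nn as nn

# padding_idx=0 tells the layer that token ID 0 is padding
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)

token_ids = torch.tensor([[3, 5, 0, 0]])    # one review padded with two zeros
vectors = embedding(token_ids)               # shape (1, 4, 4)

print(vectors.shape)
print(vectors[0, 2])                         # embedding of the padding token: all zeros

# The padding row receives no gradient, so it stays at zero during training
vectors.sum().backward()
print(embedding.weight.grad[0])              # all zeros
print(embedding.weight.grad[3])              # all ones: token 3 appeared once
```

This keeps the padding positions from contributing a learned vector of their own, although the LSTM still steps over them unless you also handle the sequence lengths.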

4.2. LSTM Layer

The LSTM layer is the second layer of your model architecture, and it is responsible for processing the embeddings of the tokens and capturing the long-term dependencies and context of the text. The LSTM layer consists of multiple LSTM cells, each of which has a hidden state and a memory cell that store and update the information of the sequence. The LSTM layer can learn to encode the meaning and sentiment of a text into a fixed-length vector, which can then be used for classification.

To create an LSTM layer in PyTorch, you need to use the torch.nn.LSTM module, which takes three arguments: the input dimension, the hidden dimension, and the number of layers. The input dimension is the size of the embeddings that are fed into the LSTM layer, and the hidden dimension is the size of the hidden state and the memory cell of each LSTM cell. The number of layers is the number of stacked LSTM cells that form the LSTM layer. For example, the following code creates an LSTM layer with an input dimension of 100, a hidden dimension of 50, and two layers:

import torch
import torch.nn as nn

lstm = nn.LSTM(100, 50, 2) # create an LSTM layer

The LSTM layer has two weight matrices and two bias vectors for each layer, which store the parameters of the four gates (input, forget, cell, and output) that regulate the flow of information in the cell. You can access them through the weight_ih_l0, weight_hh_l0, bias_ih_l0, and bias_hh_l0 attributes of the LSTM layer, where the trailing number is the layer index. The parameters of the four gates are stacked along the first dimension, so each of these tensors has 4 x hidden dimension rows. For example, the following code prints the shape of the input-to-hidden weight matrix and bias vector of the first layer:

print(lstm.weight_ih_l0.shape) # print the shape of the input-to-hidden weight matrix of the first layer
print(lstm.bias_ih_l0.shape) # print the shape of the input-to-hidden bias vector of the first layer

The output is:

torch.Size([200, 100])
torch.Size([200])

You can see that the input-to-hidden weight matrix of the first layer has a shape of (200, 100), that is, (4 x 50, 100), and that the matching bias vector has a shape of (200). The weight matrix and the bias vector are initialized randomly, but they can be updated during training using backpropagation and gradient descent.

To use the LSTM layer, you need to pass an input tensor of shape (sequence length, batch size, input dimension), which contains the embeddings of the tokens in the batch. This is the default layout; if you create the layer with batch_first=True, the batch size comes first instead. The LSTM layer will then return an output tensor of shape (sequence length, batch size, hidden dimension), which contains the output of the LSTM layer for each time step, together with a tuple holding the hidden state and the cell state for the last time step, each of shape (number of layers, batch size, hidden dimension). For example, the following code passes a sample input tensor of shape (4, 2, 100) through the LSTM layer and prints the shapes of the returned tensors:

input = torch.randn(4, 2, 100) # create a sample input tensor
output, (hidden, cell) = lstm(input) # pass the input through the LSTM layer
print(output.shape) # print the shape of the output tensor
print(hidden.shape) # print the shape of the hidden tensor
print(cell.shape) # print the shape of the cell tensor

The output is:

torch.Size([4, 2, 50])
torch.Size([2, 2, 50])
torch.Size([2, 2, 50])

You can see that the output tensor has a shape of (4, 2, 50), and that the hidden and cell tensors have a shape of (2, 2, 50). The output tensor contains the output of the LSTM layer for each time step, and the hidden and cell tensors contain the hidden state and the memory cell of the LSTM layer for the last time step.

You can learn more about the LSTM layer and its arguments in the PyTorch documentation.
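One practical detail with padded batches is that the last time step of a short review is padding, not a real token. A common remedy is to pack the batch with torch.nn.utils.rnn.pack_padded_sequence so that the LSTM stops at the true end of each review. The sketch below illustrates this on random stand-in embeddings. The sizes, the seed, and the lengths are assumptions for the example.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)

# Two embedded reviews padded to length 5; the second has only 2 real tokens
embedded = torch.randn(2, 5, 4)
lengths = torch.tensor([5, 2])   # true lengths, kept on the CPU
embedded[1, 2:] = 0.0            # positions 2..4 of the second review are padding

# Packing tells the LSTM where each sequence really ends
packed = pack_padded_sequence(embedded, lengths, batch_first=True, enforce_sorted=False)
_, (hidden, _) = lstm(packed)

# Reference: run the second review alone, without its padding
_, (hidden_short, _) = lstm(embedded[1:2, :2])

print(hidden.shape)   # torch.Size([1, 2, 3])
print(torch.allclose(hidden[-1, 1], hidden_short[-1, 0], atol=1e-5))   # True
```

With packing, the final hidden state of the short review matches the state you would get from running the review without any padding, which is the vector you want to pass to the classifier.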

4.3. Linear Layer

The final layer of your model is the linear layer, which is also known as the fully connected layer or the dense layer. This layer takes the output of the LSTM layer and transforms it into a vector of size 2, which corresponds to the number of classes in your dataset (positive or negative). This is an alternative to the single sigmoid output of the LSTMClassifier shown earlier: with two output units, each class gets its own score. The linear layer applies a linear transformation to the input, followed by an optional activation function. In this case, you will use the softmax function as the activation function, which normalizes the output vector to a probability distribution over the classes.

The linear layer is defined using the torch.nn.Linear module, which takes two arguments: the input size and the output size. The input size is the same as the hidden size of the LSTM layer, which is 256 in this tutorial. The output size is 2, which is the number of classes in your dataset. The torch.nn.Linear module also creates a weight matrix and a bias vector, which are the parameters of the linear layer that will be learned during training.

To create the linear layer, you can use the following code:

# Define the linear layer
self.linear = torch.nn.Linear(hidden_size, 2)

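To apply the linear layer to the output of the LSTM layer, you can use the following code:

# Get the output of the LSTM layer at the last time step
# (the first dimension is the sequence length; use output[:, -1, :] only if the LSTM was created with batch_first=True)
output = output[-1, :, :]

# Apply the linear layer to get one raw score (logit) per class
output = self.linear(output)

# Convert the scores to probabilities when you want to inspect them;
# pass the raw scores, not these probabilities, to torch.nn.CrossEntropyLoss
probabilities = torch.nn.functional.softmax(output, dim=1)

The output of the linear layer is a tensor of shape (batch_size, 2), and after the softmax function each row is a probability distribution over the classes for a given movie review. Because the labels are 0 for negative and 1 for positive, a row of [0.8, 0.2] means that the model predicts that the movie review is 80% likely to be negative and 20% likely to be positive.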
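The three layers can be put together as in the following minimal sketch. The class name SentimentClassifier, the vocabulary size, and the other sizes are assumptions made for this example, and the input is assumed to have the shape (sequence length, batch size). The forward method returns the raw scores, and softmax is applied separately when probabilities are wanted:

```python
import torch

class SentimentClassifier(torch.nn.Module):
    """Embedding -> LSTM -> linear layer; returns raw class scores (logits)."""

    def __init__(self, vocab_size, embedding_dim, hidden_size, num_classes=2):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embedding_dim)
        self.lstm = torch.nn.LSTM(embedding_dim, hidden_size)
        self.linear = torch.nn.Linear(hidden_size, num_classes)

    def forward(self, tokens):
        # tokens: (sequence length, batch size) tensor of token indices
        embedded = self.embedding(tokens)   # (seq_len, batch, embedding_dim)
        output, _ = self.lstm(embedded)     # (seq_len, batch, hidden_size)
        last_step = output[-1]              # (batch, hidden_size)
        return self.linear(last_step)       # (batch, num_classes) logits

# Hypothetical sizes, chosen only for this example
model = SentimentClassifier(vocab_size=1000, embedding_dim=100, hidden_size=256)
tokens = torch.randint(0, 1000, (12, 3))      # 3 reviews of 12 tokens each
logits = model(tokens)
probabilities = torch.softmax(logits, dim=1)  # each row sums to 1

print(logits.shape)  # torch.Size([3, 2])
print(probabilities.sum(dim=1))
```

Keeping softmax outside the forward method lets the same model be used with a loss function that expects raw scores during training and with probabilities at prediction time.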

5. Model Training

Now that you have defined your model architecture, you are ready to train your model using PyTorch’s loss function and optimizer. The loss function is a measure of how well your model performs on the training data, and the optimizer is an algorithm that updates the parameters of your model based on the gradients computed by the autograd engine. The goal of training is to minimize the loss function and improve the accuracy of your model.

The loss function that you will use for your text classifier is the cross-entropy loss. For each review, this loss is the negative log of the probability that the model assigns to the correct class, so confident wrong predictions are penalized heavily. The cross-entropy loss is defined using the torch.nn.CrossEntropyLoss module, which can be constructed without arguments and returns a callable object that you apply to the output of your model and the target labels. The module applies the log-softmax itself, so it expects the raw scores (logits) of the linear layer and the class indices (0 or 1) as targets. In PyTorch it is equivalent to torch.nn.LogSoftmax followed by torch.nn.NLLLoss, the negative log-likelihood loss.

The optimizer that you will use for your text classifier is the Adam optimizer, which is a variant of the stochastic gradient descent (SGD) algorithm that adapts the learning rate for each parameter based on the history of gradients. Adam is a common default choice for training deep neural networks. It is defined using the torch.optim.Adam class, which you will call here with two arguments: the parameters of your model and the learning rate. The learning rate is a hyperparameter that controls how much the parameters are updated in each iteration of training. A common value for the learning rate is 0.001, which is also the default of torch.optim.Adam, but you can experiment with different values to find one that works well for your model.

To create the loss function and the optimizer, you can use the following code:

# Define the loss function
criterion = torch.nn.CrossEntropyLoss()

# Define the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
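To make the relationship between the cross-entropy loss and the negative log-likelihood concrete, the following small sketch uses made-up scores for three reviews and compares torch.nn.CrossEntropyLoss with a log-softmax followed by the negative log-likelihood loss. The tensors are illustrative values, not output of the tutorial's model:

```python
import torch
import torch.nn.functional as F

# Raw scores (logits) for 3 reviews and their true labels (0 = negative, 1 = positive)
logits = torch.tensor([[2.0, 0.5], [0.1, 1.2], [1.0, 1.0]])
target = torch.tensor([0, 1, 1])

criterion = torch.nn.CrossEntropyLoss()
loss = criterion(logits, target)

# The same value computed in two steps: log-softmax, then negative log-likelihood
manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

print(torch.allclose(loss, manual))  # True
```

This is why the model should hand raw scores to the loss function: the normalization is already part of the loss.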

To train your model, you need to perform the following steps in each iteration of training:

Set the model to training mode
Get a batch of input and target data from the dataloader
Pass the input data to the model and get the output
Compute the loss using the output and the target data
Backpropagate the loss and compute the gradients
Update the parameters using the optimizer
Reset the gradients to zero

To perform these steps, you can use the following code:

# Set the model to training mode
model.train()

# Get a batch of input and target data
input, target = next(iter(train_dataloader))

# Pass the input data to the model and get the output
output = model(input)

# Compute the loss using the output and the target data
loss = criterion(output, target)

# Backpropagate the loss and compute the gradients
loss.backward()

# Update the parameters using the optimizer
optimizer.step()

# Reset the gradients to zero
optimizer.zero_grad()
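The steps above can be tried in isolation with the following sketch. The model here is a single linear layer and the data is random, both stand-ins chosen only so that the example runs on its own. It also clears the gradients at the start of the step rather than at the end; either order works as long as the gradients are reset once per iteration:

```python
import torch

torch.manual_seed(0)

# Stand-in model and data (hypothetical): 8 samples with 16 features, 2 classes
model = torch.nn.Linear(16, 2)
inputs = torch.randn(8, 16)
targets = torch.randint(0, 2, (8,))

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

model.train()
before = model.weight.detach().clone()    # copy of the weights before the update

optimizer.zero_grad()                     # clear gradients left over from earlier steps
loss = criterion(model(inputs), targets)  # forward pass and loss
loss.backward()                           # compute the gradients
optimizer.step()                          # update the parameters

# The weights should differ from the saved copy after one optimizer step
changed = not torch.equal(before, model.weight.detach())
print(changed)
```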

To monitor the progress of your training, you can also print the loss value and the accuracy of your model on the training data in each iteration. The accuracy is the percentage of correct predictions made by your model on the training data. To compute the accuracy, you can use the following code:

# Get the predicted labels from the output
_, predicted = torch.max(output, 1)

# Compare the predicted labels with the target labels and count the number of correct predictions
correct = (predicted == target).sum().item()

# Compute the accuracy as the ratio of correct predictions to the total number of predictions
accuracy = correct / len(target)

# Print the loss and the accuracy
print(f'Loss: {loss.item():.4f}, Accuracy: {accuracy:.4f}')
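As a worked example of the accuracy computation, the following sketch uses hand-written scores for three reviews instead of real model output, so the result can be checked by eye:

```python
import torch

# Model outputs for 3 reviews (one score per class) and the true labels
output = torch.tensor([[2.0, 0.1], [0.3, 1.5], [1.0, 0.2]])
target = torch.tensor([0, 1, 1])

_, predicted = torch.max(output, 1)           # index of the largest score per row
correct = (predicted == target).sum().item()  # number of matching labels
accuracy = correct / len(target)

print(predicted.tolist())  # [0, 1, 0]
print(correct)             # 2
print(f'{accuracy:.4f}')   # 0.6667
```

The third review is misclassified, so two of the three predictions are correct.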

To complete the training process, you need to repeat these steps for a certain number of epochs, where an epoch is a complete pass over the entire training dataset. The number of epochs is another hyperparameter that you can tune to optimize your model. A common value for the number of epochs is 10, but you can experiment with different values to find the optimal one for your model.

To repeat the training steps for a number of epochs, you can use a for loop, as shown in the following code:

# Define the number of epochs
num_epochs = 10

# Loop over the epochs
for epoch in range(num_epochs):

  # Print the epoch number
  print(f'Epoch {epoch + 1}')

  # Loop over the batches in the dataloader
  for input, target in train_dataloader:

    # Perform the training steps as described above
    ...

  # Print a new line
  print()

By running this code, you will train your model on the training dataset and print the loss and the accuracy of your model in each iteration. You should see that the loss decreases and the accuracy increases as the training progresses, which means that your model is learning from the data and improving its performance.

The following subsections revisit the loss function and optimizer, the training loop, and the evaluation metrics in more detail.

5.1. Loss Function and Optimizer

In this section, you will learn how to define the loss function and the optimizer for your text classifier. The loss function is a measure of how well your model performs on the training data, and the optimizer is an algorithm that updates the parameters of your model based on the gradients computed by the autograd engine. The goal of training is to minimize the loss function and improve the accuracy of your model.

To create the loss function and the optimizer, you can use the following code:

# Define the loss function
criterion = torch.nn.CrossEntropyLoss()

# Define the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In the next section, you will learn how to implement the training loop for your model using PyTorch.

5.2. Training Loop

In this section, you will learn how to implement the training loop for your text classifier using PyTorch. The training loop is the process of iterating over the batches of the training dataset and performing the training steps that you learned in the previous section. The training loop allows you to update the parameters of your model and improve its performance on the training data.

To implement the training loop, you need to use a for loop that iterates over the number of epochs that you defined in the previous section. An epoch is a complete pass over the entire training dataset. The number of epochs is a hyperparameter that you can tune to optimize your model. A common value for the number of epochs is 10, but you can experiment with different values to find the optimal one for your model.

Inside the for loop, you need to use another for loop that iterates over the batches of the training dataloader that you created in the dataset preparation section. The dataloader is an object that provides a convenient way of loading and batching the data from the dataset. The dataloader returns a tuple of two tensors: the input tensor and the target tensor. The input tensor contains the encoded movie reviews, and the target tensor contains the corresponding labels (0 for negative and 1 for positive).
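To see what the dataloader yields, here is a small sketch with stand-in data. The tensors, their sizes, and the batch size are assumptions for illustration and do not reproduce the dataset prepared earlier:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data: 10 encoded reviews of 6 token ids each, with labels
reviews = torch.randint(0, 1000, (10, 6))
labels = torch.randint(0, 2, (10,))

dataset = TensorDataset(reviews, labels)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_shapes = []
for input_batch, target_batch in dataloader:
    # Each iteration yields one (input, target) pair of tensors
    batch_shapes.append((tuple(input_batch.shape), tuple(target_batch.shape)))

print(batch_shapes)  # [((4, 6), (4,)), ((4, 6), (4,)), ((2, 6), (2,))]
```

The last batch is smaller because 10 samples do not divide evenly into batches of 4; passing drop_last=True to DataLoader would discard it instead.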

For each batch, you need to perform the following training steps:

Set the model to training mode
Pass the input tensor to the model and get the output tensor
Compute the loss using the output tensor and the target tensor
Backpropagate the loss and compute the gradients
Update the parameters using the optimizer
Reset the gradients to zero

You also need to print the loss and the accuracy of your model on the training data in each iteration. The accuracy is the percentage of correct predictions made by your model on the training data. To compute the accuracy, you need to get the predicted labels from the output tensor and compare them with the target labels. You can use the torch.max function to get the index of the maximum value in each row of the output tensor, which corresponds to the predicted label. You can then call the sum and item methods on the result of the comparison to count the number of correct predictions and convert the count to a Python scalar. You can compute the accuracy as the ratio of correct predictions to the total number of predictions.

The following code shows how to implement the training loop for your text classifier:

# Define the number of epochs
num_epochs = 10

# Loop over the epochs
for epoch in range(num_epochs):

  # Print the epoch number
  print(f'Epoch {epoch + 1}')

  # Loop over the batches in the dataloader
  for input, target in train_dataloader:

    # Set the model to training mode
    model.train()

    # Pass the input tensor to the model and get the output tensor
    output = model(input)

    # Compute the loss using the output tensor and the target tensor
    loss = criterion(output, target)

    # Backpropagate the loss and compute the gradients
    loss.backward()

    # Update the parameters using the optimizer
    optimizer.step()

    # Reset the gradients to zero
    optimizer.zero_grad()

    # Get the predicted labels from the output tensor
    _, predicted = torch.max(output, 1)

    # Compare the predicted labels with the target labels and count the number of correct predictions
    correct = (predicted == target).sum().item()

    # Compute the accuracy as the ratio of correct predictions to the total number of predictions
    accuracy = correct / len(target)

    # Print the loss and the accuracy
    print(f'Loss: {loss.item():.4f}, Accuracy: {accuracy:.4f}')

  # Print a new line
  print()
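Printing the loss for every batch can produce a lot of output. As one possible variation, the following sketch accumulates the loss and the number of correct predictions and reports one average per epoch. The linear stand-in model, the random data, and the name history are assumptions made so that the example is self-contained:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-in data and model: 64 samples, 16 features, 2 classes
features = torch.randn(64, 16)
labels = torch.randint(0, 2, (64,))
train_dataloader = DataLoader(TensorDataset(features, labels), batch_size=16)
model = torch.nn.Linear(16, 2)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

history = []  # one (average loss, accuracy) pair per epoch
for epoch in range(3):
    model.train()
    total_loss, correct, total = 0.0, 0, 0
    for inputs, targets in train_dataloader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        # Accumulate sums so the epoch average is weighted by batch size
        total_loss += loss.item() * len(targets)
        correct += (outputs.argmax(dim=1) == targets).sum().item()
        total += len(targets)

    history.append((total_loss / total, correct / total))
    print(f'Epoch {epoch + 1}: loss {history[-1][0]:.4f}, accuracy {history[-1][1]:.4f}')
```

Storing the per-epoch values in a list also makes it easy to plot the training curve afterwards.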

In the next section, you will learn how to evaluate your model using various metrics and visualizations.

5.3. Evaluation Metrics

In this section, you will learn how to evaluate your model using various metrics and visualizations. Evaluation metrics are quantitative measures that assess how well your model performs on the test data, which is the data that your model has not seen during training. Evaluation metrics can help you compare different models and identify the strengths and weaknesses of your model. Visualizations are graphical representations that can help you understand the behavior and the performance of your model in a more intuitive way.

The evaluation metrics that you will use for your text classifier are accuracy, precision, recall, and F1-score. These metrics are commonly used for binary classification tasks, such as sentiment analysis. Accuracy is the percentage of correct predictions made by your model on the test data. Precision is the percentage of positive predictions that are actually positive. Recall is the percentage of positive instances that are correctly predicted. F1-score is the harmonic mean of precision and recall, which balances both metrics and gives a single score.

To compute these metrics, you need to use the sklearn.metrics module, which provides various functions for evaluating classification models. You need to import the following functions from the module: accuracy_score, precision_score, recall_score, and f1_score. You also need to get the predicted labels and the true labels from the output of your model and the target tensor, respectively. You can use the same code that you used in the training loop to get the predicted labels, and you can call the tensor's numpy method (for example, target.numpy()) to convert a tensor to a NumPy array.

To compute and print the evaluation metrics, you can use the following code:

# Import the evaluation functions from sklearn.metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Set the model to evaluation mode
model.eval()

# Get a batch of input and target data from the test dataloader
input, target = next(iter(test_dataloader))

# Pass the input data to the model and get the output (gradients are not needed for evaluation)
with torch.no_grad():
  output = model(input)

# Get the predicted labels from the output tensor
_, predicted = torch.max(output, 1)

# Convert the target and predicted tensors to NumPy arrays
target = target.numpy()
predicted = predicted.numpy()

# Compute the accuracy, precision, recall, and F1-score
accuracy = accuracy_score(target, predicted)
precision = precision_score(target, predicted)
recall = recall_score(target, predicted)
f1 = f1_score(target, predicted)

# Print the evaluation metrics
print(f'Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1-score: {f1:.4f}')
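The code above evaluates a single batch. To compute the metrics over the whole test set, one option is to collect the predictions of every batch and concatenate them before calling the scikit-learn functions. The following sketch shows this pattern; the linear stand-in model and the random test data are assumptions made so that it runs on its own:

```python
import torch
from sklearn.metrics import accuracy_score, f1_score
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-ins for the trained model and the test dataloader
model = torch.nn.Linear(16, 2)
test_dataloader = DataLoader(
    TensorDataset(torch.randn(40, 16), torch.randint(0, 2, (40,))), batch_size=16)

all_targets, all_predicted = [], []
model.eval()
with torch.no_grad():  # gradients are not needed for evaluation
    for inputs, targets in test_dataloader:
        predicted = model(inputs).argmax(dim=1)
        all_targets.append(targets)
        all_predicted.append(predicted)

# Concatenate the per-batch tensors and convert them to NumPy arrays
y_true = torch.cat(all_targets).numpy()
y_pred = torch.cat(all_predicted).numpy()

print(f'Accuracy: {accuracy_score(y_true, y_pred):.4f}')
print(f'F1-score: {f1_score(y_true, y_pred, zero_division=0):.4f}')
```

Because the stand-in model is untrained, the printed values are only meaningful as a check that the pipeline runs.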

The visualizations that you will use for your text classifier are the confusion matrix and the classification report. The confusion matrix is a table that shows the number of true positives, false positives, true negatives, and false negatives for your model. The confusion matrix can help you understand the errors and the biases of your model. The classification report is a table that shows the precision, recall, and F1-score for each class, as well as the support, which is the number of instances for each class. The classification report can help you compare the performance of your model for different classes.

To create these visualizations, you need the seaborn and matplotlib libraries, which provide functions for creating and displaying plots. After importing seaborn as sns and matplotlib.pyplot as plt, you will use sns.heatmap, plt.figure, plt.title, plt.xlabel, plt.ylabel, and plt.show. You also need two more functions from the sklearn.metrics module: confusion_matrix and classification_report. You can use the same predicted labels and true labels that you used for computing the evaluation metrics.

To create and display the confusion matrix, you can use the following code:

# Import the visualization functions from seaborn and matplotlib
import seaborn as sns
import matplotlib.pyplot as plt

# Import the confusion_matrix function from sklearn.metrics
from sklearn.metrics import confusion_matrix

# Compute the confusion matrix using the predicted labels and the true labels
cm = confusion_matrix(target, predicted)

# Create a figure object
plt.figure()

# Create a heatmap plot using the confusion matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')

# Add a title, x-axis label, and y-axis label to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')

# Display the plot
plt.show()

To create and display the classification report, you can use the following code:

# Import the classification_report function from sklearn.metrics
from sklearn.metrics import classification_report

# Compute the classification report using the predicted labels and the true labels
cr = classification_report(target, predicted, target_names=['Negative', 'Positive'])

# Print the classification report
print(cr)

By running this code, you will create and display the confusion matrix and the classification report for your model, which will give you a comprehensive overview of how your model performs on the test data.

In the next section, you will learn how to test your model on new and unseen movie reviews.

6. Model Testing

In this section, you will learn how to test your model on new and unseen movie reviews. Testing your model is the final step of the machine learning pipeline, where you evaluate how well your model generalizes to data that it has not been trained on. Testing your model can help you measure the real-world performance of your model and identify any potential issues or limitations.

To test your model, you need to use the test dataset that you created in the dataset preparation section. The test dataset contains 10,000 movie reviews that are different from the ones in the training dataset. The test dataset also has the same format as the training dataset, which means that each movie review is encoded as a tensor of integers and has a corresponding label (0 for negative and 1 for positive).

To load the test dataset, you use the test dataloader that you created in the dataset preparation section. Like the training dataloader, it provides a convenient way of loading and batching the data from the dataset, and it returns a tuple of two tensors: the input tensor and the target tensor. The input tensor contains the encoded movie reviews, and the target tensor contains the corresponding labels.

To test your model, you need to perform the following steps for each batch of the test dataset:

Set the model to evaluation mode
Pass the input tensor to the model and get the output tensor
Get the predicted labels from the output tensor
Compare the predicted labels with the target labels and count the number of correct predictions

You also need to keep track of the total number of predictions and the total number of correct predictions, so that you can compute the accuracy of your model on the test dataset. The accuracy is the percentage of correct predictions made by your model on the test dataset.

To perform these steps, you can use the following code:

# Set the model to evaluation mode
model.eval()

# Initialize the total number of predictions and the total number of correct predictions
total = 0
correct = 0

# Loop over the batches in the test dataloader without tracking gradients
with torch.no_grad():
  for input, target in test_dataloader:

    # Pass the input tensor to the model and get the output tensor
    output = model(input)

    # Get the predicted labels from the output tensor
    _, predicted = torch.max(output, 1)

    # Compare the predicted labels with the target labels and count the number of correct predictions
    correct += (predicted == target).sum().item()

    # Update the total number of predictions
    total += len(target)

# Compute the accuracy of the model on the test dataset
accuracy = correct / total

# Print the accuracy
print(f'Accuracy of the model on the test dataset: {accuracy:.4f}')

By running this code, you will test your model on the test dataset and print the accuracy of your model. If the accuracy on the test dataset is close to the accuracy on the training dataset, your model has learned to generalize to new data. If it is much lower, your model is probably overfitting the training data.

The following subsections look at the testing loop, the confusion matrix, and the classification report in turn.

6.1. Testing Loop

To test your model, you need to perform the following steps for each batch of the test dataset:

Set the model to evaluation mode
Pass the input tensor to the model and get the output tensor
Get the predicted labels from the output tensor
Compare the predicted labels with the target labels and count the number of correct predictions

To perform these steps, you can use the following code:

# Set the model to evaluation mode
model.eval()

# Initialize the total number of predictions and the total number of correct predictions
total = 0
correct = 0

# Loop over the batches in the test dataloader without tracking gradients
with torch.no_grad():
  for input, target in test_dataloader:

    # Pass the input tensor to the model and get the output tensor
    output = model(input)

    # Get the predicted labels from the output tensor
    _, predicted = torch.max(output, 1)

    # Compare the predicted labels with the target labels and count the number of correct predictions
    correct += (predicted == target).sum().item()

    # Update the total number of predictions
    total += len(target)

# Compute the accuracy of the model on the test dataset
accuracy = correct / total

# Print the accuracy
print(f'Accuracy of the model on the test dataset: {accuracy:.4f}')

In the next section, you will learn how to create and interpret a confusion matrix for your model.

6.2. Confusion Matrix

In this section, you will learn how to create and interpret a confusion matrix for your text classifier. A confusion matrix is a table that shows the number of true positives, false positives, true negatives, and false negatives for your model. The confusion matrix can help you understand the errors and the biases of your model.

A true positive is a movie review that is correctly predicted as positive by your model. A false positive is a movie review that is incorrectly predicted as positive by your model, when it is actually negative. A true negative is a movie review that is correctly predicted as negative by your model. A false negative is a movie review that is incorrectly predicted as negative by your model, when it is actually positive.

The confusion matrix has four cells, each representing one of these four outcomes. The rows of the confusion matrix correspond to the true labels of the movie reviews, and the columns correspond to the predicted labels of the movie reviews. The diagonal cells show the number of correct predictions, and the off-diagonal cells show the number of incorrect predictions. The following table shows an example of a confusion matrix for your text classifier:

	Predicted Positive	Predicted Negative
Actual Positive	4500	500
Actual Negative	400	4600

This confusion matrix shows that your model correctly predicted 4500 positive movie reviews and 4600 negative movie reviews, and incorrectly predicted 500 positive movie reviews as negative and 400 negative movie reviews as positive. You can use these numbers to compute various metrics, such as accuracy, precision, recall, and F1-score, which you learned in the previous section.
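As a worked example, the metrics can be computed by hand from the four counts in the example table. The variable names below are chosen for this illustration:

```python
# Counts taken from the example confusion matrix above
tp, fn = 4500, 500   # actual positives: predicted positive / predicted negative
fp, tn = 400, 4600   # actual negatives: predicted positive / predicted negative

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f'{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}')
# 0.9100 0.9184 0.9000 0.9091
```

Here precision is slightly higher than recall because the model produces fewer false positives (400) than false negatives (500).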

To create a confusion matrix for your model, you need to use the sklearn.metrics module, which provides various functions for evaluating classification models. You need to import the confusion_matrix function from the module, which takes two arguments: the true labels and the predicted labels. You can use the same predicted labels and true labels that you used for computing the evaluation metrics in the previous section.

To create and print the confusion matrix, you can use the following code:

# Import the confusion_matrix function from sklearn.metrics
from sklearn.metrics import confusion_matrix

# Compute the confusion matrix using the predicted labels and the true labels
cm = confusion_matrix(target, predicted)

# Print the confusion matrix
print(cm)

By running this code, you will create and print the confusion matrix for your model, which will give you a detailed overview of how your model performs on the test data.
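One detail worth knowing when reading the printed matrix is that scikit-learn sorts rows and columns by label value, so with labels 0 (negative) and 1 (positive) the negative class comes first. The following sketch uses a small hand-made set of labels to show the layout:

```python
from sklearn.metrics import confusion_matrix

# Small hand-made example (0 = negative, 1 = positive)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm.tolist())  # [[3, 1], [1, 3]]

# Rows are true labels and columns are predicted labels, both sorted by label value,
# so for binary labels the flattened matrix is (tn, fp, fn, tp)
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```

Compared with the example table above, which lists the positive class first, the printed matrix therefore has its rows and columns in the opposite order.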

In the next section, you will learn how to create and interpret a classification report for your model.

6.3. Classification Report

In this section, you will learn how to create and interpret a classification report for your text classifier. A classification report is a table that shows the precision, recall, and F1-score for each class, as well as the support, which is the number of instances for each class. The classification report can help you compare the performance of your model for different classes and identify any imbalances or discrepancies.

Precision is the percentage of positive predictions that are actually positive. Recall is the percentage of positive instances that are correctly predicted. F1-score is the harmonic mean of precision and recall, which balances both metrics and gives a single score. Support is the number of instances for each class in the test dataset. The classification report also shows the weighted average of these metrics across all classes, which gives an overall measure of the performance of your model.

The classification report has one row per class, followed by summary rows such as the weighted average, and four columns of values. The rows correspond to the classes in your dataset (negative and positive), and the columns correspond to the metrics or values (precision, recall, F1-score, and support). The following table shows an example of a classification report for your text classifier:

	Precision	Recall	F1-score	Support
Negative	0.92	0.91	0.91	5000
Positive	0.91	0.92	0.91	5000
Weighted Average	0.91	0.91	0.91	10000

This classification report shows that your model has a high precision, recall, and F1-score for both classes, which means that your model can accurately predict the sentiment of movie reviews. It also shows that the support for both classes is equal, which means that your dataset is balanced and does not have any class imbalance issues.

To create a classification report for your model, you need to use the sklearn.metrics module, which provides various functions for evaluating classification models. You need to import the classification_report function from the module, which takes three arguments: the true labels, the predicted labels, and the target names. You can use the same predicted labels and true labels that you used for computing the evaluation metrics and the confusion matrix in the previous sections. You can also specify the target names as a list of strings that represent the classes in your dataset.

To create and print the classification report, you can use the following code:

# Import the classification_report function from sklearn.metrics
from sklearn.metrics import classification_report

# Compute the classification report using the predicted labels and the true labels
cr = classification_report(target, predicted, target_names=['Negative', 'Positive'])

# Print the classification report
print(cr)

By running this code, you will create and print the classification report for your model, which will give you a comprehensive overview of how your model performs on the test data for each class.
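If you want to use individual values from the report in your own code rather than read them from the printed table, classification_report can return a dictionary when called with output_dict=True. The following sketch uses a small hand-made set of labels for illustration:

```python
from sklearn.metrics import classification_report

# Small hand-made example (0 = negative, 1 = positive)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

report = classification_report(
    y_true, y_pred, target_names=['Negative', 'Positive'], output_dict=True)

# Per-class entries are keyed by the target names; averages have their own keys
print(report['Positive']['precision'])  # 0.75
print(report['Negative']['support'])
print(report['weighted avg']['f1-score'])
```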

This completes the evaluation of your model on the test data.

7. Conclusion

In this blog, you have learned how to build, train, and evaluate a text classifier using PyTorch and a dataset of movie reviews. You have covered the following topics:

How to use PyTorch’s tensors, autograd, and modules and functions to create and manipulate data and models
How to prepare the dataset for text classification, including loading, tokenizing, encoding, padding, and batching the data
How to define the model architecture using PyTorch’s modules and functions, including the embedding layer, the LSTM layer, and the linear layer
How to train the model using PyTorch’s loss function and optimizer, including the training loop and the evaluation metrics
How to test the model on new and unseen data, including the testing loop, the confusion matrix, and the classification report

By following this tutorial, you have created a text classifier that can accurately predict the sentiment of movie reviews, as well as gained a solid understanding of how to use PyTorch for natural language processing tasks.

PyTorch is a powerful and flexible framework that can help you create and train your own deep learning models for various applications. You can explore more features and functionalities of PyTorch by visiting the official website and the official tutorials. You can also find more datasets and models for natural language processing by visiting the Hugging Face website, which provides a collection of pre-trained models and datasets for NLP.

We hope you enjoyed this blog and learned something new and useful. If you have any questions or feedback, please feel free to leave a comment below. Thank you for reading and happy coding!
