PyTorch for NLP: Working with Text Data

This blog teaches you how to preprocess, tokenize, and encode text data for NLP tasks using PyTorch, a popular deep learning framework.

1. Introduction

PyTorch is a popular open-source deep learning framework that provides a flexible and expressive way to build and train neural networks. It is widely used for various applications, such as computer vision, natural language processing (NLP), reinforcement learning, and more.

One of the most common and challenging tasks in NLP is working with text data. Text data is unstructured, noisy, and often contains multiple languages, dialects, and domains. To use text data for NLP tasks, such as sentiment analysis, text summarization, machine translation, etc., we need to perform some preprocessing steps to transform the raw text into a suitable representation for the neural network.

In this blog, you will learn how to use PyTorch to perform the following steps for working with text data:

  • Preprocessing: cleaning, normalizing, and removing unwanted characters and tokens from the text.
  • Tokenization: splitting the text into smaller units, such as words, subwords, or characters.
  • Encoding: converting the tokens into numerical values, such as indices, one-hot vectors, or embeddings.

You will also learn how to use torchtext, a PyTorch library that provides tools and datasets for text data processing. Finally, you will build a simple text classifier with PyTorch using the preprocessed and encoded text data.

By the end of this blog, you will have a solid understanding of how to work with text data for NLP tasks using PyTorch. You will also be able to apply the same techniques to your own text data and build your own NLP models with PyTorch.

Are you ready to get started? Let’s dive in!

2. Loading and Exploring Text Data

The first step for working with text data is to load and explore the data. You need to know what kind of data you are dealing with, such as the size, format, language, domain, and quality of the data. This will help you to decide how to preprocess, tokenize, and encode the data for your NLP task.

In this section, you will learn how to load and explore a text dataset using PyTorch. You will use the IMDB movie reviews dataset, which contains 50,000 movie reviews labeled as positive or negative. This dataset is commonly used for sentiment analysis, a type of NLP task that aims to identify and extract the emotions and opinions expressed in a text.

To load the dataset, you will use the torch.utils.data module, which provides various classes and functions for loading, manipulating, and transforming data. You will use the Dataset class, which represents a collection of data samples, and the DataLoader class, which provides an iterator over a dataset. You will also use the pandas library, which is a popular tool for data analysis and manipulation.

Let’s start by importing the necessary libraries and loading the dataset as a pandas dataframe:

import torch
import pandas as pd
from torch.utils.data import Dataset, DataLoader

# Load the dataset as a pandas dataframe
df = pd.read_csv("IMDB Dataset.csv")

# Print the first five rows of the dataframe
print(df.head())

The output should look something like this:

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive

As you can see, the dataframe has two columns: review and sentiment. The review column contains the text of the movie review, and the sentiment column contains the label of the review, either positive or negative. You can also see that the reviews contain some HTML tags, such as <br />, which indicate line breaks.

To explore the dataset further, you can use some pandas methods and attributes, such as shape, info, describe, and value_counts. For example, you can check the number of rows and columns in the dataframe, the data types of the columns, the summary statistics of the columns, and the distribution of the labels:

# Check the number of rows and columns in the dataframe
print(df.shape)
# Output: (50000, 2)

# Check the data types of the columns
print(df.info())
# Output: 
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 50000 entries, 0 to 49999
# Data columns (total 2 columns):
#  #   Column     Non-Null Count  Dtype 
# ---  ------     --------------  ----- 
#  0   review     50000 non-null  object
#  1   sentiment  50000 non-null  object
# dtypes: object(2)
# memory usage: 781.4+ KB
# None

# Check the summary statistics of the columns
print(df.describe())
# Output: 
#                                                    review sentiment
# count                                               50000     50000
# unique                                              49582         2
# top     Loved today's show!!! It was a variety and not...  negative
# freq                                                    5     25000

# Check the distribution of the labels
print(df["sentiment"].value_counts())
# Output: 
# negative    25000
# positive    25000
# Name: sentiment, dtype: int64

From the output, you can see that the dataframe has 50,000 rows and 2 columns, both of which are of type object (string). You can also see that there are some duplicate reviews in the dataset, as the number of unique reviews is 49,582. You can also see that the dataset is balanced, as there are 25,000 positive and 25,000 negative reviews.
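The same dataframe can also be plugged into the torch.utils.data classes mentioned above. Here is a minimal sketch (the IMDBDataset name is just an illustrative choice) of wrapping the reviews in a custom Dataset and iterating over raw (text, label) pairs with a DataLoader:

import torch
from torch.utils.data import Dataset, DataLoader

class IMDBDataset(Dataset):
    """A minimal Dataset wrapper around the pandas dataframe loaded above."""
    def __init__(self, dataframe):
        self.reviews = dataframe["review"].tolist()
        # Map the string labels to 1.0 (positive) and 0.0 (negative)
        self.labels = [1.0 if s == "positive" else 0.0 for s in dataframe["sentiment"]]

    def __len__(self):
        return len(self.reviews)

    def __getitem__(self, idx):
        return self.reviews[idx], torch.tensor(self.labels[idx])

# Batch and shuffle the raw examples
dataset = IMDBDataset(df)
loader = DataLoader(dataset, batch_size=4, shuffle=True)
texts, labels = next(iter(loader))
print(len(texts), labels)

Note that the DataLoader only batches the raw strings here; the next sections show how to turn them into tensors.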

Now that you have loaded and explored the dataset, you can move on to the next step: preprocessing the text data.

3. Preprocessing Text Data

Before you can use text data for NLP tasks, you need to perform some preprocessing steps to clean and normalize the data. Preprocessing text data involves removing or modifying any unwanted or irrelevant characters, tokens, or features from the text. This can help to reduce the noise and complexity of the data, and make it more suitable for the neural network.

There are many ways to preprocess text data, depending on the type and quality of the data, and the goal of the NLP task. Some common preprocessing steps include:

  • Removing punctuation, numbers, special characters, HTML tags, etc.
  • Lowercasing or capitalizing the text.
  • Removing stopwords, which are common words that do not add much meaning to the text, such as “the”, “a”, “and”, etc.
  • Stemming or lemmatizing the text, which are techniques to reduce words to their root form, such as “running” to “run”.
  • Correcting spelling errors or typos.
  • Removing or replacing slang, abbreviations, contractions, etc.

The choice and order of the preprocessing steps depend on the specific characteristics and requirements of the text data and the NLP task. For example, if you are working with social media posts, you might want to remove or replace slang, abbreviations, and emoticons, but if you are working with formal documents, you might not need to do that. Similarly, if you are working with a case-sensitive task, such as named entity recognition, you might want to preserve the capitalization of the text, but if you are working with a case-insensitive task, such as sentiment analysis, you might want to lowercase the text.
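As a concrete illustration of the first two steps, here is a minimal cleaning sketch that uses regular expressions and string methods; the clean_text helper is an illustrative name, not part of PyTorch or torchtext:

import re

def clean_text(text):
    # Lowercase the text and strip HTML tags, punctuation, and extra whitespace
    text = text.lower()                       # lowercase
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags such as <br />
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove punctuation and special characters
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    return text

print(clean_text('A wonderful little production. <br /><br />The acting is great!'))
# Output: a wonderful little production the acting is great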

In this section, you will learn how to preprocess the IMDB movie reviews dataset using PyTorch. You will use the torchtext.data module (moved to torchtext.legacy.data in newer torchtext releases), which provides various classes and functions for text data preprocessing. You will use the Field class, which defines how to process a certain type of data, such as text or labels, and some of its arguments, such as lower, tokenize, and stop_words.

Let’s start by importing the necessary libraries and defining the fields for the text and label columns of the dataset:

import torch
import torchtext
from torchtext.data import Field, TabularDataset # torchtext.legacy.data in newer torchtext releases
from spacy.lang.en.stop_words import STOP_WORDS # spaCy's built-in set of English stopwords

# Define the field for the text column
TEXT = Field(sequential=True, # The data is sequential
             lower=True, # Lowercase the text
             tokenize="spacy", # Use spacy tokenizer
             stop_words=STOP_WORDS, # Remove stopwords (Field expects a collection of tokens, not a language name)
             batch_first=True) # The first dimension of the tensor is the batch size

# Define the field for the label column
LABEL = Field(sequential=False, # The data is not sequential
              use_vocab=False, # Do not use vocabulary
              preprocessing=lambda x: 1.0 if x == "positive" else 0.0, # Map the string labels to floats
              is_target=True, # The data is the target
              batch_first=True, # The first dimension of the tensor is the batch size
              dtype=torch.float) # The data type is float

As you can see, the Field class takes several arguments that specify how to process the data. For the text column, you set the sequential argument to True, as the text is a sequence of tokens. You also set the lower argument to True, to lowercase the text. You set the tokenize argument to “spacy”, to use the spacy tokenizer, which is a popular and powerful tool for natural language processing. You set the stop_words argument to spaCy's STOP_WORDS set, so that common English stopwords are removed from the text after tokenization. You set the batch_first argument to True, to make the first dimension of the tensor be the batch size, which is a common convention in PyTorch.

For the label column, you set the sequential argument to False, as the label is not a sequence of tokens, but a single value. You also set the use_vocab argument to False, as you do not need a vocabulary for the label; instead, a small preprocessing function maps the string labels “positive” and “negative” to the numerical values 1.0 and 0.0. You set the is_target argument to True, to indicate that the label is the target value. You set the batch_first argument to True, for the same reason as before. You set the dtype argument to torch.float, to specify the data type of the label as a float.

Now that you have defined the fields, you can use them to create a TabularDataset object, which represents a dataset in a tabular format, such as a CSV file. You will use the TabularDataset.splits method, which takes the paths of the train, validation, and test files and the fields for each column, and returns a tuple of TabularDataset objects, one for each split. You will also use the format argument to specify the format of the files as CSV, and the skip_header argument to skip the header row of each file:

# Create the TabularDataset objects for each split
train_data, valid_data, test_data = TabularDataset.splits(
    path=".", # The path where the files are located
    train="IMDB_train.csv", # The name of the train file
    validation="IMDB_valid.csv", # The name of the validation file
    test="IMDB_test.csv", # The name of the test file
    format="csv", # The format of the files
    fields=[("text", TEXT), ("label", LABEL)], # The fields for each column
    skip_header=True) # Skip the header row of the files

The output should be three TabularDataset objects, one for each split. You can check the length of each object and its examples attribute to see how many examples each split contains and what the data looks like:

# Check the length of each object
print(len(train_data))
print(len(valid_data))
print(len(test_data))
# Output: 
# 40000
# 5000
# 5000

# Check the examples attribute of the train split
print(vars(train_data.examples[0]))
# Output: 
# {'text': ['one', 'reviewers', 'mentioned', 'watching', '1', 'oz', 'episode', 'hooked', '.', 'right', ',', 'exactly', 'happened', '.', 'first', 'thing', 'struck', 'oz', 'brutality', 'unflinching', 'scenes', 'violence', ',', 'set', 'right', 'word', 'go', 'trust', '.', 'show', 'faint', 'hearted', 'timid', '.', 'show', 'pulls', 'punches', 'regards', 'drugs', ',', 'sex', 'violence', '.', 'hardcore', ',', 'classic', 'use', 'word', '.', 'called', 'oz', 'nickname', 'given', 'oswald', 'maximum', 'security', 'state', 'penitentary', '.', 'tells', 'story', 'inmates', 'prison', 'emerald', 'city', ',', 'experimental', 'section', 'prison', 'inmates', 'get', 'chance', 'get', 'clink', 'stay', 'long', 'well', 'behaved', 'hard', 'place', 'stay', 'well', 'behaved', ',', 'especially', 'warring', 'factions', 'prison', 'want', 'kill', '.', 'em', 'city', 'home', 'neo', 'nazis', ',', 'muslims', ',', 'latinos', ',', 'christians', ',', 'italians', ',', 'irish', 'bikers', '.', 'scary', 'place', '.', 'fact', 'main', 'character', 'killed', 'end', 'first', 'episode', 'demonstrates', 'willing', 'show', 'go', '.', 'kill', 'characters', 'right', 'left', ',', 'makes', 'hardcore', 'drama', 'ever', 'appeared', 'television', '.', 'anywhere', '.', 'nick', 'leeson', 'trust', '(', 'betrayed', ')', 'guy', 'broke', 'bank', 'one', 'inmates', 'makes', 'interesting', 'viewing', '.', 'oz', 'never', 'boring', ',', 'always', 'controversial', ',', 'brilliantly', 'written', ',', 'produced', 'acted', '.', 'praise', 'show', '.', 'watch', '...', 'hooked', '.'], 'label': 1.0}

As you can see, the text column has been preprocessed according to the field definition. The text has been lowercased, tokenized, and stopwords have been removed. The label column has been kept as a float value. You can also see that the dataset has 40,000 examples for the train split, and 5,000 examples for the validation and test splits.

Now that you have created the TabularDataset objects, you can move on to the next step: tokenizing the text data.

4. Tokenizing Text Data

After preprocessing the text data, the next step is to tokenize the data. Tokenization is the process of splitting the text into smaller units, such as words, subwords, or characters. Tokens are the basic units of text that can be processed by the neural network.

There are different ways to tokenize text data, depending on the level of granularity and the vocabulary size. Some common tokenization methods are:

  • Word-level tokenization: This is the simplest and most common method, where the text is split by whitespace and punctuation. Each word is a token, and the vocabulary size is the number of unique words in the text.
  • Subword-level tokenization: This is a more advanced and efficient method, where the text is split by smaller units that capture the morphology and meaning of the words. Each subword is a token, and the vocabulary size is smaller than the word-level method. Some examples of subword-level tokenization algorithms are Byte Pair Encoding (BPE), WordPiece, and SentencePiece.
  • Character-level tokenization: This is the finest and most granular method, where the text is split by individual characters. Each character is a token, and the vocabulary size is the number of unique characters in the text.

The choice of the tokenization method depends on the type and complexity of the text data, and the goal of the NLP task. For example, if you are working with a large and diverse text corpus, you might want to use a subword-level tokenization method, as it can reduce the vocabulary size and handle rare and unknown words better. However, if you are working with a small and simple text corpus, you might want to use a word-level tokenization method, as it can preserve the meaning and structure of the words better.

In this section, you will learn how to tokenize the IMDB movie reviews dataset using PyTorch. You will use the torchtext.data module, which provides various classes and functions for text data tokenization. You will use the Field class, which defines how to tokenize a certain type of data, such as text or labels. You will also use some built-in methods and attributes of the Field class, such as tokenize, build_vocab, vocab, etc.

You have already defined the fields for the text and label columns of the dataset in the previous section, and you have already used the tokenize argument to specify the tokenizer for the text column. You have used the spacy tokenizer, which is a word-level tokenizer that can handle multiple languages and perform other linguistic tasks, such as part-of-speech tagging and dependency parsing. You can also use other tokenizers, such as “basic_english”, which is a simple and fast tokenizer that only splits the text by punctuation and whitespace, or “moses”, which is a tokenizer that follows the Moses tokenization rules, which are commonly used for machine translation.
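To make the difference concrete, here is a small sketch comparing the two tokenizers via torchtext's get_tokenizer helper; it assumes the en_core_web_sm spaCy model is installed:

from torchtext.data.utils import get_tokenizer

spacy_tok = get_tokenizer("spacy", language="en_core_web_sm")  # word-level, linguistically aware
basic_tok = get_tokenizer("basic_english")                     # lowercases and splits on whitespace/punctuation

sentence = "This movie wasn't bad at all!"
print(spacy_tok(sentence))  # spaCy splits "wasn't" into "was" and "n't" and keeps the original casing
print(basic_tok(sentence))  # basic_english lowercases everything and separates the punctuation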

To apply the tokenizer to the text column, you need to use the build_vocab method of the Field class, which builds the vocabulary for the field. The vocabulary is a mapping of tokens to numerical values, such as indices or frequencies. The vocabulary is essential for encoding the text data into numerical values that can be processed by the neural network.

You can use the vocab attribute of the Field class to access the vocabulary object, which has several attributes and methods, such as stoi, which is a dictionary of token to index, itos, which is a list of index to token, freqs, which is a dictionary of token to frequency, etc.

Let’s see how to use the build_vocab method and the vocab attribute to tokenize and build the vocabulary for the text column:

# Build the vocabulary for the text column
TEXT.build_vocab(train_data)

# Check the vocab attribute of the TEXT field
print(TEXT.vocab)
# Output: 
# <torchtext.vocab.Vocab object at 0x...>

# Check the size of the vocabulary
print(len(TEXT.vocab))
# Output: 
# 101636

# Check the most common tokens in the vocabulary
print(TEXT.vocab.freqs.most_common(10))
# Output: 
# [('.', 44848), (',', 40705), ('movie', 20787), ('film', 18438), ('"', 14742), ('one', 12724), ('like', 10523), ('good', 8215), ('would', 7954), ('time', 7586)]

# Check the index of a token in the vocabulary
print(TEXT.vocab.stoi["movie"])
# Output: 
# 3

# Check the token of an index in the vocabulary
print(TEXT.vocab.itos[3])
# Output: 
# movie

As you can see, the build_vocab method takes the train_data object as an argument, and builds the vocabulary for the text column. The vocab attribute of the TEXT field returns a Vocab object, which has various attributes and methods for accessing and manipulating the vocabulary. You can see that the size of the vocabulary is 101,636, which is the number of unique tokens in the text column. You can also see the most common tokens in the vocabulary, and their frequencies. You can also see how to get the index of a token, or the token of an index, using the stoi and itos attributes.

Now that you have tokenized and built the vocabulary for the text column, you can move on to the next step: encoding the text data.

5. Encoding Text Data

After tokenizing the text data, the final step is to encode the data. Encoding is the process of converting the tokens into numerical values, such as indices, one-hot vectors, or embeddings. Numerical values are the only type of data that can be processed by the neural network.

There are different ways to encode text data, depending on the level of representation and the dimensionality of the data. Some common encoding methods are:

  • Index encoding: This is the simplest and most common method, where each token is assigned a unique index based on the vocabulary. The text is then represented as a sequence of indices. The dimensionality of the data is equal to the vocabulary size.
  • One-hot encoding: This is a method where each token is represented as a vector of zeros and one, where the one corresponds to the index of the token in the vocabulary. The text is then represented as a matrix of one-hot vectors. The dimensionality of the data is also equal to the vocabulary size.
  • Embedding encoding: This is a more advanced and efficient method, where each token is represented as a low-dimensional vector of real values, which captures the semantic and syntactic features of the token. The text is then represented as a matrix of embedding vectors. The dimensionality of the data is much smaller than the vocabulary size, and can be learned from the data or pre-trained on a large corpus. Some examples of embedding encoding methods are word2vec, GloVe, and BERT.

The choice of the encoding method depends on the type and complexity of the text data, and the goal of the NLP task. For example, if you are working with a large and diverse text corpus, you might want to use an embedding encoding method, as it can reduce the dimensionality and sparsity of the data, and capture the meaning and context of the tokens better. However, if you are working with a small and simple text corpus, you might want to use an index or one-hot encoding method, as it can preserve the identity and frequency of the tokens better.
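The three levels of representation can be illustrated directly in PyTorch. Below is a minimal sketch using a toy vocabulary of 10 tokens; the indices and the embedding size are arbitrary illustrative values:

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 10                   # toy vocabulary of 10 tokens
tokens = torch.tensor([3, 5, 1])  # index encoding: a 3-token sequence of vocabulary indices

# One-hot encoding: each index becomes a vocab_size-dimensional vector of zeros and a single one
one_hot = F.one_hot(tokens, num_classes=vocab_size).float()
print(one_hot.shape)  # torch.Size([3, 10])

# Embedding encoding: each index is looked up in a learnable low-dimensional table
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=4)
vectors = embedding(tokens)
print(vectors.shape)  # torch.Size([3, 4])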

In this section, you will learn how to encode the IMDB movie reviews dataset using PyTorch. You will use the torchtext.data module, which provides various classes and functions for text data encoding. You will use the Field class, which defines how to encode a certain type of data, such as text or labels. You will also use some built-in methods and attributes of the Field class, such as numericalize, pad, init_token, eos_token, etc.

You have already defined the fields for the text and label columns of the dataset in the previous sections, and you have already used the use_vocab argument to specify whether to use a vocabulary for the data or not. The use_vocab argument is left at its default of True for the text column and set to False for the label column. This means that the text column will be encoded using the vocabulary that you built in the previous section, and the label column will be kept as a numerical value.

To apply the encoding to the data, you can use the numericalize method of the Field class, which converts the tokens into numerical values, such as vocabulary indices. You can also use the pad_token argument to specify the padding token, which is used to fill the shorter sequences to match the length of the longest sequence in a batch, and the init_token and eos_token arguments to specify the start-of-sequence and end-of-sequence tokens, which mark the beginning and end of the sequences.

Let’s see how to use the numericalize method and the other arguments to encode the data:

# Define the device to run the tensors on
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Numericalize the data using the fields
train_data, valid_data, test_data = TabularDataset.splits(
    path=".", # The path where the files are located
    train="IMDB_train.csv", # The name of the train file
    validation="IMDB_valid.csv", # The name of the validation file
    test="IMDB_test.csv", # The name of the test file
    format="csv", # The format of the files
    fields=[("text", TEXT), ("label", LABEL)], # The fields for each column
    skip_header=True) # Skip the first row of the files

# Set the padding token to <pad>
TEXT.pad_token = "<pad>"

# Set the initial token to <sos>
TEXT.init_token = "<sos>"

# Set the end-of-sequence token to <eos>
TEXT.eos_token = "<eos>"
# (In practice these special tokens are usually passed to the Field constructor,
# so that build_vocab assigns them indices in the vocabulary.)

# Create the iterators for each split
train_iterator, valid_iterator, test_iterator = torchtext.data.BucketIterator.splits(
    (train_data, valid_data, test_data), # The datasets for each split
    batch_size=64, # The batch size
    device=device) # The device to run the tensors on

The output should be three BucketIterator objects, one for each split. You can inspect the text attribute of the first batch of the train iterator to see what the data looks like after encoding:

# Check the text tensor of the first batch of the train iterator
batch = next(iter(train_iterator))
print(batch.text)
# Output: 
# tensor([[    2,     2,     2,  ...,     2,     2,     2],
#         [  108,   108,   108,  ...,   108,   108,   108],
#         [   11,    11,    11,  ...,    11,    11,    11],
#         ...,
#         [    3,     3,     3,  ...,     3,     3,     3],
#         [    1,     1,     1,  ...,     1,     1,     1],
#         [    1,     1,     1,  ...,     1,     1,     1]], device='cuda:0')

As you can see, the data has been encoded as a tensor of indices. Because batch_first is set to True, each row corresponds to one review in the batch and each column corresponds to a token position. You can also see that the shorter sequences have been padded with the index 1, which corresponds to the <pad> token, and that the sequences start and end with the indices 2 and 3, which correspond to the <sos> and <eos> tokens, respectively.
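Under the hood, the BucketIterator calls the field's process method, which first pads and then numericalizes a batch of tokenized examples. Here is a minimal sketch of doing that manually, assuming the TEXT field and vocabulary from the previous sections; the example tokens are made up for illustration:

# Two tokenized reviews of different lengths
minibatch = [["great", "movie"], ["boring", "plot", "bad", "acting"]]

# process() pads the shorter sequence and converts the tokens to vocabulary indices
tensor = TEXT.process(minibatch, device=device)
print(tensor.shape)  # (batch_size, max_sequence_length), because batch_first=True
print(tensor)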

Now that you have encoded the data, you have completed the steps for working with text data using PyTorch. You are ready to use the data for your NLP task, which in this case is building a text classifier with PyTorch.

6. Using torchtext for Text Data

In the previous sections, you learned how to use PyTorch to preprocess, tokenize, and encode text data. However, these steps can be tedious and time-consuming, especially if you have to deal with different types of text data and different NLP tasks. Fortunately, there is a PyTorch library that can simplify and automate these steps for you: torchtext.

Torchtext is a PyTorch library that provides tools and datasets for text data processing. It allows you to easily load, preprocess, tokenize, and encode text data using built-in classes and methods. It also provides various datasets and pretrained embeddings for common NLP tasks, such as sentiment analysis, machine translation, text summarization, etc.

In this section, you will learn how to use torchtext to perform the same steps that you did in the previous sections, but with much less code and effort. You will use the following classes and methods from torchtext:

  • Field: a class that defines how to process a certain type of text data, such as reviews or labels. It allows you to specify the preprocessing, tokenization, and encoding methods, as well as other parameters, such as the vocabulary size, the padding token, the unknown token, etc.
  • TabularDataset: a class that represents a dataset in a tabular format, such as a CSV file. It allows you to load and split the dataset into train, validation, and test sets, and apply the fields to each column of the dataset.
  • BucketIterator: a class that provides an iterator over a dataset. It allows you to batch and shuffle the data, and sort the data by length to minimize the amount of padding.
  • build_vocab: a method that builds the vocabulary for a field. It allows you to specify the minimum frequency of the tokens, the maximum size of the vocabulary, and the pretrained embeddings to use.

Let’s see how to use these classes and methods to work with the IMDB movie reviews dataset that you used in the previous sections.
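Below is a condensed sketch of the whole pipeline from the previous sections in one place, this time passing the special tokens to the Field constructor and loading pretrained GloVe vectors in build_vocab; the file names, vocabulary size, and batch size simply mirror the earlier examples:

import torch
from torchtext.data import Field, TabularDataset, BucketIterator  # torchtext.legacy.data in newer releases

TEXT = Field(sequential=True, lower=True, tokenize="spacy",
             init_token="<sos>", eos_token="<eos>", batch_first=True)
LABEL = Field(sequential=False, use_vocab=False, is_target=True, batch_first=True,
              dtype=torch.float,
              preprocessing=lambda x: 1.0 if x == "positive" else 0.0)

# Load the CSV splits and apply the fields to each column
train_data, valid_data, test_data = TabularDataset.splits(
    path=".", train="IMDB_train.csv", validation="IMDB_valid.csv", test="IMDB_test.csv",
    format="csv", skip_header=True,
    fields=[("text", TEXT), ("label", LABEL)])

# Build the vocabulary: keep at most 25,000 tokens seen at least twice,
# and initialize them with pretrained GloVe embeddings
TEXT.build_vocab(train_data, max_size=25000, min_freq=2, vectors="glove.6B.100d")

# Batch the data, sorting by length to minimize padding
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=64,
    sort_key=lambda ex: len(ex.text), sort_within_batch=True, device=device)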

7. Building a Text Classifier with PyTorch

Now that you have learned how to use torchtext to load, preprocess, tokenize, and encode text data, you are ready to build a text classifier with PyTorch. A text classifier is a type of NLP model that assigns a label to a given text, based on its content and meaning. For example, a text classifier can predict the sentiment of a movie review, the topic of a news article, the category of a product description, etc.

In this section, you will learn how to build a simple text classifier with PyTorch using the IMDB movie reviews dataset that you used in the previous sections. You will use the following steps to build the text classifier:

  1. Define the model architecture: you will use a recurrent neural network (RNN), which is a type of neural network that can process sequential data, such as text. You will use a long short-term memory (LSTM) layer, which is a type of RNN that can handle long-term dependencies and avoid the vanishing gradient problem. You will also use an embedding layer, which is a layer that maps the encoded tokens into a lower-dimensional vector space, and a linear layer, which is a layer that performs a linear transformation on the input.
  2. Define the loss function and the optimizer: you will use the binary cross-entropy loss, which is a loss function that measures the difference between the predicted and the true labels, and the Adam optimizer, which is an optimizer that adapts the learning rate for each parameter and performs well for various NLP tasks.
  3. Train the model: you will use the BucketIterator that you created in the previous section to iterate over the train and validation sets, and update the model parameters based on the loss and the optimizer.
  4. Evaluate the model: you will use the accuracy metric, which is a metric that measures the percentage of correct predictions, and the test set to evaluate the performance of the model on unseen data.

Let’s see how to implement these steps in PyTorch.
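Below is a minimal sketch of these four steps, assuming the fields and iterators from the previous sections; the hyperparameters (embedding size 100, hidden size 256, three epochs, and so on) are illustrative choices rather than tuned values:

import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    # Embedding -> LSTM -> Linear, producing one logit per review
    def __init__(self, vocab_size, embedding_dim, hidden_dim, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, text):                  # text: (batch, seq_len)
        embedded = self.embedding(text)       # (batch, seq_len, embedding_dim)
        _, (hidden, _) = self.lstm(embedded)  # hidden: (1, batch, hidden_dim)
        return self.fc(hidden.squeeze(0)).squeeze(1)  # (batch,) logits

pad_idx = TEXT.vocab.stoi[TEXT.pad_token]
model = LSTMClassifier(len(TEXT.vocab), 100, 256, pad_idx).to(device)
criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy on raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Train the model on the train set
for epoch in range(3):
    model.train()
    for batch in train_iterator:
        optimizer.zero_grad()
        loss = criterion(model(batch.text), batch.label)
        loss.backward()
        optimizer.step()

# Evaluate accuracy on the test set
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for batch in test_iterator:
        preds = (torch.sigmoid(model(batch.text)) > 0.5).float()
        correct += (preds == batch.label).sum().item()
        total += batch.label.size(0)
print(f"Test accuracy: {correct / total:.3f}")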

8. Conclusion

In this blog, you learned how to use PyTorch to work with text data for NLP tasks. You learned how to:

  • Load and explore text data using pandas and torch.utils.data.
  • Preprocess text data using regular expressions and string methods.
  • Tokenize text data using spaCy and torchtext.
  • Encode text data using torchtext and pretrained embeddings.
  • Use torchtext to simplify and automate the text data processing steps.
  • Build a text classifier using PyTorch, torchtext, and an LSTM model.

By following these steps, you were able to transform the raw text data into a suitable representation for the neural network, and build a simple text classifier that can predict the sentiment of a movie review. You also learned how to use torchtext, a PyTorch library that provides tools and datasets for text data processing.

PyTorch is a powerful and flexible framework that allows you to build and train various types of neural networks for different NLP tasks. Torchtext is a useful library that simplifies and automates the text data processing steps. Together, they can help you to create effective and efficient NLP models with less code and effort.

We hope you enjoyed this blog and learned something new and useful. If you have any questions or feedback, please leave a comment below. Thank you for reading!
