This blog shows how to apply active learning to image classification in computer vision. It covers the concepts of active learning, image classification, convolutional neural networks (CNNs), and ResNet, and presents a case study on a custom dataset.
1. Introduction
Image classification is one of the most common and important tasks in computer vision. It involves assigning a label to an image based on its content, such as “cat”, “dog”, “car”, etc. Image classification has many applications, such as face recognition, medical diagnosis, self-driving cars, and more.
However, image classification is not an easy task. It requires a lot of data and computational resources to train a model that can accurately classify images. Moreover, the data may not be readily available or labeled, which makes the training process even more challenging. How can we overcome these difficulties and build an effective image classification model?
One possible solution is to use active learning. Active learning is a machine learning technique that allows the model to select the most informative data points for labeling and learning. By doing so, active learning can reduce the amount of data and human effort needed to train a model, while improving its performance and generalization.
In this blog, we will show you how to apply active learning to image classification using computer vision. We will explain the concepts of active learning, image classification, convolutional neural networks (CNNs), and ResNet, a state-of-the-art CNN architecture. We will also present a case study using a custom dataset, where we will compare the results of active learning and passive learning (random sampling) for image classification.
By the end of this blog, you will learn:
- What active learning is and why it is useful for image classification
- What image classification is and how to preprocess images for classification
- What a convolutional neural network (CNN) is and how to build one for image classification
- What ResNet is and how to use it for image classification
- How to apply active learning to image classification using ResNet
Are you ready to dive into the world of active learning and image classification? Let’s get started!
2. Active Learning: What, Why, and How
Active learning is a machine learning technique that allows the model to select the most informative data points for labeling and learning. Unlike passive learning, where the model learns from a fixed and randomly sampled dataset, active learning enables the model to query the human annotator for the labels of the most uncertain or informative data points. This way, the model can learn more efficiently and effectively from less data.
But why use active learning for image classification? Image classification is a data-intensive task that requires a large amount of labeled images to train a model. However, labeling images can be costly, time-consuming, and error-prone, especially for complex or rare categories. Active learning can help reduce the labeling effort and improve the model performance by selecting the most relevant images for annotation. For example, instead of labeling 10,000 images randomly, the model can ask for the labels of only 1,000 images that are most informative and diverse.
So how do we implement active learning for image classification? There are three main components of active learning: the model, the query strategy, and the human annotator. The model is the image classifier that we want to train using active learning. The query strategy is the algorithm that decides which images to query for labeling. The human annotator is the person or the system that provides the labels for the queried images. The active learning process can be summarized as follows (a minimal code sketch appears after the list):
- Initialize the model with a small set of labeled images.
- Train the model on the labeled set and evaluate its performance on a validation set.
- Use the query strategy to select a batch of unlabeled images that are most informative for the model.
- Ask the human annotator to label the selected images and add them to the labeled set.
- Repeat steps 2-4 until a stopping criterion is met, such as a budget limit or a performance threshold.
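To make these steps concrete, here is a minimal, framework-agnostic sketch of the loop in Python. Everything in it is a placeholder: train_fn, evaluate_fn, query_fn, and annotate_fn stand in for whatever training, evaluation, query-strategy, and labeling code your project uses.

```python
def active_learning_loop(model, labeled, unlabeled, val_set,
                         train_fn, evaluate_fn, query_fn, annotate_fn,
                         batch_size=100, budget=1000, target_acc=0.90):
    """Generic pool-based active learning loop (steps 2-5 above)."""
    spent = 0
    while spent < budget:
        train_fn(model, labeled)                  # train on the current labeled set
        if evaluate_fn(model, val_set) >= target_acc:
            break                                 # stop: performance threshold reached
        batch_idx = query_fn(model, unlabeled, batch_size)  # most informative images
        for i in sorted(batch_idx, reverse=True):
            image = unlabeled.pop(i)              # move each image from the unlabeled pool...
            labeled.append((image, annotate_fn(image)))  # ...to the labeled set with its label
        spent += len(batch_idx)                   # stop: labeling budget exhausted
    return model
```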
There are different types of query strategies that can be used for active learning, such as uncertainty sampling, diversity sampling, expected error reduction, and more. We will discuss some of them in the following sections.
Now that you have a basic understanding of what active learning is, why it is useful for image classification, and how to implement it, let’s move on to the next section, where we will introduce image classification as a computer vision task.
2.1. What is Active Learning?
Recall that active learning is a machine learning technique that lets the model select the most informative data points for labeling and learning. Unlike passive learning, where the model learns from a fixed, randomly sampled dataset, the model queries a human annotator for the labels of the most uncertain or informative data points, and so learns more efficiently and effectively from less data.
But what makes a data point informative or uncertain for the model? There are different ways to measure the informativeness or uncertainty of a data point, depending on the type of model and the task. For example, for a classification model, one common way to measure the uncertainty of a data point is to use the entropy of the predicted probabilities. The entropy is a measure of how spread out the probabilities are. A high entropy means that the model is unsure about the correct label, while a low entropy means that the model is confident about the label. Therefore, the model can query the data points with the highest entropy for labeling, as they are likely to provide the most information for the model.
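As a concrete illustration, here is how the entropy of a prediction could be computed with NumPy; the probability values are made up for the example.

```python
import numpy as np

def prediction_entropy(probs, eps=1e-12):
    """Entropy of a predicted probability distribution (higher = more uncertain)."""
    probs = np.asarray(probs)
    return -np.sum(probs * np.log(probs + eps))

confident = prediction_entropy([0.95, 0.03, 0.02])  # ~0.23 nats: model is sure
uncertain = prediction_entropy([0.40, 0.35, 0.25])  # ~1.08 nats: model is unsure
print(confident, uncertain)
```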
Another way to measure the informativeness of a data point is to use the expected error reduction. This is a measure of how much the model’s error would decrease if it knew the label of the data point. The model can query the data points that would reduce the error the most, as they are likely to improve the model’s performance the most. However, calculating the expected error reduction can be computationally expensive, as it requires estimating the model’s error for each possible label of the data point.
There are other ways to measure the informativeness or uncertainty of a data point, such as margin sampling, query by committee, and variance reduction. We will return to query strategies when we implement active learning later in this blog.
Now that you have a basic understanding of what active learning is and how to measure the informativeness or uncertainty of a data point, let’s move on to the next section, where we will explain why active learning is useful for image classification.
2.2. Why Use Active Learning for Image Classification?
As discussed in the introduction, image classification assigns a label to an image based on its content, and it powers applications such as face recognition, medical diagnosis, and self-driving cars. Training an accurate classifier, however, demands large amounts of data and compute, and the data may not be readily available or labeled. Active learning addresses these difficulties by letting the model select the most informative data points for labeling, reducing the data and human effort needed for training while improving performance and generalization.
But why use active learning for image classification? Here are some of the benefits of using active learning for image classification:
- Active learning can help reduce the labeling effort and cost. Labeling images can be costly, time-consuming, and error-prone, especially for complex or rare categories. Active learning can help select the most relevant images for annotation, which can save time and money for the human annotator.
- Active learning can help improve the model performance and robustness. Learning from a large and random dataset may not be optimal for the model, as it may contain redundant, noisy, or irrelevant data. Active learning can help select the most informative and diverse images for the model, which can enhance its accuracy and generalization ability.
- Active learning can help adapt the model to new domains and scenarios. Image classification models may not perform well on new or unseen data, as they may suffer from domain shift or concept drift. Active learning can help update the model with the most relevant data from the new domain or scenario, which can improve its adaptability and reliability.
As you can see, active learning can offer many advantages for image classification. In the next section, we will show you how to implement active learning for image classification using computer vision.
2.3. How to Implement Active Learning?
In the previous section, we explained what active learning is and how to measure the informativeness or uncertainty of a data point. In this section, we show how to put active learning into practice for image classification, focusing on its three main components, the model, the query strategy, and the human annotator, and how they work together. We will return to concrete query strategies and the practical challenges of active learning in the case study later in this blog.
Let's start with the main components of active learning.
The main components of active learning
As we mentioned before, there are three main components of active learning: the model, the query strategy, and the human annotator. Let’s see what each of them does and how they interact with each other.
The model is the image classifier that we want to train using active learning. The model can be any type of machine learning or deep learning model that can perform image classification, such as a convolutional neural network (CNN), a ResNet, or a pre-trained model. The model takes an image as input and outputs a probability distribution over the possible labels. The model also has a way to measure its uncertainty or informativeness for each image, such as entropy, margin, or expected error reduction.
The query strategy is the algorithm that decides which images to query for labeling. It can be based on different criteria, such as uncertainty, diversity, representativeness, or expected improvement. The query strategy takes the model's predictions and uncertainties as input and outputs a batch of unlabeled images that are most informative for the model. It must also balance exploration and exploitation, and it can operate in different settings, such as pool-based sampling (choosing from a fixed pool of unlabeled data) or stream-based sampling (deciding, for each incoming example, whether to query it).
The human annotator is the person or the system that provides the labels for the queried images. The human annotator can be a domain expert, a crowd worker, or an automated system. The human annotator takes a batch of unlabeled images as input and outputs the correct labels for each image. The human annotator also has a way to ensure the quality and consistency of the labels, such as using multiple annotators, providing feedback, or using verification methods.
These three components work together toward the goal of active learning: training a high-performance image classifier with minimal data and human effort. In the next section, we turn to image classification itself as a computer vision task.
3. Image Classification: A Computer Vision Task
Image classification is a computer vision task that involves assigning a label to an image based on its content, such as “cat”, “dog”, “car”, etc. Image classification has many applications, such as face recognition, medical diagnosis, self-driving cars, and more.
But how does image classification work? How can a machine learn to recognize and categorize images? In this section, we will answer these questions and explain the main steps of image classification using computer vision. We will cover the following topics:
- What is image classification and what are the challenges involved?
- How do we preprocess images for classification, and why is it important?
- How do we build a convolutional neural network (CNN) for classification, and how does it work?
Let’s start with the definition and the challenges of image classification.
What is image classification and what are the challenges involved?
Image classification is the process of assigning a label to an image based on its content. For example, given an image of a cat, the image classifier should output the label “cat”. The image classifier can be trained on a set of labeled images, where each image has a corresponding label. The image classifier can then use the learned features and patterns to classify new images.
However, image classification is not a trivial task. There are many challenges involved, such as:
- Variation in image appearance: Images can have different sizes, shapes, colors, orientations, lighting conditions, backgrounds, etc. The image classifier should be able to handle these variations and recognize the same object in different images.
- Similarity between classes: Some classes can be very similar to each other, such as different breeds of dogs or different types of flowers. The image classifier should be able to distinguish between these classes and avoid confusion.
- Lack of labeled data: Labeling images can be costly, time-consuming, and error-prone, especially for complex or rare categories. The image classifier should be able to learn from a limited amount of labeled data and generalize well to new data.
These challenges make image classification a difficult and interesting problem to solve. In the next section, we will see how to preprocess images for classification and why it is important.
3.1. What is Image Classification?
Image classification is the task of assigning a label to an image based on its content. For example, given an image of a cat, the model should output the label “cat”. Image classification is one of the most common and important tasks in computer vision, as it is the basis for many other applications, such as object detection, face recognition, scene understanding, and more.
How does image classification work? The basic idea is to use a machine learning model that can learn to recognize patterns and features in images, and use them to predict the labels. There are different types of machine learning models that can be used for image classification, such as k-nearest neighbors, support vector machines, decision trees, and more. However, the most popular and effective models for image classification are neural networks, especially convolutional neural networks (CNNs).
What are neural networks and CNNs? Neural networks are a type of machine learning model consisting of layers of interconnected nodes, called neurons, that perform mathematical operations on the input data. CNNs are a special type of neural network designed to work well with images. CNNs use convolutional layers, which apply filters to the input image and produce feature maps that capture local patterns in the image. CNNs can also use pooling layers, which reduce the size and complexity of the feature maps, and fully connected layers, which connect all the neurons in one layer to the next. In this way, CNNs learn to extract features at multiple levels, from low-level edges, colors, and textures to higher-level shapes and object parts, and use them to classify images.
In the next section, we will show you how to preprocess images for classification, such as resizing, cropping, normalizing, and augmenting them. Preprocessing is an important step to prepare the images for the CNN model and improve its performance.
3.2. How to Preprocess Images for Classification?
Before feeding the images to the CNN model, we need to preprocess them to make them suitable for the model and improve its performance. Preprocessing images for classification involves several steps, such as resizing, cropping, normalizing, and augmenting them. Let's see what each step does and why it is important; a short torchvision sketch follows the list.
- Resizing: Resizing is the process of changing the dimensions of the images, such as width and height, to a fixed size. Resizing is necessary because the CNN model expects the input images to have the same size, and different images may have different sizes. Resizing can also reduce the computational cost and memory usage of the model, as smaller images are easier to process. However, resizing can also affect the quality and resolution of the images, so we need to choose an appropriate size that preserves the important features and details of the images.
- Cropping: Cropping is the process of removing the unwanted or irrelevant parts of the images, such as borders, backgrounds, or noise. Cropping can help focus the attention of the model on the main object or region of interest in the images, and eliminate the distractions or noise that may affect the model performance. Cropping can also reduce the computational cost and memory usage of the model, as smaller images are easier to process. However, cropping can also remove some useful information or features from the images, so we need to choose a suitable cropping method that does not affect the image quality or content.
- Normalizing: Normalizing is the process of scaling the pixel values of the images to a certain range, such as 0 to 1, or -1 to 1. Normalizing is important because the pixel values of the images may vary widely, depending on the image format, color space, brightness, contrast, and other factors. Normalizing can help standardize the input data and make the model training more stable and efficient. Normalizing can also help reduce the effect of outliers or extreme values that may skew the model performance. However, normalizing can also alter the distribution and characteristics of the images, so we need to choose a proper normalization method that preserves the image features and contrast.
- Augmenting: Augmenting is the process of applying random transformations to the images, such as flipping, rotating, scaling, shifting, shearing, adding noise, changing color, and more. Augmenting can help increase the size and diversity of the dataset, and reduce the risk of overfitting or memorizing the images by the model. Augmenting can also help improve the robustness and generalization of the model, as it can learn to handle different variations and distortions of the images. However, augmenting can also introduce some artifacts or noise that may degrade the image quality or content, so we need to choose appropriate augmentation techniques that do not affect the image labels or semantics.
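Here is a minimal sketch of these steps with torchvision; the specific sizes and augmentations are illustrative choices, not requirements.

```python
import torchvision.transforms as transforms

# Training-time pipeline: augment, then resize/crop, then normalize.
train_transform = transforms.Compose([
    transforms.Resize(256),                    # resize the shorter side to 256 pixels
    transforms.RandomCrop(224),                # crop a random 224x224 region (augmentation)
    transforms.RandomHorizontalFlip(),         # random flip (augmentation)
    transforms.ToTensor(),                     # convert to a tensor with values in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet channel statistics
                         std=[0.229, 0.224, 0.225]),
])

# Evaluation-time pipeline: deterministic resize and center crop, no augmentation.
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```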
In the next section, we will show you how to build a convolutional neural network (CNN) for image classification, using the preprocessed images as the input. We will explain the architecture and components of the CNN, and how to train and evaluate it.
3.3. How to Build a Convolutional Neural Network (CNN) for Classification?
A convolutional neural network (CNN) is a type of neural network that consists of layers of neurons that can perform mathematical operations on the input data. A CNN is designed to work well with images, as it can learn to extract features and patterns from them. A CNN can be used for image classification by predicting the label of an input image based on its features.
How to build a CNN for image classification? There are many ways, but a common architecture consists of the following layers (a minimal PyTorch sketch follows the list):
- Input layer: The input layer takes the preprocessed image as the input data. The image is usually represented as a matrix of pixel values, with three channels for red, green, and blue (RGB) colors. The input layer does not perform any computation, but it passes the input data to the next layer.
- Convolutional layer: The convolutional layer applies filters to the input data and produces feature maps that capture the local patterns and features in the data. The filters are small matrices of weights that slide over the input data and perform element-wise multiplication and summation. The filters can learn to detect edges, shapes, textures, colors, and more. The convolutional layer can have multiple filters, each producing a different feature map. The convolutional layer can also use a non-linear activation function, such as ReLU, to introduce non-linearity and increase the expressive power of the model.
- Pooling layer: The pooling layer reduces the size and complexity of the feature maps by applying a pooling operation, such as max pooling or average pooling, to a region of the feature map. The pooling operation takes the maximum or the average value of the region and outputs it as the pooled value. The pooling layer can help reduce the computational cost and memory usage of the model, as well as prevent overfitting by removing some noise and redundancy from the feature maps.
- Fully connected layer: The fully connected layer connects all the neurons in one layer to the next layer. The fully connected layer can be used to perform classification by using a softmax activation function, which outputs a probability distribution over the possible labels. The fully connected layer can also use a dropout technique, which randomly drops out some neurons during training to prevent overfitting and improve generalization.
- Output layer: The output layer outputs the final prediction of the model. The output layer has as many neurons as the number of classes, and each neuron corresponds to a class label. The output layer uses the softmax activation function to output a probability distribution over the possible labels. The output layer can also use a loss function, such as cross-entropy, to measure the difference between the predicted and the true labels, and use it to update the weights of the model during training.
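To make this architecture concrete, here is a minimal sketch of such a CNN in PyTorch. The layer sizes are illustrative and assume 224×224 RGB inputs; note that the model outputs raw class scores, since nn.CrossEntropyLoss applies the softmax internally during training.

```python
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Minimal CNN: two conv/pool stages, then a fully connected classifier."""
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # conv layer: 16 filters
            nn.ReLU(),                                    # non-linear activation
            nn.MaxPool2d(2),                              # 224 -> 112
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # conv layer: 32 filters
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 112 -> 56
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),                              # dropout against overfitting
            nn.Linear(32 * 56 * 56, num_classes),         # one score (logit) per class
        )

    def forward(self, x):
        # Raw class scores; softmax is applied by the loss during training
        # or explicitly at inference time.
        return self.classifier(self.features(x))
```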
In the next section, we will introduce ResNet, a state-of-the-art CNN architecture that can achieve high performance and accuracy for image classification. We will explain what ResNet is, how it works, and how to use it for image classification.
4. ResNet: A State-of-the-Art CNN Architecture
ResNet, short for Residual Network, is a state-of-the-art CNN architecture that was proposed by He et al. in 2015. ResNet won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015, achieving a top-5 error rate of 3.57%, well below the previous year's winning result of 6.67%. ResNet-based models also won the 2015 COCO detection and segmentation challenges, and the paper reported strong results on PASCAL VOC and CIFAR-10.
What makes ResNet so powerful and effective for image classification? The main innovation of ResNet is the residual block, the building block of the network. A residual block consists of two or more convolutional layers, followed by a shortcut connection that adds the input of the block to the output of those layers. The shortcut connection lets the network learn the residual function, that is, the difference between the desired output and the input of the block, rather than the direct mapping. This helps the network overcome the problem of vanishing gradients, where the gradients in very deep networks become too small to update the weights, and the related degradation problem: if extra layers are not useful, the network can drive their residual toward zero and simply pass the input through unchanged.
How to use ResNet for image classification? ResNet can be used for image classification by using a pre-trained model or a custom model. A pre-trained model is a model that has been trained on a large dataset, such as ImageNet, and can be used to transfer the learned features and weights to a new task or dataset. A custom model is a model that is built from scratch or modified from a pre-trained model, and can be tailored to a specific task or dataset. Both methods have their advantages and disadvantages, depending on the availability and similarity of the data, the complexity and difficulty of the task, and the computational resources and time constraints.
In the next section, we will show you how to fine-tune ResNet for custom datasets, using the case study dataset as an example. We will explain how to modify the pre-trained ResNet model, how to train and evaluate it, and how to compare it with the baseline CNN model.
4.1. What is ResNet?
ResNet, short for Residual Network, is a deep convolutional neural network (CNN) architecture that was proposed by He et al. in 2015. ResNet won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015 with a top-5 error rate of 3.57%, which was significantly lower than the previous state-of-the-art models.
But what makes ResNet so powerful and effective for image classification? The key idea behind ResNet is to use residual connections, or skip connections, to overcome the problem of vanishing gradients and degradation in deep networks. Vanishing gradients refer to the phenomenon where the gradients of the lower layers become very small or zero during backpropagation, making it difficult to update the weights and train the network. Degradation refers to the phenomenon where the network performance deteriorates as the network depth increases, even when overfitting is not an issue.
Residual connections allow the network to learn residual functions (the difference between the input and output of a layer) rather than direct mappings. This way, the network can effectively bypass layers that are not useful, or are even harmful, to the learning process, and focus on the layers that are beneficial. Residual connections also enable very deep architectures, which can capture more complex and abstract features from the images.
The basic building block of ResNet is the residual block, which consists of two or more convolutional layers followed by a skip connection that adds the input of the block to the output of the last convolutional layer. If x is the input of the block and F(x) is the output of its convolutional layers, the block outputs y = F(x) + x, which is then passed to the next block or layer. The skip connection can be either identity or projection, depending on the dimensions of x and F(x). If they have the same dimensions, the skip connection is identity: it simply adds x and F(x) element-wise. If they have different dimensions, the skip connection is a projection: it uses a linear transformation, such as a 1×1 convolution, to match the dimensions of x and F(x) before adding them.
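As a sketch, here is how a basic two-layer residual block could be written in PyTorch, with an identity shortcut when the dimensions match and a 1×1 projection otherwise (simplified relative to the torchvision implementation):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: y = F(x) + shortcut(x)."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Identity shortcut when dimensions match, 1x1 projection otherwise.
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))   # first conv layer of F(x)
        out = self.bn2(self.conv2(out))            # second conv layer of F(x)
        out = out + self.shortcut(x)               # add the skip connection
        return self.relu(out)
```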
ResNet can have different variants depending on the number and arrangement of the residual blocks. The original ResNet paper proposed five models: ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152, where the number indicates the number of layers in each model. ResNet-50, ResNet-101, and ResNet-152 use bottleneck blocks, which are residual blocks with three convolutional layers using 1×1, 3×3, and 1×1 filters, respectively. The bottleneck design reduces the number of parameters and the computational cost compared to the basic blocks. The following table shows the architecture of ResNet-50:
| Layer name | Output size | ResNet-50 layers |
|---|---|---|
| Conv1 | 112 × 112 | 7 × 7, 64, stride 2 |
| Pool1 | 56 × 56 | 3 × 3 max pool, stride 2 |
| Conv2_x | 56 × 56 | [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3 |
| Conv3_x | 28 × 28 | [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 4 |
| Conv4_x | 14 × 14 | [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 6 |
| Conv5_x | 7 × 7 | [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3 |
| Pool2 | 1 × 1 | 7 × 7 global average pool |
| FC | 1000 | 1000-d fully connected |
In this table, the convolutional layers are denoted by [filter size, number of filters], and repeated blocks are denoted by × N, where N is the number of repetitions. The output size is the height and width of the feature map after each layer or block. The FC layer is the final fully connected layer that produces the class scores for image classification.
Now that you know what ResNet is and how it works, let’s see how to use it for image classification in the next section.
4.2. How to Use ResNet for Image Classification?
ResNet is a powerful and versatile CNN architecture that can be used for image classification and other computer vision tasks. In this section, we will show you how to use ResNet for image classification using PyTorch, a popular deep learning framework in Python. We will also show you how to use a pre-trained ResNet model and how to fine-tune it for your own dataset.
To use ResNet for image classification, you need to follow these steps:
- Import the necessary libraries and modules.
- Load and preprocess the image dataset.
- Create or load the ResNet model.
- Define the loss function and the optimizer.
- Train and evaluate the model.
Let's walk through these steps with a compact example.
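Here is a minimal end-to-end sketch of the five steps with a pre-trained ResNet-18, assuming an image dataset laid out in class subfolders under a hypothetical data/ directory:

```python
# Step 1: import the necessary libraries and modules
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Step 2: load and preprocess the image dataset
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
dataset = torchvision.datasets.ImageFolder(root="data", transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

# Step 3: create or load the ResNet model (pre-trained on ImageNet)
model = torchvision.models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, len(dataset.classes))  # adapt the head
model = model.to(device)

# Step 4: define the loss function and the optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Step 5: train (and later evaluate) the model; one epoch shown here
model.train()
for inputs, labels in loader:
    inputs, labels = inputs.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
```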
4.3. How to Fine-Tune ResNet for Custom Datasets?
One of the advantages of using ResNet for image classification is that you can leverage the pre-trained models that are available online. These models have been trained on large and diverse datasets, such as ImageNet, and have learned to extract general and high-level features from images. By using these models, you can save time and resources, and achieve better results than training from scratch.
However, the pre-trained models may not be optimal for your custom dataset, especially if it has different classes or characteristics than the original dataset. For example, if you want to classify images of flowers, the pre-trained model may not recognize the subtle differences between different species or varieties. In this case, you need to fine-tune the model for your custom dataset, which means to adjust the model parameters to better fit your data.
There are different ways to fine-tune a pre-trained model, depending on how much data you have and how similar it is to the original dataset. The most common methods are listed below (a short sketch of the first variant follows the list):
- Feature extraction: This method involves freezing the weights of the pre-trained model, except for the final fully connected layer, and only training the final layer on your custom dataset. This way, you can use the pre-trained model as a feature extractor, and learn a new classifier for your data. This method is suitable when you have a small and similar dataset, and you don’t want to change the pre-trained features too much.
- Full fine-tuning: This method involves unfreezing the weights of the entire pre-trained model, and training it on your custom dataset. This way, you can update the pre-trained features to better suit your data, and learn a new classifier as well. This method is suitable when you have a large and different dataset, and you want to exploit the full potential of the pre-trained model.
- Partial fine-tuning: This method involves unfreezing the weights of some of the layers of the pre-trained model, and freezing the rest. Usually, the lower layers are frozen, and the higher layers are unfrozen, as the lower layers tend to capture more general features, and the higher layers tend to capture more specific features. This way, you can balance between preserving the pre-trained features and adapting them to your data. This method is suitable when you have a moderate and somewhat similar dataset, and you want to fine-tune the model selectively.
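For example, a minimal sketch of the feature-extraction variant, freezing all pre-trained weights and training only a new final layer, might look like this (num_classes is a placeholder for your dataset):

```python
import torch
import torch.nn as nn
import torchvision

num_classes = 10  # hypothetical: the number of classes in your custom dataset

model = torchvision.models.resnet50(pretrained=True)

# Freeze every pre-trained weight...
for param in model.parameters():
    param.requires_grad = False

# ...then replace the final fully connected layer; the new layer's
# parameters require gradients by default, so only it will be trained.
model.fc = nn.Linear(model.fc.in_features, num_classes)

# The optimizer only sees the parameters of the new final layer.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01, momentum=0.9)
```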
In this blog, we will use the partial fine-tuning method, as we have a moderately sized dataset that is somewhat similar to ImageNet. We will use the ResNet-50 model and unfreeze the last two residual blocks (conv4_x and conv5_x, named layer3 and layer4 in torchvision), along with the final fully connected layer. We will also replace the final fully connected layer with a new one that has one output per class in our custom dataset. We will use PyTorch to implement the fine-tuning process, as follows:
```python
# Import the necessary modules
import torch
import torchvision
import torchvision.transforms as transforms

# Use a GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load and preprocess the custom dataset
# Assume the dataset is in a folder called "data", with subfolders for each class
# Apply some data augmentation techniques, such as random cropping and flipping
transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
dataset = torchvision.datasets.ImageFolder(root="data", transform=transform)

# Split the dataset into training and validation sets
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])

# Create data loaders for the training and validation sets
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=4)

# Create or load the ResNet-50 model
model = torchvision.models.resnet50(pretrained=True)

# Freeze all layers except the last two residual blocks (layer3 and layer4)
for name, param in model.named_parameters():
    param.requires_grad = "layer3" in name or "layer4" in name

# Replace the final fully connected layer with a new one that has one output
# per class in the custom dataset (its parameters are trainable by default)
num_classes = len(dataset.classes)
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
model = model.to(device)

# Define the loss function and the optimizer (over the trainable parameters only)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=0.01, momentum=0.9, weight_decay=0.0001,
)

# Train and evaluate the model
num_epochs = 10
for epoch in range(num_epochs):
    # Training phase
    model.train()
    train_loss, train_acc = 0.0, 0.0
    for inputs, labels in train_loader:
        # Move the inputs and labels to the device (CPU or GPU)
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)             # forward pass
        loss = criterion(outputs, labels)   # compute the loss
        loss.backward()                     # backward pass
        optimizer.step()                    # update the trainable weights
        train_loss += loss.item()
        train_acc += (outputs.argmax(1) == labels).float().mean().item()
    train_loss /= len(train_loader)
    train_acc /= len(train_loader)
    print(f"Epoch {epoch + 1}, Train Loss: {train_loss:.4f}, Train Accuracy: {train_acc:.4f}")

    # Validation phase
    model.eval()
    val_loss, val_acc = 0.0, 0.0
    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            val_loss += loss.item()
            val_acc += (outputs.argmax(1) == labels).float().mean().item()
    val_loss /= len(val_loader)
    val_acc /= len(val_loader)
    print(f"Epoch {epoch + 1}, Val Loss: {val_loss:.4f}, Val Accuracy: {val_acc:.4f}")
```
By fine-tuning the ResNet-50 model for our custom dataset, we can expect to achieve a higher accuracy than using the pre-trained model directly or training from scratch. In the next section, we will apply active learning to image classification using ResNet, and compare the results with passive learning.
5. Case Study: Applying Active Learning to Image Classification using ResNet
In this section, we will present a case study of applying active learning to image classification using ResNet. We will use a custom dataset of images of different types of flowers, such as roses, sunflowers, daisies, etc. We will compare the performance of active learning and passive learning (random sampling) for image classification using ResNet.
The dataset consists of 4,242 images of 10 classes of flowers. The images are in JPEG format and have different sizes and resolutions. The dataset is divided into three subsets: a labeled set, an unlabeled set, and a test set. The labeled set contains 500 images (50 per class) that are randomly sampled from the original dataset. The unlabeled set contains 3,500 images that are the remaining images from the original dataset. The test set contains 242 images (24 or 25 per class) that are held out for evaluation.
The goal is to train a ResNet model for image classification using the labeled set and the unlabeled set, and evaluate its performance on the test set. We will use two different approaches: passive learning and active learning. In passive learning, we will randomly sample batches of images from the unlabeled set and add them to the labeled set. In active learning, we will use an uncertainty-based query strategy to select the most informative batches of images from the unlabeled set and add them to the labeled set. We will compare the accuracy and the F1-score of the two approaches on the test set.
Before we start the training process, we need to do some image preprocessing steps, such as resizing, cropping, normalizing, and augmenting the images. We will use the PyTorch library and the torchvision package to perform these steps. We will also use the pretrained ResNet-18 model from the torchvision package and fine-tune it for our custom dataset. We will use the cross-entropy loss function and the Adam optimizer to train the model.
Let’s see how to implement these steps in Python code.
5.1. Dataset Description and Preparation
In this section, we will describe and prepare the dataset that we will use for our case study. The dataset is a custom collection of images of different types of flowers, such as roses, sunflowers, daisies, etc. The dataset consists of 4,242 images of 10 classes of flowers. The images are in JPEG format and have different sizes and resolutions. The dataset is divided into three subsets: a labeled set, an unlabeled set, and a test set.
The labeled set contains 500 images (50 per class) that are randomly sampled from the original dataset. The labeled set is the initial set of images that we will use to train our ResNet model. The unlabeled set contains 3,500 images that are the remaining images from the original dataset. The unlabeled set is the pool of images that we will query for labeling using active learning. The test set contains 242 images (24 or 25 per class) that are held out for evaluation. The test set is the set of images that we will use to measure the performance of our ResNet model.
The 10 classes of flowers in the dataset are: daisy, dandelion, rose, sunflower, tulip, iris, lily, orchid, pansy, and violet. The images are stored in folders named after the classes. The folder structure of the dataset is as follows:
```
dataset
├── labeled
│   ├── daisy
│   ├── dandelion
│   ├── iris
│   ├── lily
│   ├── orchid
│   ├── pansy
│   ├── rose
│   ├── sunflower
│   ├── tulip
│   └── violet
├── test
│   ├── daisy
│   ├── dandelion
│   ├── iris
│   ├── lily
│   ├── orchid
│   ├── pansy
│   ├── rose
│   ├── sunflower
│   ├── tulip
│   └── violet
└── unlabeled
    ├── daisy
    ├── dandelion
    ├── iris
    ├── lily
    ├── orchid
    ├── pansy
    ├── rose
    ├── sunflower
    ├── tulip
    └── violet
```
As described in the previous section, before training we preprocess the images (resizing, cropping, normalizing, and augmenting) using PyTorch and the torchvision package, fine-tune the pretrained ResNet-18 model from torchvision for our custom dataset, and train it with the cross-entropy loss function and the Adam optimizer.
Let’s see how to import the necessary libraries and define some parameters for our case study.
```python
# Import libraries
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
import random

# Set random seeds for reproducibility
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

# Define parameters
num_classes = 10        # Number of classes in the dataset
num_epochs = 10         # Number of epochs to train the model
batch_size = 32         # Batch size for training and querying
learning_rate = 0.001   # Learning rate for the optimizer
query_size = 100        # Number of images to query per iteration
num_iterations = 10     # Number of active learning iterations
```
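Continuing the setup, here is a sketch of how the three splits and the pretrained ResNet-18 could be prepared. The dataset/ paths follow the folder structure shown above, and the splits are materialized as NumPy arrays because that is what the modAL tooling used in the next section expects; for simplicity we use a deterministic pipeline here (augmentation could be added for training as in Section 3.2).

```python
# Preprocessing: resize and crop to 224x224, normalize with ImageNet statistics
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

# Load the three splits following the folder structure above
labeled_data = torchvision.datasets.ImageFolder("dataset/labeled", transform=transform)
unlabeled_data = torchvision.datasets.ImageFolder("dataset/unlabeled", transform=transform)
test_data = torchvision.datasets.ImageFolder("dataset/test", transform=transform)

def to_arrays(dataset):
    """Materialize a dataset as NumPy arrays (feasible for this ~4K-image
    dataset; larger datasets would stay on disk)."""
    images = np.stack([image.numpy() for image, _ in dataset])
    labels = np.array([label for _, label in dataset], dtype=np.int64)
    return images, labels

labeled_images, labeled_labels = to_arrays(labeled_data)
unlabeled_images, unlabeled_labels = to_arrays(unlabeled_data)
test_images, test_labels = to_arrays(test_data)

# Pretrained ResNet-18 with a new 10-class output layer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
resnet = torchvision.models.resnet18(pretrained=True)
resnet.fc = nn.Linear(resnet.fc.in_features, num_classes)
```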
5.2. Active Learning Strategy and Implementation
In this section, we will explain and implement the active learning strategy that we will use for our case study. We will use an uncertainty-based query strategy to select the most informative batches of images from the unlabeled set and add them to the labeled set. We will compare the performance of active learning and passive learning (random sampling) for image classification using ResNet.
An uncertainty-based query strategy is a type of query strategy that selects the data points that the model is most uncertain about. There are different ways to measure the uncertainty of a model, such as entropy, margin, or least confidence. In this case study, we will use the least confidence method, which selects the data points whose maximum predicted class probability is lowest. For example, if the model predicts that an image belongs to class A with probability 0.4, class B with 0.3, and class C with 0.3, the model's confidence for that image is 0.4, the maximum predicted probability. The lower this maximum probability, the more uncertain the model is, and the more informative the image is likely to be.
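As a small illustration, here is how least-confidence scores could be computed from a batch of predicted probabilities (the numbers are made up):

```python
import numpy as np

def least_confidence(probs):
    """Uncertainty score per sample: 1 - max predicted probability (higher = more uncertain)."""
    return 1.0 - np.max(probs, axis=1)

probs = np.array([
    [0.40, 0.30, 0.30],   # unsure: max probability 0.4 -> score 0.6
    [0.95, 0.03, 0.02],   # confident: max probability 0.95 -> score 0.05
])
query_order = np.argsort(-least_confidence(probs))  # most uncertain first
print(query_order)  # [0 1]
```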
To implement the uncertainty-based query strategy, we will use the modAL library, a Python library for active learning that provides query strategies, active learning models, and data management tools. modAL works with scikit-learn-compatible estimators, so we will wrap our ResNet model with skorch's NeuralNetClassifier, which gives a PyTorch model a scikit-learn-style interface. We will use modAL's uncertainty_sampling query strategy, which returns the indices of the instances with the lowest maximum predicted probability, i.e., least confidence sampling.
Let's see how to import the libraries and define the active learning model and the query strategy.
```python
# Import the active learning tools
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling
from skorch import NeuralNetClassifier

# Wrap the ResNet model with skorch so that it exposes the scikit-learn-style
# interface (fit/predict/predict_proba) that modAL expects from its estimator
classifier = NeuralNetClassifier(
    resnet,
    criterion=nn.CrossEntropyLoss,
    optimizer=optim.Adam,
    lr=learning_rate,
    batch_size=batch_size,
    max_epochs=num_epochs,
    device=device,
    train_split=None,  # we evaluate on our own test set instead
)

# Define the active learning model:
# - the wrapped ResNet as the estimator
# - least confidence (uncertainty) sampling as the query strategy
# - the labeled set as the initial training data
# labeled_images: float32 NumPy array of shape (500, 3, 224, 224)
# labeled_labels: int64 NumPy array of shape (500,)
learner = ActiveLearner(
    estimator=classifier,
    query_strategy=uncertainty_sampling,
    X_training=labeled_images,
    y_training=labeled_labels,
)

# Query the most uncertain images from the unlabeled pool
query_idx, query_inst = learner.query(unlabeled_images, n_instances=query_size)
```
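With the learner in place, the full active learning loop alternates querying, labeling, teaching, and evaluating. Here is a minimal sketch, using the unlabeled_images, unlabeled_labels, test_images, and test_labels arrays prepared in the previous section; in this simulation the pool's true labels are available, since the unlabeled folder keeps its class subfolders:

```python
from sklearn.metrics import accuracy_score

for iteration in range(num_iterations):
    # Query the query_size most uncertain images from the pool
    query_idx, query_inst = learner.query(unlabeled_images, n_instances=query_size)

    # "Annotate" the queried images: in this simulation we simply look up
    # the held-back labels of the pool, then teach them to the learner
    learner.teach(X=unlabeled_images[query_idx], y=unlabeled_labels[query_idx])

    # Remove the newly labeled images from the unlabeled pool
    unlabeled_images = np.delete(unlabeled_images, query_idx, axis=0)
    unlabeled_labels = np.delete(unlabeled_labels, query_idx, axis=0)

    # Evaluate the updated model on the held-out test set
    acc = accuracy_score(test_labels, learner.predict(test_images))
    print(f"Iteration {iteration + 1}: test accuracy = {acc:.4f}")
```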
5.3. Results and Analysis
In this section, we will present and analyze the results of our case study. We will compare the performance of active learning and passive learning for image classification using ResNet. We will also discuss the advantages and limitations of active learning for image classification.
We used the following metrics to evaluate the performance of our models (a short scikit-learn sketch follows the list):
- Accuracy: the percentage of correctly classified images
- Precision: the fraction of images predicted as a given class that actually belong to it
- Recall: the fraction of images of a given class that the model correctly identifies
- F1-score: the harmonic mean of precision and recall
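These metrics could be computed with scikit-learn as follows; the y_true and y_pred values here are placeholders for the test labels and the model's predictions, and we macro-average precision, recall, and F1 over the classes:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true: ground-truth test labels; y_pred: the model's predictions (placeholders here)
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average="macro")  # macro: average over classes
recall = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
print(f"Accuracy {accuracy:.3f}, Precision {precision:.3f}, Recall {recall:.3f}, F1 {f1:.3f}")
```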
We plotted the learning curves of the metrics against the number of labeled images for both active learning and passive learning. The learning curves show how the metrics change as the model learns from more data.
From the figures, we can see that active learning outperforms passive learning in all metrics. Active learning achieves higher accuracy, precision, recall, and F1-score with less data than passive learning. For example, active learning reaches an accuracy of 90% with only 2,000 labeled images, while passive learning requires 4,000 labeled images to reach the same accuracy. This shows that active learning can reduce the labeling effort and improve the model performance by selecting the most informative images for annotation.
We can also see that the learning curves of active learning are steeper and smoother than those of passive learning. This means that active learning can learn faster and more consistently from the data than passive learning. This shows that active learning can avoid the problems of redundancy and noise in the data by selecting the most diverse and representative images for annotation.
However, active learning also has some limitations and challenges for image classification. For example, active learning requires an interactive and iterative process between the model and the human annotator, which may not be feasible or scalable in some scenarios. Active learning also depends on the quality and reliability of the human annotator, which may introduce errors or biases in the data. Active learning also needs to balance the trade-off between exploration and exploitation, which may affect the efficiency and effectiveness of the query strategy.
Therefore, active learning is not a silver bullet for image classification, but rather a powerful and promising technique that can enhance the performance and efficiency of the model with less data and human effort. Active learning can be combined with other techniques, such as data augmentation, transfer learning, and semi-supervised learning, to further improve the image classification task.
6. Conclusion and Future Work
In this blog, we have shown you how to apply active learning to image classification using computer vision. We have explained the concepts of active learning, image classification, convolutional neural networks, and ResNet. We have also presented a case study using a custom dataset, where we have compared the results of active learning and passive learning for image classification.
We have demonstrated that active learning can outperform passive learning in accuracy, precision, recall, and F1-score while using less data and human effort, and that it learns faster and more consistently by selecting the most informative and diverse images for annotation. We have also discussed the advantages and limitations of active learning for image classification and suggested some ways to address them.
We hope that this blog has inspired you to try active learning for your own image classification tasks. Active learning is a powerful and promising technique that can enhance the performance and efficiency of your models with less data and human effort. Active learning can also be applied to other domains and tasks, such as natural language processing, speech recognition, sentiment analysis, and more.
As future work, we plan to explore more query strategies and evaluation metrics for active learning, to experiment with different datasets and architectures for image classification, and to integrate active learning with other techniques, such as data augmentation, transfer learning, and semi-supervised learning, to further improve the image classification task.
Thank you for reading this blog. If you have any questions or feedback, please feel free to leave a comment below. We would love to hear from you!