Active Learning with Python: A Simple Example

This blog post shows how to implement active learning in Python using the scikit-learn and modAL libraries. It demonstrates a pool-based sampling approach with logistic regression on a synthetic dataset.

1. Introduction

In this tutorial, you will learn how to implement active learning with Python using two popular libraries: scikit-learn and modAL. Active learning is a machine learning technique that allows you to train a model with less labeled data by selecting the most informative samples for labeling. This can save you time and resources, especially when labeling data is costly or difficult.

You will use a pool-based sampling approach, where you have a large pool of unlabeled data and a smaller set of labeled data. You will use a logistic regression model to classify the data into two classes, and a query strategy to select the most uncertain samples from the pool. You will then label those samples and add them to the training set, and repeat this process until you reach a desired performance or budget.

You will work with a synthetic dataset that has two features and two classes, and visualize the active learning process. You will also compare the performance of your active learning model with a passive learning model that randomly selects samples from the pool.

By the end of this tutorial, you will be able to:

  • Use scikit-learn and modAL to implement active learning with Python
  • Define a model and a query strategy for pool-based sampling
  • Initialize a learner and a simulated oracle (labeler) for active learning
  • Run an active learning loop and visualize the results
  • Evaluate the performance of your active learning model

Ready to get started? Let’s dive in!

2. What is Active Learning and Why Use It?

Active learning is a machine learning technique that involves selecting the most informative samples from a large pool of unlabeled data and querying an oracle (such as a human expert) for their labels. The labeled samples are then added to the training set and used to update the model. This process is repeated until a desired performance or budget is reached.

But why use active learning instead of just labeling all the data and training the model on it? There are several reasons why active learning can be beneficial, such as:

  • Reducing the labeling cost and effort: Labeling data can be expensive, time-consuming, or impractical, especially for domains that require expert knowledge or complex annotations. Active learning can help you select the most relevant samples that will improve the model the most, and avoid wasting resources on labeling redundant or irrelevant data.
  • Improving the model performance: Active learning can help you achieve a higher accuracy or a lower error rate with less labeled data, compared to passive learning that randomly selects samples from the pool. This can be useful when you have a limited budget or a small amount of labeled data available.
  • Handling data imbalance or diversity: When the data distribution is skewed or heterogeneous and some classes or regions are underrepresented or unknown, active learning helps you discover and label the rare or novel samples that matter most for the model.

Active learning can be applied to various machine learning tasks, such as classification, regression, clustering, or natural language processing. However, in this tutorial, you will focus on a binary classification problem, where you want to predict the class label of a data point based on its features.

How do you select the most informative samples for labeling? What are the different types of active learning strategies? How do you implement them with Python? You will answer these questions in the next section.

3. Active Learning Workflow with Python

In this section, you will see how to implement active learning with Python using two libraries: scikit-learn and modAL. Scikit-learn is a widely used library for machine learning that provides various tools and algorithms for data analysis and modeling. ModAL is a modular active learning framework that builds on scikit-learn and allows you to easily create and customize your own active learning workflows.

The general workflow of active learning with Python is as follows:

  1. Load and preprocess the data: You will use a synthetic dataset that has two features and two classes, and split it into a labeled set and an unlabeled pool.
  2. Define the model and the query strategy: You will use a logistic regression model to classify the data, and a query strategy to select the most uncertain samples from the pool.
  3. Initialize the learner and simulate the oracle: You will use modAL to create a learner object that combines the model and the query strategy, and keep the ground-truth labels of the pool aside to stand in for the oracle.
  4. Run the active learning loop: You will iteratively query the labeler for the labels of the selected samples, add them to the training set, and update the model.
  5. Evaluate the performance and visualize the results: You will compare the accuracy and the learning curve of your active learning model with a passive learning model that randomly selects samples from the pool.

Before you start, you will need to install scikit-learn and modAL. The modAL package is imported as modAL but published on PyPI under the name modAL-python. You can install both libraries by running the following commands in your terminal:

pip install scikit-learn
pip install modAL-python

Alternatively, you can use Google Colab or Jupyter Notebook to run the code in this tutorial. You can also find the complete code on GitHub.

Now that you have the libraries installed, let’s begin with the first step: loading and preprocessing the data.

3.1. Load and Preprocess the Data

The first step of active learning with Python is to load and preprocess the data. You will use a synthetic dataset that has two features and two classes, and visualize it using matplotlib. The dataset is generated using the make_classification function from scikit-learn, which creates a random binary classification problem with a given number of samples, features, and classes.

To load and preprocess the data, you will need to import the following modules:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

Next, you will generate the dataset with 1000 samples, 2 features, and 2 classes. Because make_classification adds redundant features by default, you also set n_informative=2, n_redundant=0, and n_clusters_per_class=1, so that both features carry information and each class forms a single cluster. You will also set the random_state parameter to 42 for reproducibility. You can use the following code to do this:

X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
                           n_clusters_per_class=1, n_classes=2, random_state=42)

Now, you can plot the dataset using matplotlib. You will use different colors and markers to distinguish the two classes. You will also label the axes and add a legend. You can use the following code to do this:

plt.figure(figsize=(8, 6))
plt.scatter(X[y==0, 0], X[y==0, 1], c='blue', marker='o', label='Class 0')
plt.scatter(X[y==1, 0], X[y==1, 1], c='red', marker='x', label='Class 1')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

As you can see, the dataset has two clusters of points, one for each class. However, there is some overlap between the clusters, which makes the classification problem more challenging.

The next step is to split the dataset into a labeled set and an unlabeled pool. You will use the train_test_split function from scikit-learn, which randomly splits the data into two subsets with a given proportion. You will use 10% of the data as the labeled set and 90% as the unlabeled pool. You will also set the stratify parameter to y to ensure that the class distribution is preserved in both subsets. You can use the following code to do this:

X_labeled, X_pool, y_labeled, y_pool = train_test_split(X, y, test_size=0.9, stratify=y, random_state=42)

Now, you have the data ready for active learning. You can print the shape of each subset to check their sizes:

print('X_labeled shape:', X_labeled.shape)
print('y_labeled shape:', y_labeled.shape)
print('X_pool shape:', X_pool.shape)
print('y_pool shape:', y_pool.shape)

The output should look like this:

X_labeled shape: (100, 2)
y_labeled shape: (100,)
X_pool shape: (900, 2)
y_pool shape: (900,)

As you can see, you have 100 labeled samples and 900 unlabeled samples. You will use the labeled set to train the initial model, and the unlabeled pool to select the most informative samples for labeling.
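
If you want to double-check that the stratified split preserved the class balance, a quick count of the labels in each subset does the trick. This snippet is an optional sanity check and not part of the original workflow:

print('Labeled class counts:', np.bincount(y_labeled))
print('Pool class counts:', np.bincount(y_pool))

Both counts should show a roughly even split between the two classes.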

You have successfully loaded and preprocessed the data for active learning. In the next section, you will define the model and the query strategy for pool-based sampling.

3.2. Define the Model and the Query Strategy

The second step of active learning with Python is to define the model and the query strategy for pool-based sampling. The model is the machine learning algorithm that you want to train and use for prediction. The query strategy is the criterion that you use to select the most informative samples from the pool for labeling.

In this tutorial, you will use a logistic regression model to classify the data. Logistic regression is a simple and widely used algorithm for binary classification that predicts the probability of a sample belonging to a class based on a linear combination of its features. You will use the LogisticRegression class from scikit-learn to create and fit the model.

For the query strategy, you will use uncertainty sampling, which is one of the most common and intuitive methods for active learning. Uncertainty sampling selects the samples that the model is most uncertain about, based on some measure of uncertainty, such as the predicted probability of the most likely class (least confidence) or the margin between the top two class probabilities. You will use the uncertainty_sampling function from modAL to apply this query strategy.

To define the model and the query strategy, you will need to import the following modules:

from sklearn.linear_model import LogisticRegression
from modAL.uncertainty import uncertainty_sampling

Next, you will create the model and the query strategy objects. You will use the default parameters for the LogisticRegression class, except for setting the solver to ‘liblinear’ to avoid convergence warnings. You will also use the uncertainty_sampling function with its default parameters, which select the sample whose most likely class has the lowest predicted probability (least-confidence sampling). You can use the following code to do this:

model = LogisticRegression(solver='liblinear')
query_strategy = uncertainty_sampling
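
To build intuition for what uncertainty_sampling does under the hood, here is a minimal sketch that computes the same least-confidence score by hand. It reuses the X_labeled, y_labeled, and X_pool arrays from Section 3.1; the probe model is only an illustration and is not part of the modAL workflow:

# Fit a throwaway model on the labeled set and score the pool by uncertainty
probe = LogisticRegression(solver='liblinear').fit(X_labeled, y_labeled)
proba = probe.predict_proba(X_pool)

# Least confidence: 1 minus the probability of the most likely class
uncertainty = 1 - proba.max(axis=1)
most_uncertain = np.argmax(uncertainty)
print('Most uncertain pool sample:', X_pool[most_uncertain])
print('Uncertainty score:', round(uncertainty[most_uncertain], 3))

For a binary classifier, scores close to 0.5 mean the model is almost evenly split between the two classes, which is exactly where an extra label is most informative.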

Now, you have the model and the query strategy ready for active learning. In the next section, you will initialize the learner and the labeler for active learning.

3.3. Initialize the Learner and the Labeler

The third step of active learning with Python is to initialize the learner and decide how the queried samples will be labeled. The learner is the object that combines the model and the query strategy, and allows you to train and update the model, select samples from the pool, and add their labels to the training set. In a real project, the labels would come from an oracle such as a human annotator; in this tutorial, the oracle is simulated by the ground-truth labels of the pool (y_pool) that you set aside when splitting the data.

You will use modAL to create the learner object. ModAL is a modular active learning framework that builds on scikit-learn and allows you to easily create and customize your own active learning workflows. ModAL provides classes and functions for different types of active learning, such as pool-based sampling, stream-based sampling, or query-by-committee.

To initialize the learner, you will need to import the following module:

from modAL.models import ActiveLearner

Next, you will create the learner object using the ActiveLearner class from modAL. This class takes the model and the query strategy as arguments, and also allows you to pass the initial labeled data as the training set. You will use the model and the query strategy that you defined in the previous section, and the labeled set that you split from the data. You can use the following code to do this:

learner = ActiveLearner(
    estimator=model,
    query_strategy=query_strategy,
    X_training=X_labeled,
    y_training=y_labeled
)
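
Because the learner is already fitted on the initial labeled set, it behaves like a regular scikit-learn classifier. As an optional sanity check (not required for the rest of the tutorial), you can ask it for predictions and class probabilities on a few pool samples:

# The learner wraps a fitted model, so predict and predict_proba work right away
print('Predicted labels:', learner.predict(X_pool[:5]))
print('Class probabilities:')
print(learner.predict_proba(X_pool[:5]).round(3))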

Now, you have the learner object ready for active learning. You can print the initial accuracy of the learner on the whole dataset using the score method. You can use the following code to do this:

initial_accuracy = learner.score(X, y)
print('Initial accuracy:', initial_accuracy)

The output should look something like this (the exact value depends on the generated data and the split):

Initial accuracy: 0.823

As you can see, the initial accuracy of the learner is 0.823, which is not bad considering that it only used 10% of the data as the training set. However, you can improve this accuracy by using active learning to select more informative samples from the pool.

The next step is to decide how you will provide labels for the samples that the learner queries. ModAL leaves this part to you: in a real application you would show the queried samples to a human annotator, but here the oracle is simulated by the ground-truth labels that came with the synthetic data. Because X_pool and y_pool stay aligned, labeling a queried sample is just a matter of looking up its index in y_pool. For example, you can inspect the first sample in the pool and its label like this:

sample, label = X_pool[0], y_pool[0]
print('Sample:', sample)
print('Label:', label)

The output should look similar to this (the exact values depend on the generated data):

Sample: [-0.94860069 -1.01050011]
Label: 0

This index lookup is exactly what you will do inside the active learning loop: whenever the learner queries a sample, you fetch its label from y_pool, as if an annotator had just provided it.

You have successfully initialized the learner and set up the simulated oracle. In the next section, you will run the active learning loop and visualize the results.

3.4. Run the Active Learning Loop

Now that you have everything set up, you can run the active learning loop and see how your model improves with each iteration. The loop consists of the following steps:

  1. Select the most uncertain samples from the pool using the query strategy.
  2. Query the labeler for the labels of the selected samples.
  3. Add the labeled samples to the training set and remove them from the pool.
  4. Update the model with the new training set.
  5. Repeat until a stopping criterion is met.

You can use the query and teach methods of the ActiveLearner object to perform the first and the fourth steps, respectively. The query method returns the indices and the feature values of the selected samples, and the teach method takes the feature values and the labels of those samples, adds them to the training set, and refits the model.

To simulate the labeling step, you simply look up the true labels of the queried samples in y_pool, as described in the previous section. After teaching the learner, you remove the queried samples from X_pool and y_pool so that they cannot be selected again.

You can also use the score method of the ActiveLearner object to evaluate the model by passing features and labels as arguments. For consistency with the initial accuracy you computed earlier, you will score the model on the full dataset (X, y) and use this number to track the progress of the loop; scoring on data that includes the training samples is a simplification that keeps the example short, and in practice you would hold out a separate test set. You will also store the accuracies in a list so that you can plot a learning curve later.

The following code shows how to run the active learning loop for 10 iterations, and print the accuracy score of the model at each iteration.

# Define the number of iterations
n_iter = 10

# Keep track of the accuracy after each query for the learning curve
active_accuracies = []

# Run the active learning loop
for i in range(n_iter):
    # Select the most uncertain sample from the pool
    query_idx, query_inst = learner.query(X_pool)

    # Simulate the oracle: look up the true labels of the selected samples
    query_labels = y_pool[query_idx]

    # Add the labeled samples to the training set and refit the model
    learner.teach(query_inst, query_labels)

    # Remove the queried samples from the pool
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)

    # Evaluate the performance of the model on the full dataset
    acc = learner.score(X, y)
    active_accuracies.append(acc)

    # Print the accuracy score
    print(f"Iteration {i+1}: Accuracy = {acc:.3f}")

After running the code, you should see something like this (the exact numbers depend on the data and on which samples are queried):

Iteration 1: Accuracy = 0.800
Iteration 2: Accuracy = 0.867
Iteration 3: Accuracy = 0.933
Iteration 4: Accuracy = 0.933
Iteration 5: Accuracy = 0.933
Iteration 6: Accuracy = 0.933
Iteration 7: Accuracy = 0.933
Iteration 8: Accuracy = 0.933
Iteration 9: Accuracy = 0.933
Iteration 10: Accuracy = 0.933

As you can see, the accuracy of the model increases over the first few iterations and then plateaus. This suggests that the most informative samples near the decision boundary have already been labeled, and additional queries bring little further gain. In the next section, you will build a passive learning baseline that picks samples at random and compare its learning curve with the active one, so you can see how much labeled data active learning actually saves.
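
One practical note before moving on: querying a single sample per iteration keeps the example simple, but modAL's uncertainty_sampling also accepts an n_instances argument, which the learner's query method forwards to the strategy. A minimal sketch of querying a small batch (the batch size of 5 is an arbitrary choice for illustration):

# Query the five most uncertain samples in one go
query_idx, query_inst = learner.query(X_pool, n_instances=5)
learner.teach(query_inst, y_pool[query_idx])

# Remove the whole batch from the pool
X_pool = np.delete(X_pool, query_idx, axis=0)
y_pool = np.delete(y_pool, query_idx, axis=0)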

But what does the active learning process look like visually? How does the model select samples from the pool and update its decision boundary? You will answer these questions in the next section, where you will visualize the results of the active learning loop.

3.5. Evaluate the Performance and Visualize the Results

In this section, you will evaluate the performance of your active learning model and visualize the results of the active learning loop. You will compare the accuracy of your model with the accuracy of a passive learning model that randomly selects samples from the pool. You will also plot the decision boundary of the final model together with the samples it has been trained on, including the ones queried during the loop.

To compare the active and passive approaches, you will track the accuracy of each model after every query and plot the two learning curves with matplotlib. The passive baseline uses the same logistic regression model and the same initial labeled set, but it picks samples from the pool at random instead of querying the most uncertain ones. The active learning accuracies are already stored in the active_accuracies list from the previous section, so all that remains is to run the random-sampling baseline and plot both curves, as shown in the sketch below.
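
Here is a minimal sketch of the passive baseline and the learning-curve plot. The names passive_learner, X_pool_passive, y_pool_passive, and the rng seed are illustrative choices introduced here (they are not part of modAL); the fresh pool is recreated with the same train_test_split call as before, so the comparison starts from an identical split:

# Recreate a fresh copy of the unlabeled pool for the passive baseline
_, X_pool_passive, _, y_pool_passive = train_test_split(X, y, test_size=0.9, stratify=y, random_state=42)

# Passive learner: same model and initial labeled set, but random queries
passive_learner = ActiveLearner(
    estimator=LogisticRegression(solver='liblinear'),
    X_training=X_labeled,
    y_training=y_labeled
)

rng = np.random.default_rng(42)
passive_accuracies = []

for i in range(n_iter):
    # Pick a random sample instead of the most uncertain one
    idx = rng.integers(len(X_pool_passive))
    passive_learner.teach(X_pool_passive[idx:idx+1], y_pool_passive[idx:idx+1])
    X_pool_passive = np.delete(X_pool_passive, idx, axis=0)
    y_pool_passive = np.delete(y_pool_passive, idx, axis=0)
    passive_accuracies.append(passive_learner.score(X, y))

# Plot both learning curves
plt.figure(figsize=(8, 6))
plt.plot(range(1, n_iter + 1), active_accuracies, marker='o', label='Active (uncertainty sampling)')
plt.plot(range(1, n_iter + 1), passive_accuracies, marker='s', label='Passive (random sampling)')
plt.xlabel('Number of queried samples')
plt.ylabel('Accuracy on the full dataset')
plt.legend()
plt.show()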

In a typical run, the active learning curve climbs faster than the passive one and reaches a higher accuracy for the same number of labeled samples, demonstrating the effectiveness of active learning.

To visualize what the model has learned, you can plot its decision boundary together with the full dataset and highlight the samples the learner has actually been trained on, which include the initial labeled set and every sample queried during the loop. Evaluating the model on a grid of points and drawing a filled contour shows the decision regions, and the highlighted points show where the queries landed, as in the sketch below.
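
A minimal sketch of this plot, assuming the learner keeps its accumulated training data in the X_training attribute (which modAL's ActiveLearner exposes); the grid resolution of 200 points per axis is an arbitrary choice:

# Evaluate the final model on a grid covering the feature space
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = learner.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')

# Plot all samples, then highlight the ones the learner was trained on
plt.scatter(X[y==0, 0], X[y==0, 1], c='blue', marker='o', s=10, label='Class 0')
plt.scatter(X[y==1, 0], X[y==1, 1], c='red', marker='x', s=10, label='Class 1')
plt.scatter(learner.X_training[:, 0], learner.X_training[:, 1],
            facecolors='none', edgecolors='black', s=60, label='Labeled samples')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()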

As you can see, the decision boundary separates the two classes well, even though the model has been trained on only a small fraction of the data. The highlighted samples that were queried during the loop tend to lie close to the decision boundary, which is exactly where the model was most uncertain.

Congratulations! You have successfully implemented active learning with Python using scikit-learn and modAL. You have learned how to define a model and a query strategy, initialize a learner and a simulated oracle, run an active learning loop, and evaluate and visualize the results. You have also seen how active learning can improve the performance of your model with less labeled data, compared to passive learning.

In the next and final section, you will summarize the main points of this tutorial and provide some resources for further learning.

4. Conclusion

In this tutorial, you have learned how to implement active learning with Python using two popular libraries: scikit-learn and modAL. You have seen how active learning can help you train a machine learning model with less labeled data by selecting the most informative samples for labeling. You have also learned how to use different components of active learning, such as:

  • A model: You have used a logistic regression model to classify the data into two classes.
  • A query strategy: You have used a pool-based sampling approach, where you have selected the most uncertain samples from a large pool of unlabeled data.
  • A learner: You have used an ActiveLearner object from the modAL.models module, which wraps the model and the query strategy and provides methods for querying, teaching, and scoring the model.
  • A simulated oracle: You have looked up the true labels of the queried samples in y_pool, standing in for the human annotator who would provide labels in a real project.
  • An active learning loop: You have run an active learning loop for 10 iterations, where you queried the oracle for the labels of the most uncertain samples, added them to the training set, and updated the model accordingly.
  • An evaluation and visualization: You have evaluated the performance of your active learning model by comparing its learning curve with that of a passive learning model that randomly selects samples from the pool, and you have visualized the decision boundary of the final model together with the samples it was trained on.

You have successfully completed this tutorial and gained valuable skills and knowledge in active learning with Python. You can apply these skills and knowledge to your own machine learning projects, where you have a large amount of unlabeled data and a limited budget or resources for labeling. You can also explore other types of active learning strategies, such as stream-based sampling, batch-mode sampling, or query-by-committee, and see how they affect the performance and efficiency of your model.

If you want to learn more about active learning with Python, scikit-learn, and modAL, here are some useful resources:

  • The official documentation of scikit-learn, which provides a comprehensive guide to the library and its features, as well as examples and tutorials on various machine learning tasks.
  • The official documentation of modAL, which provides a detailed overview of the library and its modules, as well as examples and tutorials on different active learning scenarios.
  • The book Machine Learning with Python and scikit-learn by Prateek Joshi and Dipanjan Sarkar, which covers the fundamentals and applications of machine learning with Python and scikit-learn, including active learning.
  • The paper Active learning literature survey by Burr Settles, which provides a comprehensive survey of the active learning research and methods, as well as a taxonomy and a bibliography of the field.

Thank you for reading this tutorial and following along. I hope you enjoyed it and learned something new. If you have any questions or feedback, please feel free to leave a comment below. Happy learning!
