1. Introduction
Active learning is a machine learning technique that aims to reduce the amount of labeled data required for training a model. Instead of using all the available data, active learning allows the model to select the most informative samples to be labeled by an expert, such as a human annotator. This way, active learning can improve the model’s performance with less data and lower annotation costs.
But how does the model decide which samples to query? And what are the different scenarios in which active learning can be applied? These are the main questions that this blog will answer. You will learn about the different active learning scenarios and the corresponding query strategies that can be used to solve various active learning problems. You will also see some examples and comparisons of these methods, and how they can affect the model’s performance and efficiency.
By the end of this blog, you will have a better understanding of the concept and applications of active learning, and how to choose the best scenario and query strategy for your active learning problem. You will also be able to implement some of the query strategies using Python code snippets that will be provided along the way.
So, are you ready to dive into the world of active learning? Let’s get started!
2. Active Learning Scenarios
In this section, you will learn about the different scenarios in which active learning can be applied. Depending on the availability and nature of the data, there are three main scenarios: pool-based active learning, stream-based active learning, and membership query synthesis. Each scenario has its own advantages and challenges, and requires a different approach to select the most informative samples for labeling.
Pool-based active learning is the most common scenario, where the model has access to a large pool of unlabeled data, and can query any sample from the pool. The model selects the samples that are most uncertain or informative according to some criterion, and asks the expert to label them. The labeled samples are then added to the training set, and the model is retrained. This process is repeated until a desired performance level is reached, or the budget is exhausted.
Stream-based active learning is a scenario where the model receives a stream of unlabeled data, and has to decide whether to query each sample or not. The model can only query one sample at a time, and cannot access the previous or future samples. The model has to balance between querying too many samples, which can be costly and redundant, and querying too few samples, which can lead to poor performance.
Membership query synthesis is a scenario where the model can generate its own synthetic samples, and ask the expert to label them. The model tries to create samples that are close to the decision boundary, or that explore the input space. This scenario can be useful when there is no or very little unlabeled data available, or when the data distribution is unknown or complex.
As you can see, each scenario poses different challenges and opportunities for active learning. The following subsections look at each scenario in more detail, and Section 3 covers the query strategies that can be used to select the most informative samples in each of them.
2.1. Pool-based Active Learning
As described above, pool-based active learning assumes that the model has access to a large pool of unlabeled data and can query any sample from it. The model selects the samples that are most uncertain or informative according to some criterion, asks the expert to label them, adds the labeled samples to the training set, and is retrained. This process repeats until a desired performance level is reached or the labeling budget is exhausted.
But how does the model select the most informative samples from the pool? This is where the query strategies come in. A query strategy is a method that ranks the unlabeled samples according to some measure of informativeness, and selects the top-ranked ones to be labeled. There are many different query strategies, but they can be broadly classified into three categories: uncertainty sampling, query by committee, and expected error reduction.
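Before looking at the individual strategies, it helps to see the generic pool-based loop that they all plug into. Below is a minimal sketch under the assumption that a scoring function (the placeholder score_fn) ranks the pool by informativeness and that the pool's labels stand in for the expert's answers; none of the names come from any particular library.

import numpy as np

def pool_based_active_learning(model, X_labeled, y_labeled, X_pool, y_pool,
                               score_fn, n_queries, batch_size):
    # Generic pool-based loop: score the pool, query the top samples, retrain.
    model.fit(X_labeled, y_labeled)
    for _ in range(n_queries):
        scores = score_fn(model, X_pool)              # informativeness of each pool sample
        idx = np.argsort(scores)[-batch_size:]        # indices of the highest-scoring samples
        X_labeled = np.concatenate((X_labeled, X_pool[idx]))
        y_labeled = np.concatenate((y_labeled, y_pool[idx]))  # y_pool simulates the expert
        X_pool = np.delete(X_pool, idx, axis=0)
        y_pool = np.delete(y_pool, idx, axis=0)
        model.fit(X_labeled, y_labeled)               # retrain on the enlarged labeled set
    return model, X_labeled, y_labeled

The three strategies below differ only in how score_fn is defined.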
Uncertainty sampling is a query strategy that selects the samples that the model is most uncertain about. For example, if the model is a classifier, it can select the samples that have the lowest confidence or the highest entropy in their predicted class probabilities. Uncertainty sampling is simple and intuitive, but it can also be biased and noisy, especially when the model is very inaccurate or the data is very diverse.
Query by committee is a query strategy that selects the samples that have the most disagreement among a committee of models. For example, if the model is a classifier, it can select the samples that have the highest variance or the lowest agreement in their predicted class labels by the committee members. Query by committee is more robust and diverse than uncertainty sampling, but it also requires more computational resources and coordination among the committee members.
Expected error reduction is a query strategy that selects the samples that are expected to reduce the model’s error the most. For example, if the model is a classifier, it can select the samples whose labeling is expected to yield the largest reduction in the model’s loss or generalization error. Expected error reduction is more directly tied to the model’s performance than uncertainty sampling or query by committee, but it also requires much more complex and expensive calculations and estimations.
As you can see, each query strategy has its own pros and cons, and can be more or less suitable for different active learning problems. In Section 3, you will learn more about each query strategy, and see examples and Python code snippets showing how to implement them.
2.2. Stream-based Active Learning
As described above, in stream-based active learning the model receives a stream of unlabeled data and has to decide, for each incoming sample, whether to query it or not. The model sees one sample at a time and cannot revisit previous samples or look ahead at future ones. It has to balance between querying too many samples, which is costly and redundant, and querying too few, which leads to poor performance.
But how does the model decide whether to query a sample or not? This is also where the query strategies come in. A query strategy is a method that evaluates the informativeness of each sample, and compares it with a threshold value. If the informativeness is above the threshold, the sample is queried. Otherwise, the sample is discarded. The threshold value can be fixed or adaptive, depending on the problem and the data.
Some of the query strategies used in stream-based active learning are the same as in the pool-based setting, such as uncertainty sampling, query by committee, and expected error reduction. However, they have to be adapted to the streaming setting, where the model must make a quick and irreversible decision for each sample. In practice, this usually means comparing the sample's informativeness score, for example the entropy of the model's prediction, the disagreement within a committee, or the expected reduction in error, against a fixed or adaptive threshold.
Other query strategies that are often used in stream-based active learning are random sampling, density-weighted sampling, and expected model change. Random sampling is a simple baseline strategy that queries each sample with a fixed probability. Density-weighted sampling queries each sample with a probability proportional to its density in the input space, which helps to avoid querying outliers or noisy samples. Expected model change queries the samples whose labeling is expected to change the model's parameters the most, which helps to avoid querying redundant or uninformative samples.
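To make the stream-based setting concrete, here is a minimal sketch of uncertainty-based querying against a fixed entropy threshold; the dataset, the threshold value, and the simulated stream are illustrative assumptions rather than a fixed recipe.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Illustrative setup: a small labeled seed set and a simulated stream of samples
iris = load_iris()
X, y = iris.data, iris.target
rng = np.random.default_rng(0)
order = rng.permutation(len(X))
X_labeled, y_labeled = list(X[order[:10]]), list(y[order[:10]])
stream = order[10:]

model = LogisticRegression(max_iter=1000).fit(X[order[:10]], y[order[:10]])

threshold = 0.5  # assumed fixed entropy threshold (in nats); adaptive thresholds are also common

for idx in stream:
    probs = model.predict_proba(X[idx:idx + 1])[0]
    uncertainty = -np.sum(probs * np.log(probs + 1e-12))
    if uncertainty > threshold:
        # Query the expert (here the true label stands in for the expert's answer)
        X_labeled.append(X[idx])
        y_labeled.append(y[idx])
        model.fit(np.array(X_labeled), np.array(y_labeled))
    # Otherwise the sample is discarded and never revisited

print("Samples queried from the stream:", len(X_labeled) - 10)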
As you can see, each query strategy has its own advantages and disadvantages, and can be more or less effective for different stream-based active learning problems. In the next subsection, you will learn about another scenario of active learning, where the model can generate its own samples to be labeled.
2.3. Membership Query Synthesis
As described above, in membership query synthesis the model can generate its own synthetic samples and ask the expert to label them. The model tries to create samples that are close to the decision boundary, or that explore under-represented regions of the input space. This scenario is useful when there is little or no unlabeled data available, or when the data distribution is unknown or complex.
But how does the model generate synthetic samples? This is also where the query strategies come in. A query strategy is a method that creates samples that are likely to be informative, and submits them to the expert for labeling. There are many different query strategies, but they can be broadly classified into two categories: generative models and optimization methods.
Generative models are query strategies that use a probabilistic model to generate samples from the data distribution. For example, if the model is a classifier, it can use a generative adversarial network (GAN) or a variational autoencoder (VAE) to create realistic and diverse samples that can be labeled by the expert. Generative models can capture the complexity and variability of the data, but they can also be difficult to train and evaluate, and may generate samples that are irrelevant or out of distribution.
Optimization methods are query strategies that use an optimization algorithm to generate samples that optimize some objective function. For example, if the model is a classifier, it can use a gradient-based or a genetic algorithm to create samples that are close to the decision boundary or that maximize the model’s uncertainty. Optimization methods can create samples that are informative and challenging for the model, but they can also be computationally expensive and sensitive to the choice of the objective function and the algorithm parameters.
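As a rough illustration of the optimization flavour, the sketch below uses simple random search to synthesize a sample that maximizes a classifier's predictive entropy and would then be sent to the expert for labeling; the dataset, the search budget, and the helper names are all illustrative assumptions.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Illustrative setup: a classifier trained on a small labeled subset
iris = load_iris()
X, y = iris.data, iris.target
rng = np.random.default_rng(0)
order = rng.permutation(len(X))
model = LogisticRegression(max_iter=1000).fit(X[order[:30]], y[order[:30]])

# Predictive entropy of a single candidate sample (higher = more uncertain)
def predictive_entropy(x):
    probs = model.predict_proba(x.reshape(1, -1))[0]
    return -np.sum(probs * np.log(probs + 1e-12))

# Random search within the observed feature ranges for a maximally uncertain sample
low, high = X.min(axis=0), X.max(axis=0)
candidates = rng.uniform(low, high, size=(1000, X.shape[1]))
scores = np.array([predictive_entropy(c) for c in candidates])
synthetic_query = candidates[np.argmax(scores)]
print("Synthesized sample to send to the expert for labeling:", synthetic_query)

A gradient-based search or a generative model could replace the random search here, at the cost of more tuning and computation.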
As you can see, each query strategy has its own strengths and weaknesses, and can be more or less appropriate for different membership query synthesis problems. In the next section, you will take a closer look at the main query strategies, with code examples, before comparing and evaluating them in Section 4.
3. Query Strategies
In this section, you will learn about the different query strategies that can be used to select the most informative samples for labeling in active learning. As you have seen in the previous section, there are three main categories of query strategies: uncertainty sampling, query by committee, and expected error reduction. Each category has its own subtypes and variations, and can be applied to different active learning scenarios.
But how do you choose the best query strategy for your active learning problem? And how do you compare and evaluate the performance and efficiency of different query strategies? These are the questions that this section will answer. You will learn about the factors that influence the choice of the query strategy, such as the data characteristics, the model architecture, the annotation cost, and the performance metric. You will also learn about the methods and tools that can help you to compare and evaluate the query strategies, such as learning curves, active learning metrics, and active learning frameworks.
By the end of this section, you will have a better understanding of the concept and applications of query strategies, and how to choose and compare them for your active learning problem. You will also be able to implement some of the query strategies using Python code snippets that will be provided along the way.
So, are you ready to explore the world of query strategies? Let’s begin!
3.1. Uncertainty Sampling
One of the most widely used query strategies in active learning is uncertainty sampling. The idea behind this strategy is to select the samples that the model is most uncertain about, and ask the expert to label them. The intuition is that these samples are the most informative for the model, as they can reduce the model’s uncertainty and improve its performance.
But how do we measure the model’s uncertainty? There are different ways to do that, depending on the type of model and the task. For a classifier, we can use the predicted class probabilities: in a binary task, the closer the predicted probability is to 0.5, the more uncertain the model is. More generally, we can select the samples with the highest entropy of the predicted probabilities, the smallest margin between the two most likely classes, or the lowest confidence in the predicted class.
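To make these measures concrete, here is a minimal sketch (the function names are illustrative helpers, not part of scikit-learn) of the three most common uncertainty scores computed from a matrix of predicted class probabilities:

import numpy as np

# probs has shape (n_samples, n_classes) and holds predicted class probabilities
def least_confidence(probs):
    return 1 - np.max(probs, axis=1)                        # low top probability = high uncertainty

def margin(probs):
    part = np.sort(probs, axis=1)
    return part[:, -1] - part[:, -2]                        # small gap between top two classes = high uncertainty

def prediction_entropy(probs):
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)   # high entropy = high uncertainty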
Here is a simple example of how to implement uncertainty sampling using Python and scikit-learn. We will use a logistic regression model to classify the iris dataset, and use the entropy criterion to select the most uncertain samples.
# Import the libraries
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Shuffle the data and hold out a test set
np.random.seed(42)
indices = np.random.permutation(len(X))
X_test, y_test = X[indices[100:]], y[indices[100:]]

# Use the first 10 shuffled samples as the initial labeled set and the rest as the unlabeled pool
X_labeled, y_labeled = X[indices[:10]], y[indices[:10]]
X_pool, y_pool = X[indices[10:100]], y[indices[10:100]]

# Initialize the model and train it on the initial labeled set
model = LogisticRegression(max_iter=1000)
model.fit(X_labeled, y_labeled)

# Evaluate the model on the test set
print("Initial accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Entropy of the predicted class probabilities (higher = more uncertain)
def entropy(probs):
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

# Select the n_instances pool samples with the highest entropy
def query(X_pool, n_instances):
    probs = model.predict_proba(X_pool)
    return np.argsort(entropy(probs))[-n_instances:]

# Active learning loop (9 queries of 10 samples exhaust the 90-sample pool)
n_queries = 9
n_instances = 10
for i in range(n_queries):
    # Select the most informative instances from the pool
    idx = query(X_pool, n_instances)
    # Add them to the labeled set (the pool labels stand in for the expert's answers)
    X_labeled = np.concatenate((X_labeled, X_pool[idx]))
    y_labeled = np.concatenate((y_labeled, y_pool[idx]))
    # Remove them from the pool
    X_pool = np.delete(X_pool, idx, axis=0)
    y_pool = np.delete(y_pool, idx, axis=0)
    # Retrain the model and evaluate it on the test set
    model.fit(X_labeled, y_labeled)
    print(f"Accuracy after query {i+1}:", accuracy_score(y_test, model.predict(X_test)))
As you can see, the model’s accuracy generally improves as more informative samples are added to the training set, which illustrates the effectiveness of uncertainty sampling as a query strategy for active learning.
However, uncertainty sampling also has some limitations and challenges. For example, it can be biased towards outliers or noisy samples, which can be very uncertain but not very informative. It can also suffer from redundancy, as it can select similar samples that have the same uncertainty level. Moreover, it can be difficult to apply uncertainty sampling to complex models or tasks, such as deep neural networks or multi-label classification, where the uncertainty measure is not straightforward.
In the next section, you will learn about another query strategy that tries to overcome some of these limitations: query by committee.
3.2. Query by Committee
Another popular query strategy in active learning is query by committee. This strategy involves creating a committee of models, each trained on the same labeled data but with different initializations, hyperparameters, or resampled versions of that data. The committee then votes on the labels of the unlabeled samples, and the samples with the most disagreement among the committee members are selected for querying. The intuition is that these samples are the most informative, because labeling them resolves the disagreement and narrows down the set of hypotheses that are consistent with the data.
But how do we measure the disagreement among the committee members? There are different ways to do that, depending on the type of model and the task. For example, for a classification task, we can use the vote entropy, which is the entropy of the distribution of votes among the committee members. The higher the vote entropy, the more disagreement there is. Therefore, we can select the samples that have the highest vote entropy as the most informative ones.
Here is a simple example of how to implement query by committee using Python and scikit-learn. We will use a committee of three logistic regression models with different regularization strengths to classify the iris dataset, and use the vote entropy criterion to select the most informative samples.
# Import the libraries
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from scipy.stats import entropy

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Shuffle the data and hold out a test set
np.random.seed(42)
indices = np.random.permutation(len(X))
X_test, y_test = X[indices[100:]], y[indices[100:]]

# Use the first 10 shuffled samples as the initial labeled set and the rest as the unlabeled pool
X_labeled, y_labeled = X[indices[:10]], y[indices[:10]]
X_pool, y_pool = X[indices[10:100]], y[indices[10:100]]

# Initialize the committee: same model class, different regularization strengths for diversity
models = [LogisticRegression(C=c, max_iter=1000) for c in (0.01, 1.0, 100.0)]

# Train every committee member on the current labeled set
def train_committee():
    for model in models:
        model.fit(X_labeled, y_labeled)

# Mean accuracy of the committee members on the test set
def committee_accuracy():
    return np.mean([accuracy_score(y_test, model.predict(X_test)) for model in models])

train_committee()
print("Initial accuracy:", committee_accuracy())

# Vote entropy: entropy of the distribution of the members' votes (higher = more disagreement)
def vote_entropy(votes):
    return entropy(np.bincount(votes) / len(votes), base=2)

# Select the n_instances pool samples with the highest vote entropy
def query(X_pool, n_instances):
    preds = [model.predict(X_pool) for model in models]
    entropies = [vote_entropy(np.array(votes)) for votes in zip(*preds)]
    return np.argsort(entropies)[-n_instances:]

# Active learning loop (9 queries of 10 samples exhaust the 90-sample pool)
n_queries = 9
n_instances = 10
for i in range(n_queries):
    # Select the most informative instances from the pool
    idx = query(X_pool, n_instances)
    # Add them to the labeled set and remove them from the pool
    X_labeled = np.concatenate((X_labeled, X_pool[idx]))
    y_labeled = np.concatenate((y_labeled, y_pool[idx]))
    X_pool = np.delete(X_pool, idx, axis=0)
    y_pool = np.delete(y_pool, idx, axis=0)
    # Retrain the committee and evaluate it on the test set
    train_committee()
    print(f"Accuracy after query {i+1}:", committee_accuracy())
As you can see, the committee’s accuracy generally improves as more informative samples are added to the training set, which illustrates the effectiveness of query by committee as a query strategy for active learning.
However, query by committee also has some limitations and challenges. For example, it can be computationally expensive, as it requires maintaining and updating multiple models. It can also be sensitive to the choice and diversity of the committee members, as they can affect the quality and reliability of the votes. Moreover, it can be difficult to apply query by committee to complex models or tasks, such as deep neural networks or regression, where the voting scheme is not straightforward.
In the next section, you will learn about another query strategy that tries to overcome some of these limitations: expected error reduction.
3.3. Expected Error Reduction
A third query strategy in active learning is expected error reduction. This strategy involves estimating the expected reduction in the model’s error if a sample is labeled and added to the training set. The intuition is that the samples that can reduce the model’s error the most are the most informative for the model, as they can improve the model’s performance and generalization.
But how do we estimate the expected error reduction? There are different ways to do that, depending on the type of model and the task. For a classification task, a common approach is to retrain the model with each possible label for a candidate sample, estimate the resulting error on the remaining unlabeled data, and weight these errors by the current model’s predicted probabilities for those labels. The expected error reduction is then the difference between the current estimated error and this expected error after labeling the sample.
Here is a simple example of how to implement expected error reduction using Python and scikit-learn. We will use a Gaussian process classifier to classify the iris dataset, and estimate the expected error over the unlabeled pool for each candidate label, weighted by the current predicted probabilities.
# Import the libraries
import numpy as np
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Shuffle the data and hold out a test set
np.random.seed(42)
indices = np.random.permutation(len(X))
X_test, y_test = X[indices[100:]], y[indices[100:]]

# Use the first 10 shuffled samples as the initial labeled set and the rest as the unlabeled pool
X_labeled, y_labeled = X[indices[:10]], y[indices[:10]]
X_pool, y_pool = X[indices[10:100]], y[indices[10:100]]

# Initialize the model and train it on the initial labeled set
model = GaussianProcessClassifier()
model.fit(X_labeled, y_labeled)
print("Initial accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Estimated error of a classifier over a set of samples: mean of (1 - highest class probability)
def estimated_error(clf, X_eval):
    return np.mean(1 - np.max(clf.predict_proba(X_eval), axis=1))

# Expected error reduction: for each candidate, retrain with every possible label,
# estimate the error on the pool, and weight it by the current predicted label probabilities
def expected_error_reduction(X_pool, n_instances, n_candidates=20):
    current_error = estimated_error(model, X_pool)
    probs = model.predict_proba(X_pool)
    # Scoring every pool sample is expensive, so evaluate a random subset of candidates
    candidates = np.random.choice(len(X_pool), size=min(n_candidates, len(X_pool)), replace=False)
    eers = []
    for i in candidates:
        expected_error = 0.0
        for label_idx, label in enumerate(model.classes_):
            # Retrain a fresh model as if the candidate had this label
            clf = GaussianProcessClassifier()
            clf.fit(np.vstack([X_labeled, X_pool[i:i + 1]]), np.append(y_labeled, label))
            # Weight the resulting error by the current probability of this label
            expected_error += probs[i, label_idx] * estimated_error(clf, X_pool)
        eers.append(current_error - expected_error)
    # Select the n_instances candidates with the highest expected error reduction
    best = np.argsort(eers)[-n_instances:]
    return candidates[best]

# Active learning loop (kept small because each query retrains the model many times)
n_queries = 5
n_instances = 5
for i in range(n_queries):
    # Select the most informative instances from the pool
    idx = expected_error_reduction(X_pool, n_instances)
    # Add them to the labeled set and remove them from the pool
    X_labeled = np.concatenate((X_labeled, X_pool[idx]))
    y_labeled = np.concatenate((y_labeled, y_pool[idx]))
    X_pool = np.delete(X_pool, idx, axis=0)
    y_pool = np.delete(y_pool, idx, axis=0)
    # Retrain the model and evaluate it on the test set
    model.fit(X_labeled, y_labeled)
    print(f"Accuracy after query {i+1}:", accuracy_score(y_test, model.predict(X_test)))
As you can see, the model’s accuracy generally improves as more informative samples are added to the training set, which illustrates the effectiveness of expected error reduction as a query strategy for active learning.
However, expected error reduction also has some limitations and challenges. For example, it is computationally intensive, as it requires retraining the model for every candidate sample and every possible label, and re-estimating the error each time. It is also sensitive to the accuracy and calibration of the model, as it relies on the model’s predicted probabilities to weight the possible labels. Moreover, it is difficult to apply to large pools or to models that are expensive to retrain, such as deep neural networks, unless approximations or subsampling are used.
In the next section, you will learn about how to compare and evaluate the different query strategies and their impact on the model’s performance and efficiency.
4. Comparison and Evaluation
In this section, you will learn how to compare and evaluate the different query strategies that you have learned in the previous section. You will also see some examples of how these strategies perform on different active learning scenarios and datasets. You will learn how to measure the effectiveness and efficiency of each strategy, and how to choose the best one for your problem.
One of the main goals of active learning is to achieve a high performance with a small amount of labeled data. Therefore, one of the common metrics to evaluate the query strategies is the learning curve, which plots the model’s performance (such as accuracy, F1-score, etc.) against the number of labeled samples. The learning curve can show how fast the model learns from the labeled data, and how much improvement each query strategy can bring. Ideally, the learning curve should be steep and reach a high performance level with a low number of labeled samples.
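As a simple illustration, here is a minimal sketch of plotting learning curves with matplotlib; the accuracy values below are hypothetical placeholders standing in for results you would record during your own active learning runs.

import matplotlib.pyplot as plt

# Hypothetical placeholder results: test accuracy recorded after each batch of labels
n_labeled = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
acc_uncertainty = [0.62, 0.74, 0.81, 0.86, 0.89, 0.91, 0.92, 0.93, 0.93, 0.94]
acc_random = [0.62, 0.68, 0.73, 0.78, 0.82, 0.85, 0.87, 0.89, 0.90, 0.92]

# A steeper curve means the strategy reaches high accuracy with fewer labels
plt.plot(n_labeled, acc_uncertainty, marker="o", label="Uncertainty sampling")
plt.plot(n_labeled, acc_random, marker="s", label="Random sampling")
plt.xlabel("Number of labeled samples")
plt.ylabel("Test accuracy")
plt.title("Learning curves of two query strategies")
plt.legend()
plt.show()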
Another metric to evaluate the query strategies is the annotation cost, which measures the time and effort required to label the queried samples. The annotation cost depends on the complexity and size of the samples, and the availability and expertise of the annotators. The annotation cost can be estimated by the number of queries, the number of features, the number of classes, and the difficulty of the labeling task. Ideally, the annotation cost should be low and proportional to the performance gain.
To compare and evaluate the query strategies, you can use some of the existing tools and frameworks that are available for active learning. For example, you can use the modAL library, which is a modular and flexible Python framework for active learning. modAL allows you to implement and customize different query strategies, active learning scenarios, and machine learning models. You can also use the scikit-learn library, which is a popular and comprehensive Python library for machine learning. scikit-learn provides many datasets, models, and metrics that you can use for active learning experiments.
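For example, here is a minimal sketch of a pool-based uncertainty sampling loop built on modAL's ActiveLearner together with a scikit-learn estimator; the dataset split and the number of queries are illustrative choices.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling

# Illustrative setup: a small initial labeled set and the rest as the unlabeled pool
iris = load_iris()
X, y = iris.data, iris.target
rng = np.random.default_rng(42)
order = rng.permutation(len(X))
X_initial, y_initial = X[order[:10]], y[order[:10]]
X_pool, y_pool = X[order[10:]], y[order[10:]]

# Wrap a scikit-learn estimator and a query strategy in a modAL ActiveLearner
learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    query_strategy=uncertainty_sampling,
    X_training=X_initial,
    y_training=y_initial,
)

# Query and teach one sample at a time; the pool labels stand in for the expert
for _ in range(20):
    query_idx, query_instance = learner.query(X_pool)
    learner.teach(X_pool[query_idx], y_pool[query_idx])
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)

print("Accuracy on the remaining pool:", learner.score(X_pool, y_pool))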
The sketches above show how these tools fit together with the query strategies from the previous section. In practice, you would run several strategies on your own dataset, plot their learning curves, and weigh the performance gains against the annotation cost before settling on one.
5. Conclusion
In this blog, you have learned about the concept and applications of active learning, a machine learning technique that aims to reduce the amount of labeled data required for training a model. You have learned about the different active learning scenarios, such as pool-based, stream-based, and membership query synthesis, and the corresponding query strategies, such as uncertainty sampling, query by committee, and expected error reduction. You have also learned how to compare and evaluate the query strategies using the learning curve and the annotation cost metrics, and how to use the modAL and scikit-learn libraries to implement and experiment with the query strategies on different datasets.
By applying the knowledge and skills that you have gained from this blog, you will be able to choose the best scenario and query strategy for your active learning problem, and improve the performance and efficiency of your model with less data and lower annotation costs. You will also be able to explore more advanced and novel query strategies, such as diversity sampling, active learning with deep neural networks, and active learning with reinforcement learning.
We hope that you have enjoyed this blog and found it useful and informative. If you have any questions or feedback, please feel free to leave a comment below. Thank you for reading and happy active learning!