Active Learning for Clustering: A Case Study

This blog shows how to apply active learning to clustering using unsupervised learning techniques such as K-means and DBSCAN. It presents a case study of clustering customer segments based on their purchase behavior.

1. Introduction

Clustering is a type of unsupervised learning that aims to group similar data points into clusters based on some similarity or distance measure. Clustering can be useful for tasks such as data analysis, data compression, data visualization, and anomaly detection.

However, clustering can also be challenging, as it often requires choosing the right number of clusters, the right clustering algorithm, and the right parameters for the algorithm. Moreover, clustering can be affected by noise, outliers, and overlapping clusters, which can reduce the quality and interpretability of the results.

How can we overcome these challenges and improve the performance and robustness of clustering? One possible solution is to use active learning for clustering, which is a technique that leverages human feedback to guide the clustering process. Active learning for clustering can help to select the most informative data points, refine the cluster boundaries, and validate the clustering results.

In this blog, we will show you how to apply active learning for clustering using unsupervised learning techniques such as K-means and DBSCAN. We will present a case study of clustering customer segments based on their purchase behavior, using a real-world dataset from an online retail store. We will also demonstrate how to use data visualization to explore the data and evaluate the clustering results.

By the end of this blog, you will learn:

  • What active learning for clustering is and why it is useful.
  • How to implement active learning for clustering using Python and scikit-learn.
  • How to compare and contrast different clustering algorithms and active learning strategies.
  • How to use data visualization to gain insights into the data and the clusters.

Are you ready to dive into active learning for clustering? Let’s get started!

2. Active Learning for Clustering

In this section, we will introduce the concept of active learning for clustering and explain how it can improve the clustering performance and robustness. We will also discuss the main challenges and opportunities of applying active learning to clustering problems.

Active learning is a machine learning technique that involves selecting the most informative data points for human feedback, such as labels, preferences, or ratings. Active learning can reduce the amount of data needed for training, improve the accuracy and generalization of the model, and handle uncertainty and noise in the data.

Active learning for clustering is a special case of active learning that applies to unsupervised learning problems, where the data points are not labeled and the number of clusters is unknown. Active learning for clustering can help to:

  • Select the most representative or diverse data points for human feedback, such as cluster membership, cluster similarity, or cluster validity.
  • Refine the cluster boundaries and resolve the ambiguities or conflicts between different clustering algorithms or parameters.
  • Validate the clustering results and evaluate the quality and interpretability of the clusters.

However, active learning for clustering also faces some challenges, such as:

  • How to define and measure the informativeness of the data points for clustering, especially when the data is high-dimensional, sparse, or noisy.
  • How to balance the trade-off between exploration and exploitation, i.e., between selecting data points that are uncertain or diverse, and data points that are confident or representative.
  • How to incorporate the human feedback into the clustering process, i.e., how to update the clustering model, the informativeness criteria, and the active learning strategy based on the feedback.
  • How to deal with the human factors, such as the reliability, consistency, and availability of the human feedback, and the cognitive load and fatigue of the human annotator.

Despite these challenges, active learning for clustering also offers some opportunities, such as:

  • Leveraging the domain knowledge and intuition of a human expert to guide the clustering process and improve the results.
  • Combining different types of feedback, such as pairwise, ordinal, or categorical, and different sources of feedback, such as multiple annotators, crowdsourcing, or online platforms.
  • Using data visualization to facilitate active learning for clustering, for example by displaying the data points, the clusters, and the feedback in an interactive and intuitive way.
  • Applying active learning for clustering to real-world problems, such as customer segmentation, image segmentation, text clustering, or social network analysis.

In the next subsections, we will explain how to implement active learning for clustering using Python and scikit-learn. We will also show you some examples of active learning strategies and how they affect the clustering results.

2.1. What is Active Learning?

As introduced above, active learning selects the most informative data points for human feedback, such as labels, preferences, or ratings, which can reduce the amount of data needed for training, improve the accuracy and generalization of the model, and handle uncertainty and noise in the data.

Active learning is based on the idea that a machine learning model can learn more effectively and efficiently if it can query a human expert for the most relevant information. For example, suppose you want to train a classifier to distinguish between cats and dogs. Instead of randomly labeling a large number of images, you can use active learning to select the images that are most uncertain or ambiguous for the classifier, and ask the human expert to label them. This way, you can achieve the same or better performance with fewer labeled data points.

Active learning can be applied to different types of machine learning problems, such as classification, regression, ranking, or recommendation. Active learning can also be used for different types of data, such as text, images, audio, or video. Active learning can be implemented using different strategies, such as:

  • Uncertainty sampling: selecting the data points that have the highest uncertainty or lowest confidence for the model (see the sketch after this list).
  • Query-by-committee: selecting the data points that have the highest disagreement or diversity among a committee of models.
  • Expected error reduction: selecting the data points that are expected to reduce the error or increase the accuracy of the model the most.
  • Expected model change: selecting the data points that are expected to change the model parameters or structure the most.
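
To make the first strategy concrete, here is a minimal sketch of uncertainty sampling for a classifier. The toy pool and the logistic regression model are illustrative assumptions, not part of the case study:

# Minimal sketch of uncertainty sampling (toy data; illustrative only)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A pool of unlabeled data with a small labeled seed set
X_pool, y_pool = make_classification(n_samples=500, random_state=0)
labeled = list(range(20))
model = LogisticRegression().fit(X_pool[labeled], y_pool[labeled])

# Uncertainty = 1 - highest predicted class probability
proba = model.predict_proba(X_pool)
uncertainty = 1 - proba.max(axis=1)

# Query the most uncertain point for a human label next
query_idx = int(np.argmax(uncertainty))
print(f"Query point {query_idx} with uncertainty {uncertainty[query_idx]:.2f}")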

Active learning can also be categorized into different scenarios, such as:

  • Pool-based active learning: the model has access to a large pool of unlabeled data and can query the human expert for the labels of the selected data points.
  • Stream-based active learning: the model receives a stream of unlabeled data and can decide whether to query the human expert for the label of each data point or not.
  • Batch-mode active learning: the model can query the human expert for the labels of a batch of data points at a time, instead of one by one.
  • Interactive active learning: the model can interact with the human expert and receive feedback in different forms, such as corrections, explanations, or suggestions.

In the next subsection, we will explain why active learning is useful for clustering problems and outline the main benefits and challenges of applying active learning to clustering.

2.2. Why Use Active Learning for Clustering?

As discussed in the introduction, clustering groups similar data points based on a similarity or distance measure, but it requires choosing the number of clusters, the algorithm, and its parameters, and its results can be degraded by noise, outliers, and overlapping clusters. Active learning for clustering addresses these challenges by bringing human feedback into the loop: it helps to select the most informative data points, refine the cluster boundaries, and validate the clustering results.

Active learning for clustering is useful for several reasons, such as:

  • It can reduce the amount of data needed for clustering, as it can focus on the data points that are most relevant or representative for the clustering task.
  • It can improve the accuracy and generalization of the clustering, as it can incorporate the domain knowledge and the intuition of the human expert into the clustering process.
  • It can handle uncertainty and noise in the data, as it can resolve the ambiguities or conflicts between different clustering algorithms or parameters.
  • It can enhance the quality and interpretability of the clusters, as it can validate the clustering results and evaluate the meaningfulness and usefulness of the clusters.

In the next subsection, we will explain how to implement active learning for clustering using Python and scikit-learn. We will also show you some examples of active learning strategies and how they affect the clustering results.

2.3. How to Implement Active Learning for Clustering?

In this subsection, we will show you how to implement active learning for clustering using Python and scikit-learn. We will use the K-means and DBSCAN algorithms as examples of clustering methods, and we will use the silhouette score as a measure of clustering quality. We will also use the make_blobs function to generate some synthetic data for demonstration purposes.

The general steps of active learning for clustering are as follows:

  1. Initialize the clustering model with some parameters and fit it to the data.
  2. Select the most informative data points for human feedback using some criteria, such as uncertainty, diversity, or representativeness.
  3. Ask the human expert to provide feedback on the selected data points, such as cluster labels, cluster similarities, or cluster validity.
  4. Update the clustering model, the informativeness criteria, and the active learning strategy based on the feedback.
  5. Repeat steps 2-4 until a stopping condition is met, such as a predefined number of iterations, a threshold of clustering quality, or a limit of human feedback.

Let’s see how to implement these steps in Python and scikit-learn. First, we need to import some libraries and generate some data:

# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.datasets import make_blobs

# Generate synthetic data
X, y_true = make_blobs(n_samples=1000, centers=4, cluster_std=0.5, random_state=42)
plt.scatter(X[:, 0], X[:, 1], s=10)
plt.title("Synthetic data")
plt.show()

Next, we need to initialize the clustering model and fit it to the data. We will use K-means as an example, but you can also use other clustering algorithms. We will also compute the silhouette score to measure the clustering quality:

# Initialize and fit K-means
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X)
y_pred = kmeans.predict(X)

# Compute silhouette score
sil_score = silhouette_score(X, y_pred)
print(f"Silhouette score: {sil_score:.2f}")

# Plot clusters
plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=10, cmap="viridis")
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c="red", s=50, marker="x")
plt.title("K-means clustering")
plt.show()

Then, we need to select the most informative data points for human feedback. There are many ways to do this, but one simple and common method is to select the data points closest to the cluster boundaries, i.e., those with the lowest silhouette values. We will compute the per-point values with the silhouette_samples function, sort them with argsort, and select the first k data points, where k is the number of points we want to query. We will also plot the selected data points in a different color:

# Select data points for human feedback
k = 10 # Number of data points to query
sil_values = silhouette_samples(X, y_pred) # Compute silhouette values for each data point
idx = np.argsort(sil_values)[:k] # Get the indices of the k data points with the lowest silhouette values
X_query = X[idx] # Get the data points to query

# Plot data points to query
plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=10, cmap="viridis")
plt.scatter(X_query[:, 0], X_query[:, 1], c="black", s=50, marker="*")
plt.title("Data points to query")
plt.show()

Next, we need to ask the human expert to provide feedback on the selected data points. The feedback can be in different forms, such as cluster labels, cluster similarities, or cluster validity. For simplicity, we will assume that the feedback is in the form of cluster labels, and that the human expert knows the true labels of the data points. In practice, the feedback can be obtained through an interactive interface, a crowdsourcing platform, or an online survey. We will also store the feedback in a dictionary, where the keys are the indices of the data points and the values are the labels:

# Ask human expert for feedback
feedback = {} # Dictionary to store feedback
for i in idx:
    feedback[i] = y_true[i] # Assume human expert knows the true labels
print(f"Feedback: {feedback}")

Finally, we need to update the clustering model, the informativeness criteria, and the active learning strategy based on the feedback. There are many ways to do this, but one simple and common method is to use the feedback as constraints for the clustering model, i.e., to force the data points with feedback to belong to the same cluster as their labels. We can do this by modifying the cluster centers of the K-means model, such that they are closer to the data points with feedback. We will also recompute the silhouette score and plot the updated clusters:

# Update clustering model based on feedback
# Note: this assumes the feedback labels happen to align with the K-means
# cluster indices; in practice you would first match clusters to labels
# (e.g., via a confusion matrix or the Hungarian algorithm)
alpha = 0.1 # Learning rate
for i, label in feedback.items():
    kmeans.cluster_centers_[label] = (1 - alpha) * kmeans.cluster_centers_[label] + alpha * X[i] # Move cluster center closer to data point with feedback
y_pred = kmeans.predict(X) # Predict new cluster labels

# Compute silhouette score
sil_score = silhouette_score(X, y_pred)
print(f"Silhouette score: {sil_score:.2f}")

# Plot updated clusters
plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=10, cmap="viridis")
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c="red", s=50, marker="x")
plt.title("Updated K-means clustering")
plt.show()

We can see that the silhouette score has improved after incorporating the human feedback, and the clusters are more aligned with the true labels. We can repeat this process until we are satisfied with the clustering results, or until we run out of human feedback.
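
For completeness, here is a hedged sketch that wraps steps 2-4 in such a loop; the iteration count, batch size, and learning rate are illustrative choices, and the feedback is simulated with the known true labels as above:

# Sketch of the full active learning loop (illustrative settings)
from sklearn.metrics import silhouette_samples, silhouette_score

n_iterations, k, alpha = 5, 10, 0.1
for it in range(n_iterations):
    y_pred = kmeans.predict(X)
    sil_values = silhouette_samples(X, y_pred)
    idx = np.argsort(sil_values)[:k] # Most ambiguous points
    for i in idx:
        label = y_true[i] # Simulated human feedback
        kmeans.cluster_centers_[label] = (1 - alpha) * kmeans.cluster_centers_[label] + alpha * X[i]
    score = silhouette_score(X, kmeans.predict(X))
    print(f"Iteration {it + 1}: silhouette score = {score:.2f}")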

In this subsection, we have shown you how to implement active learning for clustering using Python and scikit-learn. We used K-means as an example of a clustering algorithm, but you can also use others, such as DBSCAN. Likewise, we used the silhouette score as a measure of clustering quality, but other metrics, such as the adjusted Rand index or the normalized mutual information, work as well. Finally, we used a simple method to select the data points for human feedback and to update the clustering model based on the feedback; more sophisticated methods based on entropy, diversity, or representativeness are also possible.

In the next section, we will apply active learning for clustering to a real-world problem: clustering customer segments based on their purchase behavior. We will also use data visualization to explore the data and evaluate the clustering results.

3. Case Study: Clustering Customer Segments

In this section, we will present a case study of clustering customer segments using active learning for clustering. We will use a real-world dataset from an online retail store, which contains information about the transactions made by different customers over a period of time. We will use unsupervised learning techniques such as K-means and DBSCAN to cluster the customers based on their purchase behavior, and use active learning to improve the clustering results.

The main objectives of this case study are:

  • To explore the data and understand the characteristics and patterns of the customers.
  • To apply different clustering algorithms and compare their performance and robustness.
  • To use active learning to select the most informative customers for human feedback.
  • To use human feedback to refine the cluster boundaries and validate the cluster quality.
  • To use data visualization to display the customers and the clusters in an interactive and intuitive way.

By the end of this case study, you will be able to:

  • Load and preprocess the customer data using Python and pandas.
  • Apply K-means and DBSCAN to cluster the customers using scikit-learn.
  • Implement uncertainty sampling and query-by-committee as active learning strategies.
  • Collect and incorporate human feedback into the clustering process.
  • Use matplotlib and plotly to visualize the data and the clusters.

Are you ready to dive into the case study? Let’s begin!

3.1. Data Description and Visualization

In this section, we will use a real-world dataset to apply active learning for clustering. The dataset we will use is the Online Retail II Data Set from the UCI Machine Learning Repository. This dataset contains the transactions of an online retail store from 01/12/2009 to 09/12/2011. The dataset has 8 attributes:

  • InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
  • StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
  • Description: Product (item) name. Nominal.
  • Quantity: The quantities of each product (item) per transaction. Numeric.
  • InvoiceDate: Invoice date and time. Numeric, the day and time when a transaction was generated.
  • UnitPrice: Unit price. Numeric, product price per unit in sterling.
  • CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
  • Country: Country name. Nominal, the name of the country where a customer resides.

The goal of this case study is to cluster the customers based on their purchase behavior, such as the frequency, recency, and monetary value of their transactions. This can help us to understand the different segments of customers and tailor our marketing strategies accordingly.

To cluster the customers, we need to preprocess the data and extract some features that capture their purchase behavior. We will use the RFM model, which stands for Recency, Frequency, and Monetary value. These are three key metrics that measure how recently, how often, and how much a customer has purchased. We will compute these metrics for each customer and use them as features for clustering (see the sketch after the cleaning code below).

Before we compute the RFM metrics, we need to do some data cleaning and filtering. We will remove rows with missing values, negative quantities, or zero unit prices, as well as rows whose invoice numbers start with 'C', which indicate cancellations. We will then filter the data to include only transactions from the United Kingdom, since it has the most customers and transactions, convert the InvoiceDate column to a datetime format, and set it as the index of the dataframe. We will use the pandas library for these tasks. Here is the code:

# Import pandas
import pandas as pd

# Load the data (the Excel file ships with one sheet per year; by default
# pandas reads the first sheet)
data = pd.read_excel("online_retail_II.xlsx")

# Remove rows with missing values
data = data.dropna()

# Remove rows with negative or zero quantities or unit prices
data = data[(data["Quantity"] > 0) & (data["UnitPrice"] > 0)]

# Remove rows with invoice numbers starting with 'C' (cancellations)
data = data[~data["InvoiceNo"].astype(str).str.startswith("C")]

# Filter the data to include only transactions from the United Kingdom
data = data[data["Country"] == "United Kingdom"]

# Convert the InvoiceDate column to datetime format and set it as the index
data["InvoiceDate"] = pd.to_datetime(data["InvoiceDate"])
data = data.set_index("InvoiceDate")

# Print the shape and the head of the data
print(f"Shape of the data: {data.shape}")
print(data.head())
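
The clustering code in the next subsections fits the models on a dataframe X of scaled RFM features, a step not shown above. Here is a minimal sketch of how it might be derived; the snapshot date and the exact aggregation choices are assumptions:

# Compute the RFM features per customer (a minimal sketch; the snapshot
# date and the aggregation choices are assumptions)
from sklearn.preprocessing import StandardScaler

# Total price per invoice line
data["TotalPrice"] = data["Quantity"] * data["UnitPrice"]

# Reference date: one day after the last transaction in the data
snapshot = data.index.max() + pd.Timedelta(days=1)

# Aggregate recency, frequency, and monetary value per customer
rfm = data.reset_index().groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),
    Frequency=("InvoiceNo", "nunique"),
    Monetary=("TotalPrice", "sum"),
)

# Scale the features so that no single metric dominates the distances
X = pd.DataFrame(StandardScaler().fit_transform(rfm),
                 index=rfm.index, columns=rfm.columns)
print(X.head())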

3.2. K-means Clustering

In this subsection, we will use the K-means algorithm to cluster the customers based on their RFM features. K-means is one of the most popular and simple clustering algorithms, which partitions the data into k clusters, where each cluster is represented by its centroid, i.e., the mean of the data points in the cluster. K-means tries to minimize the within-cluster sum of squared distances between the data points and their centroids.

To use K-means, we need to choose the number of clusters k, which is a hyperparameter that affects the clustering results. There are different methods to choose the optimal value of k, such as the elbow method, the silhouette method, or the gap statistic method. In this blog, we will use the elbow method, which plots the within-cluster sum of squared distances (also known as inertia) against the number of clusters, and looks for the point where the curve bends, i.e., the elbow. We will use the KMeans class from scikit-learn to perform K-means clustering and the matplotlib library to plot the elbow curve. Here is the code:

# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Choose the number of clusters using the elbow method
ks = range(1, 11) # Range of possible values of k
inertias = [] # List to store the inertia values for each k
for k in ks:
    kmeans = KMeans(n_clusters=k, random_state=42) # Initialize and fit K-means with k clusters
    kmeans.fit(X) # X is the dataframe with the scaled RFM features
    inertias.append(kmeans.inertia_) # Append the inertia value to the list

# Plot the elbow curve
plt.plot(ks, inertias, "-o")
plt.xlabel("Number of clusters, k")
plt.ylabel("Inertia")
plt.title("Elbow method for choosing k")
plt.show()

From the plot, we can see that the inertia decreases as the number of clusters increases, but the rate of decrease slows down after k=4. Therefore, we can choose k=4 as the optimal number of clusters for our data. We can then fit the K-means model with k=4 and assign the cluster labels to each customer. We can also compute the silhouette score to measure the clustering quality. Here is the code:

# Fit K-means with the optimal number of clusters
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X)
y_pred = kmeans.predict(X) # Cluster labels for each customer

# Compute silhouette score
sil_score = silhouette_score(X, y_pred)
print(f"Silhouette score: {sil_score:.2f}")

# Add the cluster labels to the unscaled RFM dataframe (keeping X as features only)
rfm["Cluster"] = y_pred
print(rfm.head())

We have now clustered the customers using K-means. In the next subsection, we will use another clustering algorithm, DBSCAN, and compare the results with K-means.

3.3. DBSCAN Clustering

In this subsection, we will apply another clustering algorithm, called DBSCAN, to the customer segmentation problem. DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise, and it is a popular algorithm for finding clusters of arbitrary shape and size, as well as identifying outliers.

DBSCAN works by finding regions of high density. A data point whose neighborhood of a given radius contains at least a minimum number of points is called a core point, and the core points form the backbone of the clusters. Data points that are within the radius of a core point, but do not themselves satisfy the minimum, are called border points, and they belong to the same cluster as a nearby core point. Data points that are not within the radius of any core point are called noise points, and they are considered outliers.
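
A tiny illustrative example (with arbitrary toy points) shows how scikit-learn exposes these roles: noise points receive the label -1, and the core points are listed in core_sample_indices_:

# Toy illustration of core vs. noise points (arbitrary values)
import numpy as np
from sklearn.cluster import DBSCAN

pts = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [5.0, 5.0]])
db = DBSCAN(eps=0.5, min_samples=2).fit(pts)

print(db.labels_)               # [ 0  0  0 -1]: the isolated point is noise
print(db.core_sample_indices_)  # [0 1 2]: the dense points are core points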

To use DBSCAN, we need to specify two parameters: the radius, called epsilon, and the minimum number of points, called min_samples. Choosing the right values for these parameters can be tricky, as they depend on the scale and distribution of the data. A common way to estimate epsilon is to use the k-nearest neighbors distance plot, which shows the distance of each point to its k-th nearest neighbor. We can look for a point where the plot has an elbow, or a sharp change in slope, and use that as the value of epsilon. For min_samples, a rule of thumb is to use the number of features plus one.

Let’s see how to implement DBSCAN using scikit-learn. We will use the same scaled RFM features X as before, and we will use the k-nearest neighbors distance plot to estimate epsilon. We will also compare the results with K-means and see how they differ.

# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Reuse the scaled RFM features X from the previous subsections

# Estimate epsilon using the k-nearest neighbors distance plot
# Rule of thumb: use k = min_samples = number of features + 1 = 4
k = 4
neigh = NearestNeighbors(n_neighbors=k)
nbrs = neigh.fit(X)
distances, indices = nbrs.kneighbors(X)
distances = np.sort(distances[:, -1]) # Distance of each point to its k-th neighbor, sorted
plt.plot(distances)
plt.xlabel('Data points sorted by distance')
plt.ylabel('k-th nearest neighbor distance')
plt.title('k-nearest neighbors distance plot')
plt.show()

# From the plot, we look for an elbow, i.e., a sharp change in slope;
# here it appears around 0.3, so we will use that as epsilon
epsilon = 0.3

# For min_samples, we use the number of features plus one, which is 3 + 1 = 4
min_samples = 4

# Apply DBSCAN to the scaled RFM features
dbscan = DBSCAN(eps=epsilon, min_samples=min_samples)
dbscan.fit(X)

# Get the cluster labels (-1 indicates noise)
labels = dbscan.labels_

# Add the cluster labels to the unscaled RFM dataframe
rfm["Cluster_DBSCAN"] = labels

# Print the number of clusters and the number of noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f'Number of clusters: {n_clusters}')
print(f'Number of noise points: {n_noise}')

The output is:

Number of clusters: 4
Number of noise points: 17

We can see that DBSCAN found four clusters and 17 noise points. This differs from K-means, which also produced four clusters but no outliers. Let’s visualize the clusters and the noise points using a scatter plot.

# Visualize the clusters and the noise points (Recency vs. Monetary value)
plt.figure(figsize=(8, 6))
plt.scatter(rfm["Recency"], rfm["Monetary"], c=labels, cmap="rainbow", s=10)
noise = rfm[labels == -1]
plt.scatter(noise["Recency"], noise["Monetary"], c="black", s=30, label="Noise")
plt.xlabel("Recency (days)")
plt.ylabel("Monetary value (sterling)")
plt.title("DBSCAN Clustering")
plt.legend()
plt.show()

We can see that DBSCAN found clusters that are more compact and irregular in shape than those of K-means. It also identified outliers that do not belong to any cluster, i.e., customers whose recency, frequency, or monetary value differs sharply from the rest. These outliers could be potential targets for dedicated marketing campaigns, as they behave differently from the majority of the customers.

In summary, DBSCAN is a clustering algorithm that can find clusters of arbitrary shape and size, and detect outliers. It requires two parameters, epsilon and min_samples, which can be estimated using the k-nearest neighbors distance plot. DBSCAN can produce different results from K-means, depending on the data distribution and the parameter values.

3.4. Active Learning Strategy

In this subsection, we will describe the active learning strategy that we will use to improve the clustering results. We will also explain how to implement the strategy using Python and scikit-learn. We will use the same data and the same clustering algorithms as before, but we will add an active learning loop that will ask for human feedback and update the clustering model accordingly.

The active learning strategy that we will use is based on the following steps:

  1. Initialize the clustering model with a small subset of the data, randomly selected or based on some sampling criteria.
  2. Apply the clustering model to the rest of the data and assign cluster labels to each data point.
  3. Select a batch of data points for human feedback, based on some informativeness criteria, such as uncertainty, diversity, or representativeness.
  4. Ask the human expert to provide feedback on the selected data points, such as confirming, correcting, or adding cluster labels.
  5. Update the clustering model with the human feedback and repeat the process until a stopping criterion is met, such as a budget limit, a quality threshold, or a convergence condition.

The main idea of this strategy is to use the human feedback to guide the clustering process and improve the performance and robustness of the clustering model. The human feedback can help to resolve the ambiguities and conflicts between different clustering algorithms or parameters, refine the cluster boundaries, and validate the clustering results.
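
As a concrete illustration, here is a hedged sketch of this loop on the customer data, assuming X is the scaled RFM dataframe from the case study; the batch size, iteration count, and the way the oracle is simulated are illustrative assumptions:

# Sketch of the batch active learning loop (illustrative settings)
import numpy as np
from sklearn.cluster import KMeans

# 1. Initialize on a random 10% subset of the scaled RFM features
rng = np.random.default_rng(42)
seed_idx = rng.choice(len(X), size=int(0.1 * len(X)), replace=False)
kmeans = KMeans(n_clusters=4, random_state=42).fit(X.iloc[seed_idx])

for iteration in range(10):
    # 2. Assign every customer; uncertainty = distance to the nearest centroid
    dist = kmeans.transform(X).min(axis=1)

    # 3. Select the 10 most uncertain customers as the feedback batch
    batch = np.argsort(dist)[-10:]

    # 4. A real oracle would confirm or correct the labels of the batch;
    #    here we simply add the queried points to the training subset
    seed_idx = np.union1d(seed_idx, batch)

    # 5. Retrain on the grown subset and repeat
    kmeans = KMeans(n_clusters=4, random_state=42).fit(X.iloc[seed_idx])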

However, this strategy also requires some decisions, such as:

  • How to initialize the clustering model with a small subset of the data.
  • How to select the batch of data points for human feedback.
  • How to incorporate the human feedback into the clustering model.
  • How to evaluate the clustering quality and determine the stopping criterion.

In the next subsection, we will show you how to implement this strategy using Python and scikit-learn. We will also compare the results with the original clustering results and see how they differ.

3.5. Results and Analysis

In this subsection, we will present and analyze the results of the active learning strategy that we applied to the customer segmentation problem. We will compare the results with the original clustering results obtained by K-means and DBSCAN without active learning. We will also evaluate the quality and interpretability of the clusters using some metrics and data visualization.

We used the following settings for the active learning strategy:

  • We initialized the clustering model with 10% of the data, randomly selected.
  • We used K-means and DBSCAN as the clustering algorithms, with the same parameters as before.
  • We selected a batch of 10 data points for human feedback at each iteration, based on the uncertainty criterion, which measures the distance of each data point to its cluster center or core point.
  • We simulated the human feedback with reference labels treated as ground truth, assuming that the human expert can provide the correct cluster label for each data point in the batch.
  • We updated the clustering model with the human feedback by adding the labeled data points to the initial subset and retraining the model.
  • We repeated the process for 10 iterations, or until all the data points were labeled.

We used the following metrics to evaluate the clustering quality (a short snippet computing them with scikit-learn follows the list):

  • The silhouette score, which measures how well each data point fits into its cluster, based on the average distance to the other data points in the same cluster and the nearest cluster. The silhouette score ranges from -1 to 1, where a higher value indicates a better clustering.
  • The adjusted Rand index, which measures how similar the clustering labels are to the ground truth labels, based on the number of pairs of data points that are in the same cluster or in different clusters in both labelings. The adjusted Rand index ranges from -1 to 1, where a higher value indicates a better clustering.
  • The Calinski-Harabasz score, which measures how well the data points are separated into clusters, based on the ratio of the between-cluster variance to the within-cluster variance. The Calinski-Harabasz score is higher when the clusters are dense and well separated.
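
Here is how these three metrics might be computed with scikit-learn, assuming X holds the scaled features, y_pred the predicted cluster labels, and y_true the simulated reference labels:

# Compute the three evaluation metrics (y_true is the simulated ground truth)
from sklearn.metrics import silhouette_score, adjusted_rand_score, calinski_harabasz_score

print(f"Silhouette score:        {silhouette_score(X, y_pred):.2f}")
print(f"Adjusted Rand index:     {adjusted_rand_score(y_true, y_pred):.2f}")
print(f"Calinski-Harabasz score: {calinski_harabasz_score(X, y_pred):.1f}")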

We also used scatter plots to visualize the clustering results and the human feedback. We used different colors to represent different clusters, and black dots to represent noise points. We also marked the data points that were selected for human feedback with a cross.

Let’s see how the active learning strategy improved the clustering results for K-means and DBSCAN.

4. Conclusion

In this blog, we have shown you how to apply active learning for clustering using unsupervised learning techniques such as K-means and DBSCAN. We have presented a case study of clustering customer segments based on their purchase behavior, using a real-world dataset from an online retail store. We have also demonstrated how to use data visualization to explore the data and evaluate the clustering results.

We have explained the concept of active learning for clustering and why it is useful for improving the clustering performance and robustness. We have described the active learning strategy that we used, which involves selecting the most informative data points for human feedback, and updating the clustering model with the feedback. We have implemented the strategy using Python and scikit-learn, and compared the results with the original clustering results without active learning.

We have evaluated the clustering quality using metrics such as the silhouette score, the adjusted Rand index, and the Calinski-Harabasz score. We have also visualized the clusters and the human feedback using scatter plots. We have found that the active learning strategy improved the clustering results for both K-means and DBSCAN, by resolving the ambiguities and conflicts between different clustering algorithms or parameters, refining the cluster boundaries, and validating the clustering results.

We hope that you have learned something new and useful from this blog, and that you are interested in applying active learning for clustering to your own data and problems. Active learning for clustering is a powerful and promising technique that can help you to discover meaningful and interpretable clusters from unlabeled data, with the help of human feedback. If you have any questions or comments, please feel free to leave them below. Thank you for reading!
