Machine Learning Evaluation Mastery: How to Use Silhouette Score and Calinski-Harabasz Index for Clustering Problems

This blog will teach you how to use silhouette score and Calinski-Harabasz index to evaluate and compare clustering performance in Python.

1. Introduction

Clustering is one of the most common and useful techniques in machine learning. It allows you to group similar data points together based on some criteria, such as distance, density, or connectivity. Clustering can help you discover hidden patterns, identify outliers, and reduce dimensionality in your data.

But how do you know if your clustering results are good or not? How do you compare different clustering algorithms or parameters? How do you measure the quality of your clusters?

In this blog, you will learn how to use two popular metrics to evaluate and compare clustering performance: silhouette score and Calinski-Harabasz index. These metrics can help you assess how well your data points are assigned to clusters, how compact and separated your clusters are, and how different algorithms or parameter choices compare on your data.

You will also learn how to use these metrics in Python, using the scikit-learn library. You will see how to apply them to different clustering problems, such as k-means, hierarchical, and DBSCAN clustering. You will also learn how to interpret and compare the results, and how to choose the best clustering method for your data.

By the end of this blog, you will have a solid understanding of how to use silhouette score and Calinski-Harabasz index for clustering problems, and how to improve your cluster quality and performance.

Are you ready to dive into the world of clustering evaluation? Let’s get started!

2. What is Clustering and Why is it Important?

Clustering is a type of unsupervised machine learning, which means that you do not have any predefined labels or categories for your data. Instead, you want to find the natural groups or clusters that exist in your data based on some similarity or dissimilarity measure.

For example, suppose you have a dataset of customers and you want to segment them based on their preferences, behavior, or demographics. You can use clustering to find the different types of customers that exist in your data, such as loyal, occasional, or new customers. This can help you tailor your marketing strategies, products, or services to each customer segment.

Another example is image segmentation, where you want to divide an image into regions that have similar characteristics, such as color, texture, or shape. You can use clustering to find the boundaries of different objects or regions in an image, such as a person, a car, or a sky. This can help you with tasks such as object detection, face recognition, or image compression.

As you can see, clustering has many applications and benefits in various domains, such as business, science, engineering, and art. Clustering can help you:

  • Discover hidden patterns and insights in your data
  • Identify outliers and anomalies in your data
  • Reduce dimensionality and complexity in your data
  • Visualize and understand your data better
  • Improve the performance and accuracy of other machine learning models

But how do you perform clustering? What are the different methods and algorithms for clustering? And most importantly, how do you evaluate and compare clustering results? These are the questions that we will answer in the next sections.

3. How to Evaluate Clustering Performance?

One of the challenges of clustering is that there is no single or definitive way to evaluate the quality and performance of your clustering results. Unlike supervised learning, where you can compare your predictions with the true labels, unsupervised learning does not have a ground truth to validate your results.

However, this does not mean that you cannot measure or compare your clustering results at all. There are several methods and metrics that can help you assess how well your data points are clustered, how compact and separated your clusters are, and how robust your clustering results are to noise and outliers.

Some of these methods and metrics are:

  • Internal methods: These methods use only the data and the clustering results to evaluate the clustering performance. They do not require any external information or labels. They are useful when you do not have any prior knowledge or expectations about your data or clusters. Examples of internal methods are silhouette score and Calinski-Harabasz index, which we will discuss in detail in the next sections.
  • External methods: These methods compare the clustering results against external information, typically ground-truth class labels. They are useful when such labels are available, for example from a benchmark dataset or manual annotation. Examples of external methods are adjusted Rand index, normalized mutual information, and homogeneity and completeness. A sketch contrasting internal and external metrics follows this list.
  • Relative methods: These methods compare the clustering performance of different clustering algorithms or parameters on the same data. They are useful when you want to find the best clustering method or parameter for your data. Examples of relative methods are elbow method, gap statistic, and stability analysis.
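
To make the distinction between internal and external methods concrete, here is a minimal sketch contrasting one metric of each kind (the synthetic data and parameter values below are our own choices, purely for illustration):

# External metrics need ground-truth labels, which real clustering problems
# usually lack; internal metrics need only the data and the predicted labels
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))          # internal: data + predicted labels
print(adjusted_rand_score(y_true, labels))  # external: true labels + predicted labels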

In this blog, we will focus on two internal methods: silhouette score and Calinski-Harabasz index. These are two of the most widely used and intuitive metrics for clustering evaluation. They can help you answer questions such as:

  • How well are the data points assigned to their clusters?
  • How compact and separated are the clusters?
  • How many clusters are optimal for the data?
  • How do different clustering methods or parameters affect the clustering performance?

In the next sections, we will explain what these metrics are, how they are calculated, and how they can be used in Python.

3.1. Silhouette Score

The silhouette score is a metric that measures how well each data point fits into its assigned cluster. It is based on two distances: the average distance between a data point and all other points in the same cluster (called the intra-cluster distance), and the average distance between the data point and all points in the nearest neighboring cluster, that is, the cluster other than its own that minimizes this average (called the nearest-cluster distance).

The silhouette score for each data point is calculated as:

$s = \frac{b - a}{\max(a, b)}$

where a is the intra-cluster distance and b is the nearest-cluster distance. The silhouette score ranges from -1 to 1: a value close to 1 indicates that the data point is well matched to its cluster and poorly matched to neighboring clusters, a value near 0 indicates that the point lies between two overlapping clusters, and a negative value suggests that the point may have been assigned to the wrong cluster.

The silhouette score for the entire dataset is the average of the silhouette scores of all data points. The higher the silhouette score, the better the clustering quality, as it means that the clusters are compact and separated.
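
As a quick illustration, here is a minimal sketch on a tiny toy dataset (the data values are ours, chosen purely for demonstration). It also shows silhouette_samples, which returns the per-point scores that the overall score averages:

# A tiny toy example: two compact, well-separated groups of points
import numpy as np
from sklearn.metrics import silhouette_score, silhouette_samples

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels = np.array([0, 0, 0, 1, 1, 1])

print(silhouette_score(X, labels))    # mean score over all points, close to 1 here
print(silhouette_samples(X, labels))  # one score per data point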

Some of the advantages of the silhouette score are:

  • It is easy to interpret and intuitive.
  • It can be used for any clustering algorithm and any number of clusters.
  • It can be used to compare different clustering results and find the optimal number of clusters.

Some of the limitations of the silhouette score are:

  • It can be computationally expensive, as it requires calculating the distances between all pairs of data points. A subsampling workaround is shown after this list.
  • It can be sensitive to noise and outliers, as they can affect the distances between data points.
  • It can be biased towards convex clusters, as it favors compact, roughly spherical, well-separated clusters and can undervalue correct but irregularly shaped (e.g., elongated) clusters.
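
One way to mitigate the computational cost on large datasets is to estimate the score on a random subsample, which silhouette_score supports directly through its sample_size parameter (the sizes below are our own choices for illustration):

# Estimate the silhouette score on a random subsample to reduce the
# pairwise-distance computations on large datasets
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=20000, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
print(silhouette_score(X, labels, sample_size=1000, random_state=42))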

In the next section, we will see how to use the silhouette score in Python, using the scikit-learn library.

3.2. Calinski-Harabasz Index

The Calinski-Harabasz index is another metric that measures how well the data points are clustered. It is based on the ratio of the between-cluster variance to the within-cluster variance. The between-cluster variance measures how far the clusters are from each other, and the within-cluster variance measures how close the data points are within each cluster.

The Calinski-Harabasz index for a given clustering result is calculated as:

$CH = \frac{B / (k - 1)}{W / (n - k)}$

where B is the between-cluster dispersion: the sum of squared distances between each cluster centroid and the global centroid, with each term weighted by the number of points in that cluster. W is the within-cluster dispersion: the sum of squared distances between each data point and its cluster centroid. k is the number of clusters and n is the number of data points. The Calinski-Harabasz index is also known as the variance ratio criterion.
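
To make the formula concrete, here is a minimal sketch that computes the index by hand and checks it against scikit-learn's implementation (the toy data is ours, purely for illustration):

# Compute the Calinski-Harabasz index by hand and verify against scikit-learn
import numpy as np
from sklearn.metrics import calinski_harabasz_score

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])

n, k = len(X), len(np.unique(labels))
global_centroid = X.mean(axis=0)
B = W = 0.0
for c in np.unique(labels):
    members = X[labels == c]
    centroid = members.mean(axis=0)
    B += len(members) * np.sum((centroid - global_centroid) ** 2)  # between-cluster, size-weighted
    W += np.sum((members - centroid) ** 2)                         # within-cluster

ch = (B / (k - 1)) / (W / (n - k))
print(ch, calinski_harabasz_score(X, labels))  # the two values should agree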

The higher the Calinski-Harabasz index, the better the clustering quality, as it means that the clusters are well separated and compact. The Calinski-Harabasz index can be used to compare different clustering results and find the optimal number of clusters, by choosing the one that maximizes the index.

Some of the advantages of the Calinski-Harabasz index are:

  • It is simple and fast to compute.
  • It can be used for any clustering algorithm and any number of clusters.
  • It is often less sensitive to noise and outliers than the silhouette score.

Some of the limitations of the Calinski-Harabasz index are:

  • It can be biased towards convex, roughly spherical clusters of similar size and density.
  • It can be affected by the scaling of the data, as it depends on the distances between data points.
  • It can be difficult to interpret, as it does not have a clear range or threshold.

In the next section, we will see how to use the Calinski-Harabasz index in Python, using the scikit-learn library.

4. How to Use Silhouette Score and Calinski-Harabasz Index in Python?

In this section, we will show you how to use the silhouette score and the Calinski-Harabasz index in Python, using the scikit-learn library. We will use a sample dataset of two-dimensional points that can be clustered into four groups; if you do not have such a dataset at hand, we show how to generate a comparable one below.

First, we need to import the necessary libraries and load the dataset:

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Load the dataset
df = pd.read_csv("clustering.csv")
X = df[["x", "y"]].values
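
If you do not have a suitable CSV file, you can generate comparable synthetic data instead (the sample count and spread below are our own assumptions):

# Alternative: generate a comparable two-dimensional dataset with four groups
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=42)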

Next, we need to apply some clustering algorithms to the data and obtain the cluster labels. We will use three different algorithms: k-means, hierarchical, and DBSCAN. We will also use different values for the number of clusters or the parameters of the algorithms. You can learn more about these algorithms and their parameters from the scikit-learn documentation.

# Apply k-means clustering with different values of k
# (n_init is set explicitly for consistent behavior across scikit-learn versions)
kmeans_2 = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
kmeans_3 = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
kmeans_4 = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
kmeans_5 = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)

# Apply hierarchical clustering with different values of k
hier_2 = AgglomerativeClustering(n_clusters=2).fit(X)
hier_3 = AgglomerativeClustering(n_clusters=3).fit(X)
hier_4 = AgglomerativeClustering(n_clusters=4).fit(X)
hier_5 = AgglomerativeClustering(n_clusters=5).fit(X)

# Apply DBSCAN clustering with different values of eps
dbscan_05 = DBSCAN(eps=0.5).fit(X)
dbscan_1 = DBSCAN(eps=1).fit(X)
dbscan_15 = DBSCAN(eps=1.5).fit(X)
dbscan_2 = DBSCAN(eps=2).fit(X)

Now, we can calculate the silhouette score and the Calinski-Harabasz index for each clustering result, using the silhouette_score and calinski_harabasz_score functions from scikit-learn. We will collect the fitted models and their scores in dictionaries for convenience.

# Collect the fitted models in a dictionary
models = {"kmeans_2": kmeans_2, "kmeans_3": kmeans_3, "kmeans_4": kmeans_4, "kmeans_5": kmeans_5,
          "hier_2": hier_2, "hier_3": hier_3, "hier_4": hier_4, "hier_5": hier_5,
          "dbscan_05": dbscan_05, "dbscan_1": dbscan_1, "dbscan_15": dbscan_15, "dbscan_2": dbscan_2}

# Calculate the silhouette score and the Calinski-Harabasz index for each result.
# Both metrics require at least 2 distinct labels, so degenerate results are skipped
# (note that DBSCAN's noise label, -1, is treated as its own cluster by these metrics).
scores = {}
for name, model in models.items():
    if len(set(model.labels_)) > 1:
        scores[name] = (silhouette_score(X, model.labels_), calinski_harabasz_score(X, model.labels_))

Finally, we can print the scores and plot the clustering results to see how they compare. We will use a helper function to plot the data points with different colors according to their cluster labels.

# Define a helper function to plot the clustering results
def plot_clusters(X, labels, title):
    plt.figure(figsize=(8, 6))
    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="rainbow")
    plt.title(title)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()

# Print the scores and plot the clustering results
for name, (s_score, ch_score) in scores.items():
    print(f"{name}: Silhouette Score = {s_score:.2f}, Calinski-Harabasz Index = {ch_score:.2f}")
    plot_clusters(X, models[name].labels_, name)  # color each point by its cluster label

5. How to Interpret and Compare the Results?

Now that you have learned how to use the silhouette score and the Calinski-Harabasz index in Python, you might be wondering how to interpret and compare the results. In this section, we will give you some tips and examples on how to do that.

First, let’s review the meaning and range of these metrics:

  • The silhouette score measures how well each data point fits into its assigned cluster. It ranges from -1 to 1, where a high value indicates that the data point is well matched to its cluster and poorly matched to neighboring clusters, and a low value indicates the opposite.
  • The Calinski-Harabasz index measures the ratio of the between-cluster variance to the within-cluster variance. It does not have a clear range or threshold, but a higher value indicates that the clusters are well separated and compact.

Second, let’s look at some general guidelines on how to use these metrics:

  • Both metrics can be used to compare different clustering results and find the optimal number of clusters, by choosing the one that maximizes the metric (see the sketch after this list).
  • Both metrics can be affected by the scaling of the data, the shape and size of the clusters, and the presence of noise and outliers. Therefore, it is important to preprocess the data and remove outliers before applying clustering and evaluation.
  • Both metrics are not absolute measures of clustering quality, but relative indicators. Therefore, it is important to use them in conjunction with other methods, such as visual inspection, domain knowledge, and external validation.
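
Putting the first two guidelines into practice, here is a minimal, self-contained sketch that standardizes the data and then sweeps the number of clusters, reporting both metrics for each value of k (the synthetic data and the range of k are our own choices for illustration):

# Standardize the features, then sweep k and compare both metrics for each value
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)  # scaling affects both metrics

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    s = silhouette_score(X_scaled, labels)
    ch = calinski_harabasz_score(X_scaled, labels)
    print(f"k={k}: silhouette={s:.3f}, CH={ch:.1f}")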

Third, let’s see some examples of how to interpret and compare the results for the sample dataset, based on the scores and plots produced in the previous section:

  • The optimal number of clusters for this dataset seems to be 4, as it corresponds to the natural groups that we can see in the data. This is confirmed by the highest silhouette score and Calinski-Harabasz index for k-means and hierarchical clustering with k=4.
  • k-means and hierarchical clustering perform similarly on this dataset, as they produce similar cluster shapes and sizes. However, k-means seems to be slightly better, as it achieves slightly higher silhouette scores and Calinski-Harabasz indices for most values of k.
  • DBSCAN clustering performs poorly on this dataset, as it marks many points as noise and produces irregular clusters. This is reflected in the low silhouette score and Calinski-Harabasz index for most values of eps. This is because DBSCAN is sensitive to the choice of eps and struggles when cluster densities vary, and both metrics additionally penalize the non-convex clusters it tends to produce.

Therefore, we can conclude that k-means clustering with k=4 is the best clustering method and parameter for this dataset, based on the silhouette score and the Calinski-Harabasz index.

Of course, this is just one example of how to interpret and compare the results using these metrics. You can apply the same principles and methods to other datasets and clustering problems, and see how they work for you.

In the next and final section, we will summarize the main points of this blog and give you some suggestions for further learning.

6. Conclusion

In this blog, you have learned how to use two popular metrics for clustering evaluation: silhouette score and Calinski-Harabasz index. These metrics can help you measure and compare the quality and performance of your clustering results, and find the optimal number of clusters for your data.

You have also learned how to use these metrics in Python, using the scikit-learn library. You have seen how to apply them to different clustering algorithms and parameters, such as k-means, hierarchical, and DBSCAN clustering. You have also learned how to interpret and compare the results, using a sample dataset of two-dimensional points.

By following this blog, you have gained a solid understanding of how to use silhouette score and Calinski-Harabasz index for clustering problems, and how to improve your cluster quality and performance.

We hope that you have enjoyed this blog and found it useful and informative. If you want to learn more about clustering evaluation and other machine learning topics, the scikit-learn documentation on clustering performance evaluation is a good place to start.

Thank you for reading this blog and happy clustering!
