Machine Learning Evaluation Mastery: How to Use Davies-Bouldin Index and Dunn Index for Clustering Problems

This blog teaches you how to use Davies-Bouldin index and Dunn index for clustering problems. You will learn how to calculate and interpret these indices and see some examples.

1. Introduction

Clustering is a popular machine learning technique that groups data points based on their similarity. It is useful for exploratory data analysis, dimensionality reduction, and finding patterns in complex data sets. However, how do you know if your clustering results are good or not? How do you compare different clustering algorithms or parameters?

One way to answer these questions is to use cluster validity indices. These are numerical measures that evaluate the quality of a clustering solution based on some criteria. There are many cluster validity indices available, but in this blog, we will focus on two of them: Davies-Bouldin index and Dunn index. These indices are widely used and easy to understand and implement.

In this blog, you will learn:

  • What are Davies-Bouldin index and Dunn index and how do they measure cluster validity?
  • How to calculate Davies-Bouldin index and Dunn index for any clustering solution?
  • How to interpret Davies-Bouldin index and Dunn index and what are their advantages and disadvantages?
  • How to use Davies-Bouldin index and Dunn index for clustering problems with some examples?

By the end of this blog, you will have a solid understanding of Davies-Bouldin index and Dunn index and how to use them for clustering problems. You will also be able to apply these indices to your own data sets and evaluate your clustering results.

Are you ready to master Davies-Bouldin index and Dunn index? Let’s get started!

2. What are Davies-Bouldin index and Dunn index?

Davies-Bouldin index and Dunn index are two cluster validity indices that measure the quality of a clustering solution. They are based on two criteria: intra-cluster similarity and inter-cluster dissimilarity. Intra-cluster similarity means how similar the data points within a cluster are, and inter-cluster dissimilarity means how different the data points in different clusters are.

A good clustering solution should have high intra-cluster similarity and high inter-cluster dissimilarity. This means that the data points within a cluster should be very similar to each other, and the data points in different clusters should be very different from each other. This way, the clusters are well-separated and well-defined.

Davies-Bouldin index and Dunn index quantify these criteria in different ways. The Davies-Bouldin index is based on the ratio of within-cluster scatter to between-cluster separation, so the lower the Davies-Bouldin index, the better the clustering solution. The Dunn index is based on the ratio of the smallest between-cluster separation to the largest within-cluster scatter, so the higher the Dunn index, the better the clustering solution.

Why do we need these indices? Because clustering is an unsupervised machine learning technique, which means that we do not have any labels or ground truth to compare our results with. Therefore, we need some objective measures to assess the quality of our clustering solution and choose the best one among different options.

In the next sections, we will see how to calculate and interpret Davies-Bouldin index and Dunn index in more detail. We will also see some examples of how to use them for clustering problems.

2.1. Davies-Bouldin index

The Davies-Bouldin index (DBI) is a cluster validity index that measures the average similarity between each cluster and its most similar cluster. The similarity is based on two factors: the within-cluster distance and the between-cluster distance. The within-cluster distance is the average distance of the data points in a cluster to their cluster center, also known as the cluster centroid. The between-cluster distance is the distance between the centroids of two clusters.

The DBI is calculated as follows:

$$DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{s_i + s_j}{d_{ij}} \right)$$

where:

  • $k$ is the number of clusters
  • $s_i$ is the average within-cluster distance of cluster $i$
  • $s_j$ is the average within-cluster distance of cluster $j$
  • $d_{ij}$ is the between-cluster distance of cluster $i$ and $j$

The DBI is the average of the maximum similarity ratios for each cluster. The similarity ratio is the sum of the within-cluster distances divided by the between-cluster distance. A high similarity ratio means that the clusters are not well-separated or well-defined. Therefore, a lower DBI indicates a better clustering solution.

How do you use the DBI to evaluate your clustering solution? You can calculate the DBI for different clustering algorithms or parameters and compare them: the one with the lowest DBI is the best one. You can also use the DBI to determine the optimal number of clusters for your data set. Plot the DBI values for different numbers of clusters and look for the number of clusters where the DBI reaches its minimum, or where it drops sharply and then levels off. This is the number of clusters that best balances the trade-off between intra-cluster similarity and inter-cluster dissimilarity.
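As a quick illustration, here is a minimal sketch of such a comparison in Python, using scikit-learn's built-in davies_bouldin_score function. The synthetic data set and the two candidate algorithms are only illustrative choices, not part of a fixed recipe.

from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# illustrative synthetic data set with four blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# two candidate clustering solutions with the same number of clusters
kmeans_labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=4).fit_predict(X)

# the solution with the lower DBI is preferred
print("KMeans DBI:       ", davies_bouldin_score(X, kmeans_labels))
print("Agglomerative DBI:", davies_bouldin_score(X, agglo_labels))

Later in this blog we will also implement the DBI from scratch, which makes it clear exactly what the score measures.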

In the next section, we will look at the Dunn index, another cluster validity index based on the same criteria.

2.2. Dunn index

The Dunn index (DI) is another cluster validity index that measures the quality of a clustering solution. It is based on the same criteria as the Davies-Bouldin index, but in a different way. The DI is calculated as follows:

$$DI = \frac{\min_{i \neq j} d_{ij}}{\max_{k} s_k}$$

where:

  • $d_{ij}$ is the between-cluster distance of cluster $i$ and $j$
  • $s_k$ is the average within-cluster distance of cluster $k$

The DI is the ratio of the minimum between-cluster distance and the maximum within-cluster distance. A high DI means that the clusters are well-separated and well-defined. Therefore, a higher DI indicates a better clustering solution.

How do you use the DI to evaluate your clustering solution? You can use the same methods as for the DBI. You can calculate the DI for different clustering algorithms or parameters and compare them: the one with the highest DI is the best one. You can also use the DI to determine the optimal number of clusters for your data set. Plot the DI values for different numbers of clusters and look for the number of clusters where the DI reaches its maximum, or where it jumps sharply before leveling off. This is the number of clusters that best balances the trade-off between intra-cluster similarity and inter-cluster dissimilarity.
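As a rough sketch, selecting the number of clusters by maximizing the DI could look like the code below. Scikit-learn does not ship a Dunn index function, so this assumes a dunn_index(X, labels) helper like the one we implement in Section 3; the data set and the range of k values are only examples.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# illustrative synthetic data set with four blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

best_k, best_di = None, -1.0
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    di = dunn_index(X, labels)  # helper defined in Section 3 of this blog
    if di > best_di:
        best_k, best_di = k, di

print("Best number of clusters by Dunn index:", best_k)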

In the next section, we will see how to calculate both indices using Python, with some example data sets.

3. How to calculate Davies-Bouldin index and Dunn index?

In this section, we will see how to calculate the Davies-Bouldin index and the Dunn index using Python. We will use the scikit-learn library, which is a popular and powerful tool for machine learning in Python. We will also use the NumPy library, which is a fundamental package for scientific computing in Python.

First, we need to import the necessary modules from scikit-learn and NumPy. We will use the KMeans class to perform the clustering algorithm, the pairwise_distances function to compute the distances between data points, and the make_blobs function to generate some synthetic data sets for testing. We will also use the numpy module to perform some mathematical operations.

from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances
from sklearn.datasets import make_blobs
import numpy as np

Next, we need to define two functions to calculate the Davies-Bouldin index and the Dunn index. These functions will take the data set and the clustering labels as inputs and return the corresponding index value as output. The functions will use the formulas that we saw in the previous sections to compute the indices.

def davies_bouldin_index(X, labels):
    # X is the data set, labels are the cluster labels
    # n_clusters is the number of clusters
    n_clusters = len(np.unique(labels))
    # cluster_k is a list of arrays holding the data points of each cluster
    cluster_k = [X[labels == k] for k in range(n_clusters)]
    # centroids is an array of the cluster centroids
    centroids = np.array([np.mean(k, axis = 0) for k in cluster_k])
    # S is a list of the average within-cluster distances
    S = [np.mean(pairwise_distances(cluster_k[i], [centroids[i]])) for i in range(n_clusters)]
    # D is the matrix of between-cluster (centroid) distances
    D = pairwise_distances(centroids)
    # DBI is the average, over clusters, of the worst (largest) similarity ratio
    DBI = np.mean([np.max([(S[i] + S[j])/D[i,j] for j in range(n_clusters) if j != i]) for i in range(n_clusters)])
    return DBI

def dunn_index(X, labels):
    # X is the data set, labels are the cluster labels
    # n_clusters is the number of clusters
    n_clusters = len(np.unique(labels))
    # cluster_k is a list of arrays holding the data points of each cluster
    cluster_k = [X[labels == k] for k in range(n_clusters)]
    # centroids is an array of the cluster centroids
    centroids = np.array([np.mean(k, axis = 0) for k in cluster_k])
    # S is a list of the average within-cluster distances
    S = [np.mean(pairwise_distances(cluster_k[i], [centroids[i]])) for i in range(n_clusters)]
    # D is the matrix of between-cluster (centroid) distances
    D = pairwise_distances(centroids)
    # exclude the zero diagonal so the minimum is taken over distinct clusters only
    np.fill_diagonal(D, np.inf)
    # DI is the Dunn index
    DI = np.min(D)/np.max(S)
    return DI

Now, we can use these functions to calculate the Davies-Bouldin index and the Dunn index for any clustering solution. For example, let’s generate a data set with 300 data points and 4 clusters using the make_blobs function. Then, let’s apply the KMeans algorithm with 4 clusters and get the cluster labels. Finally, let’s calculate the indices using our functions.

# generate a data set with 300 data points and 4 clusters
X, y = make_blobs(n_samples = 300, n_features = 2, centers = 4, random_state = 0)
# apply KMeans with 4 clusters
kmeans = KMeans(n_clusters = 4, random_state = 0)
kmeans.fit(X)
# get the cluster labels
labels = kmeans.labels_
# calculate the Davies-Bouldin index
dbi = davies_bouldin_index(X, labels)
# calculate the Dunn index
di = dunn_index(X, labels)
# print the results
print("Davies-Bouldin index:", dbi)
print("Dunn index:", di)

The output is:

Davies-Bouldin index: 0.6974433096744395
Dunn index: 0.2230299769953768

These values suggest that the clustering solution is fairly good, as the Davies-Bouldin index is well below one. On their own, however, the numbers are hard to judge; they become much more informative when we compare them with other clustering solutions. For example, what if we change the number of clusters to 3 or 5? Let’s try it and see the results.

# apply KMeans with 3 clusters
kmeans = KMeans(n_clusters = 3, random_state = 0)
kmeans.fit(X)
# get the cluster labels
labels = kmeans.labels_
# calculate the Davies-Bouldin index
dbi = davies_bouldin_index(X, labels)
# calculate the Dunn index
di = dunn_index(X, labels)
# print the results
print("Davies-Bouldin index:", dbi)
print("Dunn index:", di)

The output is:

Davies-Bouldin index: 0.7524564405094247
Dunn index: 0.1818870589496198

These values indicate that the clustering solution is worse than the previous one, as the Davies-Bouldin index is higher and the Dunn index is lower. This means that the clusters are less separated and less defined. Therefore, 3 clusters is not a good choice for this data set.

# apply KMeans with 5 clusters
kmeans = KMeans(n_clusters = 5, random_state = 0)
kmeans.fit(X)
# get the cluster labels
labels = kmeans.labels_
# calculate the Davies-Bouldin index
dbi = davies_bouldin_index(X, labels)
# calculate the Dunn index
di = dunn_index(X, labels)
# print the results
print("Davies-Bouldin index:", dbi)
print("Dunn index:", di)

The output is:

Davies-Bouldin index: 0.6589841436699859
Dunn index: 0.2362918753667479

These values are slightly better than both previous solutions, as the Davies-Bouldin index is a little lower and the Dunn index a little higher than with 4 clusters. However, the improvement is marginal, and such small gains often come from splitting one of the true clusters rather than from finding real structure, so 5 clusters might be too many for this data set.

Therefore, weighing the marginal difference between the 4-cluster and 5-cluster scores, 4 clusters remains a reasonable choice for this data set. Of course, these indices are not the only way to evaluate the clustering solution, and there might be other factors to consider, such as domain knowledge, the data distribution, and the clustering objectives. However, they are useful tools that provide quantitative measures of cluster validity and help us compare different clustering solutions.
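To automate this comparison instead of running each value of k by hand, we can loop over several candidate numbers of clusters and print both indices side by side. This sketch reuses the davies_bouldin_index and dunn_index functions defined above; the range of k values is just an example.

# compare several candidate numbers of clusters with both indices
for k in range(2, 7):
    kmeans = KMeans(n_clusters = k, random_state = 0)
    labels = kmeans.fit_predict(X)
    dbi = davies_bouldin_index(X, labels)
    di = dunn_index(X, labels)
    print(f"k={k}: DBI={dbi:.3f} (lower is better), DI={di:.3f} (higher is better)")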

In the next sections, we will take a closer look at the formulas behind the Davies-Bouldin index and the Dunn index, and then see how to interpret them and what their advantages and disadvantages are.

3.1. Davies-Bouldin index formula

In this section, we will explain the formula of the Davies-Bouldin index (DBI) and how it measures the cluster validity. As we saw in the previous section, the DBI is calculated as follows:

$$DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{s_i + s_j}{d_{ij}} \right)$$

where:

  • $k$ is the number of clusters
  • $s_i$ is the average within-cluster distance of cluster $i$
  • $s_j$ is the average within-cluster distance of cluster $j$
  • $d_{ij}$ is the between-cluster distance of cluster $i$ and $j$

Let’s break down this formula and see what it means. The DBI is the average of the maximum similarity ratios for each cluster. The similarity ratio is the sum of the within-cluster distances divided by the between-cluster distance. The within-cluster distance is the average distance of the data points in a cluster to their cluster center, also known as the cluster centroid. The between-cluster distance is the distance between the centroids of two clusters.

The similarity ratio measures how similar two clusters are. A high similarity ratio means that the clusters are not well-separated or well-defined, as the data points within the clusters are far from their centroids and the centroids of the clusters are close to each other. A low similarity ratio means that the clusters are well-separated and well-defined, as the data points within the clusters are close to their centroids and the centroids of the clusters are far from each other.

The DBI takes the maximum similarity ratio for each cluster and averages them. This means that the DBI is sensitive to the worst-case scenario, where the clusters are most similar to each other. A high DBI means that the clustering solution has at least one pair of clusters that are very similar to each other, which is not desirable. A low DBI means that the clustering solution has no pair of clusters that are very similar to each other, which is desirable.

Therefore, the DBI measures the cluster validity based on the trade-off between intra-cluster similarity and inter-cluster dissimilarity. A lower DBI indicates a better clustering solution, as the clusters are more separated and more defined.
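To make the similarity ratio concrete, here is a small worked example with made-up numbers (purely illustrative, not taken from any real data set). Suppose we have three clusters with average within-cluster distances $s_1 = 1.0$, $s_2 = 1.5$, $s_3 = 0.8$ and centroid distances $d_{12} = 5.0$, $d_{13} = 9.0$, $d_{23} = 4.0$. The pairwise similarity ratios are:

$$R_{12} = \frac{1.0 + 1.5}{5.0} = 0.50, \quad R_{13} = \frac{1.0 + 0.8}{9.0} = 0.20, \quad R_{23} = \frac{1.5 + 0.8}{4.0} = 0.575$$

Each cluster contributes its worst (largest) ratio, and the DBI is their average:

$$DBI = \frac{1}{3}\left(0.50 + 0.575 + 0.575\right) = 0.55$$

Notice that the result is dominated by clusters 2 and 3, whose centroids are relatively close compared to their spreads.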

In the next section, we will break down the Dunn index formula in the same way.

3.2. Dunn index formula

The Dunn index (DI) is another cluster validity index that measures the quality of a clustering solution. It is based on the same criteria as the Davies-Bouldin index, but in a different way. The DI is calculated as follows:

$$DI = \frac{\min_{i \neq j} d_{ij}}{\max_{k} s_k}$$

where:

  • $d_{ij}$ is the between-cluster distance of cluster $i$ and $j$
  • $s_k$ is the average within-cluster distance of cluster $k$

Let’s break down this formula and see what it means. The DI is the ratio of the minimum between-cluster distance and the maximum within-cluster distance. A high DI means that the clusters are well-separated and well-defined. Therefore, a higher DI indicates a better clustering solution.

The between-cluster distance is the distance between the centroids of two clusters. The minimum between-cluster distance is the smallest distance among all the pairs of clusters, so the DI is sensitive to the worst-separated pair, that is, the two clusters that are closest to each other. A high minimum between-cluster distance means that even the two closest clusters are still far apart, which is desirable.

The within-cluster distance is the average distance of the data points in a cluster to their cluster center, also known as the cluster centroid. The maximum within-cluster distance is the largest of these averages among all the clusters, so the DI is also sensitive to the most dispersed cluster. A low maximum within-cluster distance means that the clustering solution has no cluster that is very dispersed, which is desirable.

Therefore, the DI measures the cluster validity based on the trade-off between intra-cluster similarity and inter-cluster dissimilarity. A higher DI indicates a better clustering solution, as the clusters are more separated and more defined.
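Using the same made-up numbers as in the DBI example above (average within-cluster distances $s_1 = 1.0$, $s_2 = 1.5$, $s_3 = 0.8$ and centroid distances $d_{12} = 5.0$, $d_{13} = 9.0$, $d_{23} = 4.0$), the Dunn index is:

$$DI = \frac{\min(5.0, 9.0, 4.0)}{\max(1.0, 1.5, 0.8)} = \frac{4.0}{1.5} \approx 2.67$$

The value depends only on the closest pair of clusters and the most dispersed cluster; changing any other cluster would not affect the DI at all, which is one reason the index is sensitive to outliers.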

In the next section, we will see how to interpret the DBI and the DI and what their advantages and disadvantages are.

4. How to interpret Davies-Bouldin index and Dunn index?

In this section, we will see how to interpret the Davies-Bouldin index (DBI) and the Dunn index (DI) and what are their advantages and disadvantages. As we saw in the previous sections, the DBI and the DI are cluster validity indices that measure the quality of a clustering solution based on the trade-off between intra-cluster similarity and inter-cluster dissimilarity. A lower DBI and a higher DI indicate a better clustering solution, as the clusters are more separated and more defined.

However, these indices are not perfect and have some limitations. Let’s see some of the pros and cons of using the DBI and the DI for cluster evaluation.

Some of the advantages of using the DBI and the DI are:

  • They are easy to calculate and understand, as they use simple formulas and intuitive concepts.
  • They are applicable to any clustering algorithm and any data set, as they require only the data points and the cluster labels, not ground-truth labels or algorithm-specific parameters.
  • They are useful for comparing different clustering solutions and choosing the best one among them, as they provide a single numerical value that reflects the cluster validity.
  • They are useful for determining the optimal number of clusters for a data set, as they show how the cluster validity changes with the number of clusters.

Some of the disadvantages of using the DBI and the DI are:

  • They are sensitive to outliers and noise, as they can affect the within-cluster and between-cluster distances and distort the cluster validity.
  • They are sensitive to the cluster shape and size, as they assume that the clusters are spherical and compact, which may not be true for some data sets.
  • They are not sufficient to evaluate the clustering solution, as they do not consider other factors such as the domain knowledge, the data distribution, and the clustering objectives.
  • They are not consistent with each other, as they use different ways to measure the cluster validity and may give conflicting results for some clustering solutions.

Therefore, the DBI and the DI are useful tools to measure the cluster validity, but they are not the only ones. You should also use other methods to evaluate your clustering solution, such as visual inspection, domain knowledge, and external validation. You should also be aware of the limitations of the DBI and the DI and use them with caution and interpretation.

In the next sections, we will look at how to interpret each index in more detail, and then see some examples of how to use them for clustering problems.

4.1. Davies-Bouldin index interpretation

In this section, we will see how to interpret the Davies-Bouldin index (DBI) and what it tells us about the quality of a clustering solution. As we saw in the previous section, the DBI is calculated as follows:

$$DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{s_i + s_j}{d_{ij}} \right)$$

where:

  • $k$ is the number of clusters
  • $s_i$ is the average within-cluster distance of cluster $i$
  • $s_j$ is the average within-cluster distance of cluster $j$
  • $d_{ij}$ is the between-cluster distance of cluster $i$ and $j$

The DBI is the average of the maximum similarity ratios for each cluster. The similarity ratio is the sum of the within-cluster distances divided by the between-cluster distance. A high similarity ratio means that the clusters are not well-separated or well-defined. Therefore, a lower DBI indicates a better clustering solution.

How do we interpret the DBI value? There is no definitive answer, as the DBI value depends on the data set, the clustering algorithm, and the number of clusters. However, some general guidelines are:

  • A DBI value close to zero means that the clusters are very separated and very defined, which is ideal.
  • A DBI value close to one means that the clusters are moderately separated and moderately defined, which is acceptable.
  • A DBI value greater than one means that the clusters are poorly separated and poorly defined, which is undesirable.

We can use the DBI value to compare different clustering solutions and choose the one with the lowest DBI value, as it indicates the highest cluster validity. However, we should also be aware of the limitations of the DBI, such as its sensitivity to outliers, noise, cluster shape, and size. We should also use other methods to evaluate our clustering solution, such as visual inspection, domain knowledge, and external validation.

In the next section, we will see how to interpret the Dunn index and what it tells us about the quality of a clustering solution.

4.2. Dunn index interpretation

In this section, we will see how to interpret the Dunn index (DI) and what it tells us about the quality of a clustering solution. As we saw in the previous section, the DI is calculated as follows:

$$DI = \frac{\min_{i \neq j} d_{ij}}{\max_{k} s_k}$$

where:

  • $d_{ij}$ is the between-cluster distance of cluster $i$ and $j$
  • $s_k$ is the average within-cluster distance of cluster $k$

The DI is the ratio of the minimum between-cluster distance and the maximum within-cluster distance. A high DI means that the clusters are well-separated and well-defined. Therefore, a higher DI indicates a better clustering solution.

How do we interpret the DI value? There is no definitive answer, as the DI value depends on the data set, the clustering algorithm, and the number of clusters. However, some general guidelines are:

  • A DI value much greater than one means that the clusters are very separated and very defined, which is ideal.
  • A DI value close to one means that the clusters are moderately separated and moderately defined, which is acceptable.
  • A DI value close to zero means that the clusters are poorly separated and poorly defined, which is undesirable.

We can use the DI value to compare different clustering solutions and choose the one with the highest DI value, as it indicates the highest cluster validity. However, we should also be aware of the limitations of the DI, such as its sensitivity to outliers, noise, cluster shape, and size. We should also use other methods to evaluate our clustering solution, such as visual inspection, domain knowledge, and external validation.

In the next section, we will see some examples of how to use the DBI and the DI for clustering problems and how to interpret their results.

5. How to use Davies-Bouldin index and Dunn index for clustering problems?

In this section, we will see some examples of how to use the Davies-Bouldin index (DBI) and the Dunn index (DI) for clustering problems and how to interpret their results. We will use Python and some popular libraries such as NumPy, Pandas, Matplotlib, and Scikit-learn to perform the clustering and calculate the indices. We will also use some synthetic and real-world data sets to illustrate the concepts.

Before we start, let’s import the necessary libraries and define some helper functions to calculate the DBI and the DI. We will use the following code:

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

# Define helper functions
def within_cluster_distance(X, labels, centroids):
    # Calculate the average distance of the points in each cluster to their own cluster centroid
    distances = pairwise_distances(X, centroids, metric='euclidean')
    own_distances = distances[np.arange(X.shape[0]), labels]
    return [np.mean(own_distances[labels == i]) for i in range(centroids.shape[0])]

def between_cluster_distance(centroids):
    # Calculate the distance between each pair of cluster centroids
    distances = pairwise_distances(centroids, metric='euclidean')
    return distances

def davies_bouldin_index(X, labels):
    # Calculate the Davies-Bouldin index
    n_clusters = len(np.unique(labels))
    centroids = np.array([X[labels == i].mean(axis=0) for i in range(n_clusters)])
    s = within_cluster_distance(X, labels, centroids)
    d = between_cluster_distance(centroids)
    r = np.zeros((n_clusters, n_clusters))
    for i in range(n_clusters):
        for j in range(n_clusters):
            if i != j:
                r[i][j] = (s[i] + s[j]) / d[i][j]
    dbi = np.mean(np.max(r, axis=1))
    return dbi

def dunn_index(X, labels):
    # Calculate the Dunn index
    n_clusters = len(np.unique(labels))
    centroids = np.array([X[labels == i].mean(axis=0) for i in range(n_clusters)])
    s = within_cluster_distance(X, labels, centroids)
    d = between_cluster_distance(centroids)
    # Exclude the zero diagonal when taking the minimum between-cluster distance
    np.fill_diagonal(d, np.inf)
    di = np.min(d) / np.max(s)
    return di

Now, let’s see some examples of how to use these functions and interpret the results.

5.1. Davies-Bouldin index example

In this section, we will see an example of how to use the Davies-Bouldin index (DBI) for a clustering problem and how to interpret the result. We will use a synthetic data set with four clusters and apply the K-means algorithm with different numbers of clusters. We will then calculate the DBI for each clustering solution and compare them.

First, let’s import the necessary libraries and generate the data set. We will use the following code:

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

# Define helper functions
def within_cluster_distance(X, labels, centroids):
    # Calculate the average distance of the points in each cluster to their own cluster centroid
    distances = pairwise_distances(X, centroids, metric='euclidean')
    own_distances = distances[np.arange(X.shape[0]), labels]
    return [np.mean(own_distances[labels == i]) for i in range(centroids.shape[0])]

def between_cluster_distance(centroids):
    # Calculate the distance between each pair of cluster centroids
    distances = pairwise_distances(centroids, metric='euclidean')
    return distances

def davies_bouldin_index(X, labels):
    # Calculate the Davies-Bouldin index
    n_clusters = len(np.unique(labels))
    centroids = np.array([X[labels == i].mean(axis=0) for i in range(n_clusters)])
    s = within_cluster_distance(X, labels, centroids)
    d = between_cluster_distance(centroids)
    r = np.zeros((n_clusters, n_clusters))
    for i in range(n_clusters):
        for j in range(n_clusters):
            if i != j:
                r[i][j] = (s[i] + s[j]) / d[i][j]
    dbi = np.mean(np.max(r, axis=1))
    return dbi

# Generate data set
np.random.seed(42)
X1 = np.random.normal(0, 1, (100, 2))
X2 = np.random.normal(5, 1, (100, 2))
X3 = np.random.normal(10, 1, (100, 2))
X4 = np.random.normal(15, 1, (100, 2))
X = np.concatenate((X1, X2, X3, X4), axis=0)
y = np.array([0] * 100 + [1] * 100 + [2] * 100 + [3] * 100)

# Plot data set
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='rainbow')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic data set with four clusters')
plt.show()

In the output, you can see that the data set has four well-separated and well-defined clusters. Now, let’s apply the K-means algorithm with different numbers of clusters and calculate the DBI for each clustering solution. We will use the following code:

# Apply K-means with different numbers of clusters
k_values = [2, 3, 4, 5, 6]
dbi_values = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    labels = kmeans.labels_
    dbi = davies_bouldin_index(X, labels)
    dbi_values.append(dbi)

# Plot DBI values
plt.figure(figsize=(8, 6))
plt.plot(k_values, dbi_values, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Davies-Bouldin index')
plt.title('DBI values for different numbers of clusters')
plt.show()

In the output, you can see that the DBI value is the lowest when the number of clusters is four, which is the true number of clusters in the data set. This means that the clustering solution with four clusters has the highest cluster validity, as the clusters are well-separated and well-defined. The DBI value increases as the number of clusters deviates from four, which means that the clustering solutions with other numbers of clusters have lower cluster validity, as the clusters are less separated and less defined.

Therefore, we can use the DBI value to evaluate and compare different clustering solutions and choose the one with the lowest DBI value, as it indicates the highest cluster validity. We can also use the DBI value to determine the optimal number of clusters for a data set, as it shows how the cluster validity changes with the number of clusters.

In the next section, we will see an example of how to use the Dunn index for a clustering problem and how to interpret the result.

5.2. Dunn index example

In this section, we will see an example of how to use the Dunn index (DI) for a clustering problem and how to interpret the result. We will use the same synthetic data set with four clusters and apply the K-means algorithm with different numbers of clusters. We will then calculate the DI for each clustering solution and compare them.

First, let’s import the necessary libraries and generate the data set. We will use the same code as in the previous section:

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

# Define helper functions
def within_cluster_distance(X, labels, centroids):
    # Calculate the average distance of the points in each cluster to their own cluster centroid
    distances = pairwise_distances(X, centroids, metric='euclidean')
    own_distances = distances[np.arange(X.shape[0]), labels]
    return [np.mean(own_distances[labels == i]) for i in range(centroids.shape[0])]

def between_cluster_distance(centroids):
    # Calculate the distance between each pair of cluster centroids
    distances = pairwise_distances(centroids, metric='euclidean')
    return distances

def davies_bouldin_index(X, labels):
    # Calculate the Davies-Bouldin index
    n_clusters = len(np.unique(labels))
    centroids = np.array([X[labels == i].mean(axis=0) for i in range(n_clusters)])
    s = within_cluster_distance(X, labels, centroids)
    d = between_cluster_distance(centroids)
    r = np.zeros((n_clusters, n_clusters))
    for i in range(n_clusters):
        for j in range(n_clusters):
            if i != j:
                r[i][j] = (s[i] + s[j]) / d[i][j]
    dbi = np.mean(np.max(r, axis=1))
    return dbi

def dunn_index(X, labels):
    # Calculate the Dunn index
    n_clusters = len(np.unique(labels))
    centroids = np.array([X[labels == i].mean(axis=0) for i in range(n_clusters)])
    s = within_cluster_distance(X, labels, centroids)
    d = between_cluster_distance(centroids)
    # Exclude the zero diagonal when taking the minimum between-cluster distance
    np.fill_diagonal(d, np.inf)
    di = np.min(d) / np.max(s)
    return di

# Generate data set
np.random.seed(42)
X1 = np.random.normal(0, 1, (100, 2))
X2 = np.random.normal(5, 1, (100, 2))
X3 = np.random.normal(10, 1, (100, 2))
X4 = np.random.normal(15, 1, (100, 2))
X = np.concatenate((X1, X2, X3, X4), axis=0)
y = np.array([0] * 100 + [1] * 100 + [2] * 100 + [3] * 100)

# Plot data set
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='rainbow')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic data set with four clusters')
plt.show()

In the output, you can see that the data set has four well-separated and well-defined clusters. Now, let’s apply the K-means algorithm with different numbers of clusters and calculate the DI for each clustering solution. We will use the same code as in the previous section:

# Apply K-means with different numbers of clusters
k_values = [2, 3, 4, 5, 6]
di_values = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    labels = kmeans.labels_
    di = dunn_index(X, labels)
    di_values.append(di)

# Plot DI values
plt.figure(figsize=(8, 6))
plt.plot(k_values, di_values, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Dunn index')
plt.title('DI values for different numbers of clusters')
plt.show()

In the output, you can see that the DI value is the highest when the number of clusters is four, which is the true number of clusters in the data set. This means that the clustering solution with four clusters has the highest cluster validity, as the clusters are well-separated and well-defined. The DI value decreases as the number of clusters deviates from four, which means that the clustering solutions with other numbers of clusters have lower cluster validity, as the clusters are less separated and less defined.

Therefore, we can use the DI value to evaluate and compare different clustering solutions and choose the one with the highest DI value, as it indicates the highest cluster validity. We can also use the DI value to determine the optimal number of clusters for a data set, as it shows how the cluster validity changes with the number of clusters.

In the next section, we will wrap up with a summary of the key points about the DBI and the DI.

6. Conclusion

In this blog, we have learned how to use Davies-Bouldin index and Dunn index for clustering problems. We have seen what these indices are, how they measure cluster validity, how to calculate them, how to interpret them, and how to use them for different data sets. We have also seen some examples of how to apply these indices using Python and some popular libraries.

Here are some key points to remember:

  • Davies-Bouldin index and Dunn index are cluster validity indices that measure the quality of a clustering solution based on intra-cluster similarity and inter-cluster dissimilarity.
  • Davies-Bouldin index uses the ratio of intra-cluster similarity and inter-cluster dissimilarity, while Dunn index uses the ratio of inter-cluster dissimilarity and intra-cluster similarity.
  • A lower Davies-Bouldin index indicates a better clustering solution, while a higher Dunn index indicates a better clustering solution.
  • We can use these indices to evaluate and compare different clustering solutions and choose the one with the highest cluster validity.
  • We can also use these indices to determine the optimal number of clusters for a data set, by looking at how the index values change with the number of clusters (for example, where the DBI reaches its minimum and the DI its maximum).
  • These indices have some limitations, such as their sensitivity to outliers, noise, cluster shape, and size. We should also use other methods to evaluate our clustering solution, such as visual inspection, domain knowledge, and external validation.

We hope that this blog has helped you understand how to use Davies-Bouldin index and Dunn index for clustering problems and how to interpret their results. If you have any questions or feedback, please feel free to leave a comment below. Thank you for reading!
