Performance Metrics in Machine Learning — Part 3: Clustering

Using the right performance metric for the right task

Eugenio "Jay" Zuccarelli
Towards Data Science


Source: Florian Schmetz

In the first two parts of this series, we explored the main types of performance metrics used to evaluate Machine Learning models. These covered the two major types of ML tasks, Classification and Regression. While these types of tasks make up most of the usual applications, another key category exists: Clustering.


While Classification and Regression tasks form what’s called Supervised Learning, Clustering forms the majority of Unsupervised Learning tasks. The difference between these two macro-areas lies in the type of data used. While in Supervised Learning samples are labelled with either a categorical label (Classification) or a numerical value (Regression), in Unsupervised Learning samples are not labelled, making it a relatively complex task to perform and evaluate.

Correctly measuring the performance of Clustering algorithms is key. This is especially true because, in practice, clusters often have to be inspected manually and qualitatively to determine whether the results are meaningful.

In the third part of this series, we will go through the main metrics used to evaluate the performance of Clustering algorithms, so that we have a rigorous set of measures to rely on.

Clustering

Silhouette Score

The Silhouette Score and Silhouette Plot are used to measure the separation distance between clusters. The plot displays how close each point in one cluster is to points in the neighbouring clusters. The measure has a range of [-1, 1] and is a great tool to visually inspect the similarities within clusters and the differences across clusters.

The Silhouette Score is calculated using the mean intra-cluster distance (i) and the mean nearest-cluster distance (n) for each sample. The Silhouette Coefficient for a sample is (n - i) / max(i, n).

Here, n is the mean distance between a sample and the points of the nearest cluster the sample is not a part of, while i is the mean distance between that sample and the other points of its own cluster.
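To make the formula concrete, here is a minimal sketch using scikit-learn’s silhouette_samples, the per-sample counterpart of silhouette_score (the toy data is made up purely for illustration):

import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy 2-D dataset: two well-separated blobs (values made up for illustration)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

# One coefficient per sample: (n - i) / max(i, n)
per_sample = silhouette_samples(X, labels)
print(per_sample)  # all close to +1: tight, well-separated clusters

# The global Silhouette Score is the mean of the per-sample coefficients
print(silhouette_score(X, labels))  # equals per_sample.mean()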

A typical Silhouette Plot represents the cluster labels on the y-axis and the Silhouette Score on the x-axis. The size/thickness of each silhouette is proportional to the number of samples in that cluster.

An example Silhouette Plot. On the y-axis, each value represents a cluster while the x-axis represents the Silhouette Coefficient/Score.

The higher the Silhouette Coefficients (the closer to +1), the further away a cluster’s samples are from the neighbouring clusters’ samples. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighbouring clusters. Negative values, instead, indicate that those samples might have been assigned to the wrong cluster. By averaging the Silhouette Coefficients over all samples, we obtain a global Silhouette Score that describes the performance of the entire clustering with a single value.

Let’s try to understand how the Silhouette Plot can help us find the best number of clusters by looking at the performance of each configuration:

Using “K=2”, meaning two clusters to separate the population, we achieve an average Silhouette Score of 0.70.

Increasing the number of clusters to three, the average Silhouette Score drops a bit.

The same happens as the number of clusters increases. It can also be noticed that the thickness of the silhouettes keeps decreasing as the number of clusters increases, because there are fewer samples in each cluster.

Overall, the average Silhouette Scores are:

For n_clusters = 2 The average silhouette_score is : 0.70
For n_clusters = 3 The average silhouette_score is : 0.59
For n_clusters = 4 The average silhouette_score is : 0.65
For n_clusters = 5 The average silhouette_score is : 0.56
For n_clusters = 6 The average silhouette_score is : 0.45

Having calculated the Silhouette Score for each configuration up to K=6, we can see that, according to this metric, the best number of clusters is two, and that performance generally worsens as the number of clusters grows.
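A minimal sketch of such a sweep, using scikit-learn’s KMeans together with the silhouette_score function described below (the blob data is synthetic, so the exact values will differ from the ones above):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data whose "true" structure is two blobs
X, _ = make_blobs(n_samples=500, centers=2, random_state=42)

for n_clusters in range(2, 7):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)
    print(f"For n_clusters = {n_clusters} The average silhouette_score is : {score:.2f}")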

To calculate the Silhouette Score in Python, you can simply use Sklearn and do:

sklearn.metrics.silhouette_score(X, labels, *, metric='euclidean', sample_size=None, random_state=None, **kwds)

The function takes as input:

  • X: A feature array, or an array of pairwise distances between samples if the parameter metric is set to “precomputed”.
  • labels: The predicted cluster label assigned to each sample.
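As a quick sketch of the two input modes, the same score can be computed from a feature array or from a precomputed distance matrix (toy data made up for illustration):

import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_score

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])

# Feature-array form (default metric='euclidean')
print(silhouette_score(X, labels))

# Equivalent precomputed form: a pairwise distance matrix + metric='precomputed'
D = pairwise_distances(X, metric='euclidean')
print(silhouette_score(D, labels, metric='precomputed'))  # same value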

Rand Index

Another commonly used metric is the Rand Index. It computes a similarity measure between two clusterings by considering all pairs of samples and counting the pairs that are assigned to the same or to different clusters in both the predicted and the true clusterings.

The formula of the Rand Index is:

RI = (a + b) / C(n, 2)

where a is the number of pairs of samples placed in the same cluster in both the true and the predicted clustering, b is the number of pairs placed in different clusters in both, and C(n, 2) is the total number of possible pairs of samples. The RI can range from zero to 1, with 1 indicating a perfect match.

The main drawback of the Rand Index is that it assumes we have access to the ground-truth cluster labels to compare our model against, which makes it much less useful than the Silhouette Score for purely Unsupervised Learning tasks.

To calculate the Rand Index:

sklearn.metrics.rand_score(labels_true, labels_pred)
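A short usage sketch; since the RI only looks at pairs of samples, it is invariant to permuting the cluster label values (the label vectors below are made up for illustration):

from sklearn.metrics import rand_score

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [1, 1, 0, 0, 2, 2]  # same grouping, different label names

print(rand_score(labels_true, labels_pred))  # 1.0: every pair agrees

labels_bad = [0, 1, 0, 1, 0, 1]  # grouping unrelated to the true structure
print(rand_score(labels_true, labels_bad))  # well below 1.0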

Adjusted Rand Index

The Adjusted Rand Index (ARI) is the Rand Index adjusted for chance. Since even a random labelling produces some agreeing pairs, the raw RI score is “adjusted for chance” into the ARI score using the following scheme:

ARI = (RI − Expected_RI) / (max(RI) − Expected_RI)

where Expected_RI is the expected Rand Index of a random labelling.

The Adjusted Rand Index, unlike the plain RI, can take negative values: it is close to zero for random labelling and equals one when the two clusterings are identical.

Similarly to RI, to calculate the ARI:

sklearn.metrics.adjusted_rand_score(labels_true, labels_pred)
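A short sketch of why the adjustment matters: on completely random labels the raw RI can look deceptively high, while the ARI stays near zero (the synthetic labels below are made up for illustration):

import numpy as np
from sklearn.metrics import rand_score, adjusted_rand_score

rng = np.random.default_rng(0)
labels_true = rng.integers(0, 3, size=1000)  # an arbitrary "ground truth"
labels_rand = rng.integers(0, 3, size=1000)  # a completely random clustering

print(rand_score(labels_true, labels_rand))           # around 0.55: looks decent
print(adjusted_rand_score(labels_true, labels_rand))  # around 0.0: corrected for chance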

Mutual Information

The Mutual Information is another metric often used to evaluate the performance of Clustering algorithms. It measures the agreement between two labellings of the same data. Where |Ui| is the number of samples in cluster Ui, |Vj| is the number of samples in cluster Vj and N is the total number of samples, the Mutual Information between clusterings U and V is given as:

MI(U, V) = Σi Σj (|Ui ∩ Vj| / N) · log(N · |Ui ∩ Vj| / (|Ui| · |Vj|))

Similarly to the Rand Index, one of the major drawbacks of this metric is that it requires the ground-truth labels to be known a priori, something which is almost never true in real-life Clustering scenarios.

Using Sklearn:

sklearn.metrics.mutual_info_score(labels_true, labels_pred, *, contingency=None)
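A usage sketch following the same pattern as the other ground-truth-based metrics; note that the raw MI is not bounded above by 1, which is why scikit-learn also offers normalised and adjusted variants (the label vectors below are made up for illustration):

from sklearn.metrics import (adjusted_mutual_info_score, mutual_info_score,
                             normalized_mutual_info_score)

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [1, 1, 0, 0, 2, 2]  # same grouping, relabelled

print(mutual_info_score(labels_true, labels_pred))             # raw MI, in nats (~1.099 here)
print(normalized_mutual_info_score(labels_true, labels_pred))  # 1.0: rescaled to [0, 1]
print(adjusted_mutual_info_score(labels_true, labels_pred))    # 1.0: also corrected for chance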

Calinski-Harabasz Index

Calinski-Harabasz Index is also known as the Variance Ratio Criterion.

The score is defined as the ratio between the between-cluster dispersion and the within-cluster dispersion. The C-H Index is a great way to evaluate the performance of a Clustering algorithm, as it does not require information on the ground-truth labels.

The higher the Index, the better the performance.

The formula is:

s = (tr(Bk) / tr(Wk)) · ((N − k) / (k − 1))

where N is the total number of samples, k is the number of clusters, tr(Bk) is the trace of the between-group dispersion matrix and tr(Wk) is the trace of the within-cluster dispersion matrix, defined by:

Wk = Σq Σ(x ∈ Cq) (x − cq)(x − cq)ᵀ
Bk = Σq nq (cq − cE)(cq − cE)ᵀ

with Cq the set of points in cluster q, cq the centre of cluster q, cE the centre of the whole dataset and nq the number of points in cluster q.

To calculate it with Python:

sklearn.metrics.calinski_harabasz_score(X, labels)
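Since it only needs the data and the predicted labels, it can drive model selection directly, just like the Silhouette Score. A minimal sketch on synthetic data (made up for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic data with three underlying blobs
X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

for n_clusters in range(2, 6):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(X)
    print(n_clusters, calinski_harabasz_score(X, labels))  # higher is better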

Davies-Bouldin Index

The Davies-Bouldin Index is defined as the average similarity measure of each cluster with its most similar cluster. Similarity is the ratio of within-cluster distances to between-cluster distances. In this way, clusters which are farther apart and less dispersed will lead to a better score.

The minimum score is zero and, unlike most performance metrics, lower values indicate better clustering performance.

Similarly to the Silhouette Score, the D-B Index does not require a-priori knowledge of the ground-truth labels, and it has a simpler formulation than the Silhouette Score, making it cheaper to compute.

sklearn.metrics.davies_bouldin_score(X, labels)
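And a matching sketch for the D-B Index on synthetic data (made up for illustration); unlike in the previous loops, here we look for the lowest value:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

for n_clusters in range(2, 6):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(X)
    print(n_clusters, davies_bouldin_score(X, labels))  # lower is better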

Summary

To summarise, in the third part of the series we analysed the main performance metrics for Clustering, focusing in particular on:

  • Silhouette Score
  • Rand Index
  • Adjusted Rand Index
  • Mutual Information
  • Calinski-Harabasz Index
  • Davies-Bouldin Index

To read more articles like this, follow me on Twitter, LinkedIn or my Website.
