How to analyze clustering output?

This article walks you through the steps to analyze the clustering output.

Step 1 - Examine output quality

Select up to 12 clusters to view them in the UMAP dimension-reduced dataset:

  • If the selected clusters can be easily distinguished in the UMAP, the output is of good quality

  • Conversely, if the selected clusters overlap heavily, the output is of poor quality
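
The visual check above can also be approximated numerically. The following is a minimal sketch, not the Engine's own method: it assumes you have exported 2D coordinates for each point together with its cluster label, and it computes the fraction of points whose nearest neighbor belongs to the same cluster. The function name `nearest_neighbor_purity` and the sample coordinates are hypothetical and chosen only for illustration.

```python
import math


def nearest_neighbor_purity(points, labels):
    """Return the fraction of points whose nearest neighbor (in 2D)
    carries the same cluster label. Values near 1.0 suggest
    well-separated clusters; values near 0.0 suggest heavy overlap."""
    same = 0
    for i, (xi, yi) in enumerate(points):
        best_j, best_d = None, math.inf
        for j, (xj, yj) in enumerate(points):
            if i == j:
                continue
            d = math.hypot(xi - xj, yi - yj)
            if d < best_d:
                best_j, best_d = j, d
        if labels[best_j] == labels[i]:
            same += 1
    return same / len(points)


# Hypothetical 2D coordinates standing in for a UMAP embedding.
separated_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
                    (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
separated_labels = [0, 0, 0, 1, 1, 1]

# Points from two clusters interleaved along a line (heavy overlap).
overlapping_points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0),
                      (0.3, 0.0), (0.4, 0.0), (0.5, 0.0)]
overlapping_labels = [0, 1, 0, 1, 0, 1]

purity_separated = nearest_neighbor_purity(separated_points, separated_labels)
purity_overlapping = nearest_neighbor_purity(overlapping_points, overlapping_labels)
print(purity_separated, purity_overlapping)
```

This brute-force loop is quadratic in the number of points, so it is only suitable for a small sample. It is a rough proxy for what the eye sees in the plot and does not replace the visual inspection.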

Example of good output quality

Example of fair output quality (not poor, still workable)

Step 2 - Examine the cluster profile overview

Select up to 12 clusters to view the heatmap and compare different cluster profiles

  • The y-axis displays the cluster ID

  • The x-axis displays the most important features.

  • The color of each cell represents the average value of the feature on the x-axis within the cluster on the y-axis. A clear color difference for a feature across clusters indicates a strong difference in that feature's average value between those clusters. For example, the average value of Age is 33 in cluster 3 and 55 in cluster 1, so the cell for Age in cluster 3 is lighter than the one in cluster 1.
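
To make the heatmap values concrete, here is a minimal sketch of how a matrix of per-cluster feature averages could be computed. It is an illustration under stated assumptions, not the Engine's implementation: the function name `cluster_feature_means`, the `cluster` key, and the sample rows are hypothetical. The rows are chosen so that the Age averages match the example above (33 for cluster 3 and 55 for cluster 1).

```python
from collections import defaultdict


def cluster_feature_means(rows, cluster_key, features):
    """Return {cluster_id: {feature: mean value}} for the given rows."""
    sums = defaultdict(lambda: {f: 0.0 for f in features})
    counts = defaultdict(int)
    for row in rows:
        cluster_id = row[cluster_key]
        counts[cluster_id] += 1
        for f in features:
            sums[cluster_id][f] += row[f]
    return {
        cluster_id: {f: sums[cluster_id][f] / counts[cluster_id] for f in features}
        for cluster_id in sorted(counts)
    }


# Hypothetical data: each row is one item with its assigned cluster.
rows = [
    {"cluster": 1, "Age": 50, "Income": 40},
    {"cluster": 1, "Age": 60, "Income": 60},
    {"cluster": 3, "Age": 30, "Income": 80},
    {"cluster": 3, "Age": 36, "Income": 90},
]

means = cluster_feature_means(rows, "cluster", ["Age", "Income"])
for cluster_id, profile in means.items():
    print(cluster_id, profile)
```

Each inner dictionary corresponds to one row of the heatmap, and each value to the quantity that a cell's color encodes.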

Clustering profile

Step 3 - Examine the detailed description of selected clusters

Select the cluster of interest using the dropdown. This will show a detailed description of the selected cluster generated automatically by the Engine’s AI capabilities:

Selecting a cluster to see its detailed description.

When generating the detailed description, the Engine’s AI balances the simplicity (or interpretability) of the description against its accuracy.

The quality of the AI-generated cluster descriptions is indicated by two scores, coverage and exclusivity, both of which should be close to 100%.

The coverage metric tells us what proportion of the items in the target cluster fit the generated description. The exclusivity metric tells us what proportion of the items in the dataset that fit the description actually belong to the target cluster. A description that is insufficient will show low coverage, while one that is too generic will show low exclusivity.
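
The two definitions above can be sketched in a few lines of code. This is an illustration based only on the definitions in this article, not the Engine's actual implementation: the function name `coverage_and_exclusivity`, the `cluster` key, the sample items, and the two example descriptions are all hypothetical.

```python
def coverage_and_exclusivity(items, target_cluster, matches_description):
    """Compute (coverage, exclusivity) for a description of one cluster.

    coverage    = items in the cluster that fit the description
                  / all items in the cluster
    exclusivity = items in the cluster that fit the description
                  / all items in the dataset that fit the description
    """
    in_cluster = [it for it in items if it["cluster"] == target_cluster]
    matching = [it for it in items if matches_description(it)]
    both = [it for it in in_cluster if matches_description(it)]
    coverage = len(both) / len(in_cluster) if in_cluster else 0.0
    exclusivity = len(both) / len(matching) if matching else 0.0
    return coverage, exclusivity


# Hypothetical dataset: four items in cluster 3 and four in cluster 1.
items = [
    {"cluster": 3, "Age": 30}, {"cluster": 3, "Age": 33},
    {"cluster": 3, "Age": 36}, {"cluster": 3, "Age": 45},
    {"cluster": 1, "Age": 55}, {"cluster": 1, "Age": 60},
    {"cluster": 1, "Age": 38}, {"cluster": 1, "Age": 52},
]

# Hypothetical description "Age < 40": misses one cluster-3 item
# and also matches one cluster-1 item.
coverage, exclusivity = coverage_and_exclusivity(
    items, 3, lambda it: it["Age"] < 40
)
print(coverage, exclusivity)  # 0.75 0.75

# A description that is too generic ("Age < 100") covers every item
# in the cluster but is not exclusive to it.
generic_coverage, generic_exclusivity = coverage_and_exclusivity(
    items, 3, lambda it: it["Age"] < 100
)
print(generic_coverage, generic_exclusivity)  # 1.0 0.5
```

Readers familiar with classification metrics may notice that, as defined here, coverage plays the role of recall and exclusivity the role of precision for the description with respect to the target cluster.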

The Engine then tells us whether a cluster can be described in a unified way, or whether it is best to split it into two or more sub-divisions for a better description. The profile of each sub-division is then given in terms of the columns used for clustering and what value(s) they typically take for the selected cluster, in a user-friendly manner:

This cluster is best described using two sub-divisions. The size of each sub-division is as indicated. Each sub-division is described in a user-friendly, readable way.

Users who want further detail can click the blue “View Details” link, which opens a series of visualizations showing how each line in the description narrows it further toward the selected cluster:

Detailed view of sub-division 1

Detailed view of sub-division 2