What is clustering?

This article explains clustering as a term and as a concept within the AI & Analytics Engine.

Clustering is a set of techniques in machine learning

Clustering can automatically discover groups of similar entities from a dataset and segment it accordingly.

[Figure: an input dataset (left) and its clustering result (right)]

💡How many clusters are there in the input dataset on the left? A good clustering technique will produce results as seen on the right side. Colours indicate the cluster ID assigned to each entity in the data.

Unlike classification and regression, where the goal is to build a model that predicts a specified column in the dataset, clustering has no column to predict. Instead, clustering is used to discover and describe patterns in data by analyzing similarities to find the most coherent groupings automatically.
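To make the idea concrete, here is a minimal sketch of one well-known clustering technique, k-means (Lloyd's algorithm), written in plain Python. It only illustrates grouping by similarity without a target column. It is not the Engine's implementation, and the function name `kmeans`, the toy points, and the fixed seed are assumptions made for this example.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Group points into k clusters with Lloyd's k-means algorithm.

    Returns one cluster ID (0..k-1) per input point.
    """
    rng = random.Random(seed)
    # Start from k distinct input points chosen at random as centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return labels


# Toy data: two visibly separate groups and no label column at all.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
          (8.0, 8.2), (8.1, 7.9), (7.8, 8.0)]
labels = kmeans(points, k=2)
print(labels)  # one cluster ID per point; which group gets ID 0 is arbitrary
```

The cluster IDs carry no meaning of their own. What matters is which points share an ID, which is why the figure above uses colours rather than names for the groups.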

Clustering thus provides a powerful way to generate valuable and actionable insights from datasets of any size. Real-world applications of clustering include demographic and behavioural segmentation of customers, product recommendation, market research, and biological data analysis.

Clustering within the AI & Analytics Engine 

On the AI & Analytics Engine, clustering is supported as an App type. To use clustering, you need to provide the “problem description” as an input, which includes:

  • The dataset for which clustering needs to be performed;

  • The columns that must be taken into consideration while determining the similarity between items; and

  • The algorithm to use and its configurations.
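The three inputs above can be pictured as a small data structure. The sketch below is purely illustrative: `ClusteringProblem`, its field names, the `validate` helper, and the example dataset and column names are all hypothetical and do not correspond to any actual Engine API.

```python
from dataclasses import dataclass, field


@dataclass
class ClusteringProblem:
    """Hypothetical container for the three parts of a problem description."""
    dataset: str            # the dataset to be clustered
    feature_columns: list   # columns used to determine similarity
    algorithm: str          # the clustering algorithm to run
    config: dict = field(default_factory=dict)  # algorithm configuration

    def validate(self, available_columns):
        """Raise ValueError if the column selection cannot be used."""
        if not self.feature_columns:
            raise ValueError("select at least one feature column")
        missing = [c for c in self.feature_columns
                   if c not in available_columns]
        if missing:
            raise ValueError(f"columns not in dataset: {missing}")


# All names and values below are made up for illustration.
problem = ClusteringProblem(
    dataset="telco_customers",
    feature_columns=["tenure", "monthly_charges"],
    algorithm="k-means",
    config={"n_clusters": 4},
)
problem.validate(
    available_columns=["customer_id", "tenure", "monthly_charges"]
)
```

Note that an identifier column such as `customer_id` is deliberately left out of `feature_columns` here, since it says nothing about how similar two items are.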

The clustering builder pipeline 

The clustering pipeline consists of the following steps and is accessible via the project summary page: select "Build from scratch", then "Clustering".

[Figure: App problem type selection, clustering]

Add a clustering-ready dataset or one or more datasets needed to prepare a clustering-ready dataset.

[Figure: Clustering app builder pipeline]

Use the prepared dataset as clustering input.

[Figure: Clustering app builder pipeline, using the prepared dataset]

Select features to be used to determine the similarity between items, then confirm the types of these features.

[Figure: Clustering app builder pipeline, selecting the features]

Choose and configure clustering algorithms.

[Figure: Clustering app builder pipeline, configuring the algorithms]

Generating and interpreting cluster results

You can generate multiple clustering results using different algorithms and configurations, based on the same columns. Under each clustering result, you can:

  • Examine an overview, detailed analysis, and insights about each cluster.

  • Add the output to your project as a dataset with a cluster ID and other associated columns, or export the result externally as a file or as a table in an external database.

🎓Learn more about how to export your clustering results.
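As a rough picture of what a clustered output dataset looks like, the sketch below appends a cluster ID column to the original rows and writes them as CSV using Python's standard `csv` module. The helper `write_clustered_csv`, the column name `cluster_id`, and the sample rows are assumptions for illustration. The Engine's actual export format may differ.

```python
import csv
import io


def write_clustered_csv(rows, labels, header, out, id_column="cluster_id"):
    """Write each input row followed by its cluster ID to a CSV stream."""
    writer = csv.writer(out)
    writer.writerow([*header, id_column])
    for row, label in zip(rows, labels):
        writer.writerow([*row, label])


# Illustrative data: two customers per cluster.
header = ["customer_id", "tenure"]
rows = [("c1", 2), ("c2", 3), ("c3", 60), ("c4", 58)]
labels = [0, 0, 1, 1]

buffer = io.StringIO()  # stands in for a file opened with newline=""
write_clustered_csv(rows, labels, header, buffer)
lines = buffer.getvalue().splitlines()
print(lines[0])  # header row, with the cluster ID as the last column
```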

Overview: Summary and clustering quality 

[Figure: Clustering quality visualization]
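One widely used way to quantify clustering quality is the silhouette coefficient, which compares how close each item is to its own cluster with how close it is to the nearest other cluster. The sketch below computes it in plain Python. It is offered as background only: the article does not state which quality measure the Engine reports, and the function name and toy data are assumptions.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (between -1 and 1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it adds nothing).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
          (8.0, 8.2), (8.1, 7.9), (7.8, 8.0)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])  # matches the groups
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])   # mixes the groups
print(good > bad)
```

Values near 1 indicate compact, well-separated clusters, while values near 0 or below suggest overlapping or poorly assigned clusters. A score like this is one way to compare results produced by different algorithms or configurations on the same columns.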

Analysis: Important dimensions for each cluster

[Figure: Cluster profiles]
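A simple heuristic for finding the dimensions that characterise a cluster is to measure how far the cluster's mean on each feature sits from the overall mean, in units of the overall standard deviation. The sketch below ranks features this way. It is an assumption about how such a profile could be computed, not a description of the Engine's analysis, and the feature names and values are invented.

```python
import statistics


def cluster_profiles(rows, labels, feature_names):
    """Rank features per cluster by the standardized gap between the
    cluster mean and the overall mean."""
    columns = list(zip(*rows))
    overall_mean = [statistics.fmean(col) for col in columns]
    overall_std = [statistics.pstdev(col) for col in columns]

    profiles = {}
    for cluster_id in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == cluster_id]
        scores = []
        for j, name in enumerate(feature_names):
            cluster_mean = statistics.fmean([r[j] for r in members])
            # A constant column has zero spread and carries no signal.
            if overall_std[j] == 0:
                z = 0.0
            else:
                z = (cluster_mean - overall_mean[j]) / overall_std[j]
            scores.append((name, round(z, 2)))
        # Most distinctive feature first.
        profiles[cluster_id] = sorted(
            scores, key=lambda s: abs(s[1]), reverse=True
        )
    return profiles


# Illustrative rows of (tenure, monthly_charges).
rows = [(2, 70), (3, 72), (1, 69), (60, 71), (58, 70), (61, 73)]
labels = [0, 0, 0, 1, 1, 1]
profiles = cluster_profiles(rows, labels, ["tenure", "monthly_charges"])
for cluster_id, ranked in profiles.items():
    print(cluster_id, ranked)
```

In this toy data the two clusters differ mainly in `tenure`, so that feature is ranked first for both: strongly below average for one cluster and strongly above average for the other.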

Detailed descriptions of clusters

[Figure: Detailed cluster descriptions]

🎓Now that you understand what clustering is, follow along with this guide: How to build a clustering pipeline.