This article explains clustering as a term and as a concept within the AI & Analytics Engine. Clustering is a set of machine learning techniques that automatically discover groups of similar entities in a dataset and segment it accordingly.
💡How many clusters are there in the input dataset on the left? A good clustering technique will produce results as seen on the right side. Colours indicate the cluster ID assigned to each entity in the data.
Unlike classification and regression, where the goal is to build a model that predicts a specified column in the dataset, clustering has no column to predict. Instead, it discovers and describes patterns in the data by analyzing similarities between entities and automatically finding the most coherent groupings.
Clustering thus provides a powerful way to generate actionable insights from datasets. Real-world applications include demographic and behavioural segmentation of customers, product recommendation, market research, and biological data analysis.
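To make the idea of grouping without a target column concrete, here is a minimal sketch of k-means, one common clustering technique, written in plain Python. It is an illustration only and is not a description of how the AI & Analytics Engine implements clustering; the function name `k_means`, its parameters, and the toy data are all assumptions made for this example.

```python
import random
from math import dist


def k_means(points, k, iterations=20, seed=0):
    """Group points into k clusters and return one cluster ID per point."""
    rng = random.Random(seed)
    # Start from k distinct points picked at random as initial centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels


# Toy data: two visibly separate groups and no column to predict.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (7.9, 8.1), (8.3, 7.7)]
labels = k_means(points, k=2)
print(labels)  # one cluster ID per point; which group gets which ID is arbitrary
```

The cluster IDs themselves carry no meaning beyond group membership, which is why the same colour-per-ID convention is used in the illustration above.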
Clustering within the AI & Analytics Engine
On the AI & Analytics Engine, clustering is supported as an App type. To use clustering, you need to provide the “problem description” as an input, which includes:
The dataset for which clustering needs to be performed;
The columns that must be taken into consideration while determining the similarity between items; and
The algorithm to use and its configurations.
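The three inputs above can be pictured as a small structured record. The sketch below is purely hypothetical: `ClusteringProblem`, its field names, and the `n_clusters` setting are assumptions chosen for illustration and do not reflect the Engine's actual API or configuration format.

```python
from dataclasses import dataclass, field


@dataclass
class ClusteringProblem:
    """Hypothetical container for a clustering problem description."""

    dataset: str                      # the dataset to cluster
    feature_columns: list[str]        # columns used to measure similarity
    algorithm: str = "k-means"        # assumed algorithm name
    config: dict = field(default_factory=dict)  # algorithm settings

    def validate(self, available_columns):
        """Return a list of problems found; an empty list means usable."""
        errors = []
        if not self.feature_columns:
            errors.append("select at least one feature column")
        missing = [c for c in self.feature_columns if c not in available_columns]
        if missing:
            errors.append(f"columns not in dataset: {missing}")
        return errors


problem = ClusteringProblem(
    dataset="customers",
    feature_columns=["age", "annual_spend"],
    config={"n_clusters": 4},
)
errors = problem.validate(available_columns=["customer_id", "age", "annual_spend"])
print(errors)
```

Keeping the column selection separate from the algorithm settings mirrors the way several clustering results can later be generated from the same columns.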
The clustering builder pipeline
The clustering pipeline is accessible from the project summary page: select "Build from scratch", then "Clustering". It consists of the following steps.
App problem type selection, clustering
Add a clustering-ready dataset or one or more datasets needed to prepare a clustering-ready dataset.
Clustering app builder pipeline
Use the prepared dataset as clustering input.
Clustering app builder pipeline, using the prepared dataset
Select features to be used to determine the similarity between items, then confirm the types of these features.
Clustering app builder pipeline, selecting the features
Choose and configure clustering algorithms.
Clustering app builder pipeline, configure algorithms
Generating and interpreting cluster results
You can generate multiple clustering results from the same columns by using different algorithms and configurations. For each clustering result, you can:
Examine an overview, detailed analysis, and insights about each cluster.
Add the output to your project as a dataset containing a cluster ID and other associated columns, or export the result externally as a file or as a table in an external database.
🎓Learn more about how to export your clustering results.
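As a rough picture of what an output dataset with a cluster ID looks like, the sketch below attaches one ID to each input row and serializes the result as CSV text. The column name `cluster_id`, the helper functions, and the sample rows are assumptions for illustration; the Engine's real output schema and export mechanism may differ.

```python
import csv
import io


def attach_cluster_ids(rows, labels, column="cluster_id"):
    """Return copies of the rows with a cluster ID column appended."""
    if len(rows) != len(labels):
        raise ValueError("expected exactly one cluster ID per row")
    return [{**row, column: label} for row, label in zip(rows, labels)]


def to_csv(rows):
    """Serialize a non-empty list of dict rows as CSV text with a header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical input rows and the cluster IDs assigned to them.
rows = [
    {"customer_id": "a1", "age": 34},
    {"customer_id": "b2", "age": 58},
]
labels = [0, 1]

clustered = attach_cluster_ids(rows, labels)
csv_text = to_csv(clustered)
print(csv_text)
```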
Overview: Summary and clustering quality
Clustering quality visualization
Analysis: Important dimensions for each cluster
Detailed descriptions of clusters
Detailed cluster descriptions
🎓Now that you understand what clustering is, follow along with this guide: How to build a clustering pipeline.