This blog demonstrates the clustering feature in the AI & Analytics Engine, applied to a dataset of Pokémon character statistics.
Understanding Clustering in Machine Learning: Algorithms and Use Cases
What is clustering in machine learning?
Clustering is an unsupervised machine learning method in which datapoints are organized into groups, or clusters, of similar datapoints. The data is divided such that each datapoint has little or no similarity to datapoints in other clusters.
Clustering is a task performed by specific machine learning algorithms that scan a dataset and place each datapoint in a cluster alongside other datapoints with similar features.
Clustering vs classification
Clustering is similar to classification in that it identifies patterns within data. However, classification is a supervised learning method, where the training data must be labeled beforehand, whereas clustering is unsupervised and does not require labeled data.
One way to think of the distinction is that clustering divides data into natural groupings based on all its features, whereas classification uses the features to predict its class.
We’ve mentioned that clustering is an unsupervised machine learning method, but what does that mean? Unsupervised machine learning algorithms are self-learning: they learn from data without human supervision. They take an unlabeled dataset and infer its structure from the similarities, differences, and patterns within the data.
Because clustering is unsupervised, it is often used in exploratory data analysis to break down complex data and discover new patterns.
Types of clustering
Clustering can be “clustered” into two main categories. Hard clustering assigns each datapoint entirely to one cluster, and one cluster only. Conversely, soft clustering assigns each datapoint a probability of belonging to each cluster.
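The distinction is easy to see in code. In this illustrative sketch (using scikit-learn and made-up toy data), `KMeans` produces hard assignments (one label per point), while a Gaussian mixture model produces soft assignments (a probability per cluster):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Toy 2-D data: two obvious groups
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])

# Hard clustering: each point gets exactly one cluster label
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(hard_labels)  # one integer label per point

# Soft clustering: each point gets a probability for each cluster
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X)
print(probs.round(3))  # rows sum to 1
```

With well-separated groups like these, the soft probabilities end up close to 0 or 1, but on ambiguous points they express genuine uncertainty that hard labels cannot.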
As with most machine learning methods, there are various clustering algorithms that go about the task in different ways. Different algorithms will often yield significantly different results, even for the same input data.
K-means clustering is one of the most widely used clustering algorithms due to its applicability to a wide range of uses, ease of implementation, and speed.
K-means clustering is a centroid-based algorithm that divides the data into k clusters. Each cluster is assigned a centroid (its center), and each datapoint is assigned to the cluster with the closest centroid. The process is iterative: each centroid is repositioned to the center of the datapoints assigned to it, and the datapoints are reassigned, until the final clusters are obtained.
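The assign-then-recompute loop can be sketched in a few lines of pure Python. This is a minimal 1-D illustration (k = 2, with made-up points and starting centroids); real use would call a library implementation such as scikit-learn's `KMeans`:

```python
def kmeans_1d(points, centroids, iters=10):
    """Minimal k-means sketch on 1-D points."""
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans_1d([1.0, 1.2, 0.8, 10.0, 10.5, 9.5], [0.0, 5.0])
print(centroids)  # centroids converge to 1.0 and 10.0
```

Note that the result depends on the starting centroids, which is why library implementations typically run the algorithm several times from different random initializations.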
Hierarchical clustering is a connectivity-based algorithm, and as the name suggests, is a method in which a hierarchy, or tree, of clusters is created. It's based on the premise that datapoints closer in proximity are more related than those further away.
Each datapoint starts off as its own cluster at the bottom of the tree, and at each step the two closest clusters are merged, working up the hierarchy. Hierarchical clustering allows for smaller clusters to be created and doesn’t require specifying the number of clusters; however, it doesn’t handle outliers well, as they can merge with other clusters.
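As a hedged sketch of what this looks like in practice (scikit-learn's agglomerative implementation, with toy data of our own choosing), cutting the tree at three clusters separates two tight pairs from a lone point:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy data: two close pairs and one distant point
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 10]])

# Merge the closest clusters bottom-up until three clusters remain
labels = AgglomerativeClustering(n_clusters=3, linkage="single").fit_predict(X)
print(labels)
```

The `linkage` parameter controls how the distance between two clusters is measured (e.g. closest pair of points for `"single"`, cluster means for `"average"`), and different choices can produce quite different trees.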
Mean-shift clustering is a density-based algorithm that iteratively assigns datapoints to clusters; however, unlike K-means, it automatically determines the number of clusters. In mean-shift clustering, datapoints are clustered based on their proximity to a centroid, but each centroid iteratively shifts toward the point of maximum density.
Mean-shift clustering is particularly useful when there is no prior knowledge of the number of clusters required, and when they have arbitrary shapes.
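A short illustrative sketch with scikit-learn's `MeanShift` (toy data and bandwidth chosen by us) shows the number of clusters emerging from the data rather than being supplied:

```python
import numpy as np
from sklearn.cluster import MeanShift

# Two dense groups of points; note that no cluster count is given
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1]])

# bandwidth sets the size of the window that shifts toward density peaks
ms = MeanShift(bandwidth=2.0).fit(X)
print(ms.cluster_centers_)  # two centers found, no k supplied
```

The trade-off is that the `bandwidth` parameter now plays the role that k played in K-means: too small and every point becomes its own cluster, too large and distinct groups merge.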
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise and is also known as density-based clustering. DBSCAN works on the premise that high-density regions (clusters) are separated by low-density regions (noise).
DBSCAN requires two inputs: Epsilon (eps), the distance threshold for two datapoints to be considered neighbours, and MinPts, the minimum number of neighbours a datapoint must have to anchor a cluster. This results in datapoints being categorized as:
Core point: has at least MinPts neighbours within eps
Border point: within the neighbourhood of a core point, but has fewer than MinPts neighbours
Noise/Outlier: neither a core point nor within the neighbourhood of one
DBSCAN is advantageous when outliers need to be identified; however, it can struggle with data of varying density.
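The eps/MinPts mechanics above can be seen in a small sketch with scikit-learn's `DBSCAN` (toy data and parameter values are ours). Points that belong to no dense region come back labeled -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated outlier
X = np.array([[1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [1.1, 1.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
              [20.0, 20.0]])

# eps is the neighbourhood radius; min_samples plays the role of MinPts
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)  # -1 marks noise
```

Unlike the other algorithms discussed here, DBSCAN does not force every point into a cluster, which is exactly what makes it useful for outlier detection.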
Clustering use cases
Let's go over a few ways that clustering is being utilized for real-life business applications.
Clustering can be used to segment customers into groups that share similar traits or behaviors. Customer segmentation can be applied to virtually any industry, so long as the company has customer information and behavior data. These segments can be used to tailor marketing and communication for each segment, providing a more personalized customer experience.
Fraud detection is used by financial services to detect and flag suspicious transactions. Fraud detection helps banks recoup losses from transfer or payment fraud, which cost American banks $1.59 billion USD in 2022. Fraud detection is a form of anomaly detection, and clustering is used to identify outlier transactions that deviate from expected patterns. As discussed earlier, the DBSCAN algorithm is particularly effective for this use.
Recommendation engines can take many forms: product recommendations on e-commerce sites, movie suggestions on streaming sites, or even recommended articles from news sites. All these examples use clustering to recommend the most relevant content to a user. K-means clustering is the most common algorithm for this kind of system.
Hopefully, this blog has given you some insight into clustering in machine learning. If you’re looking to start building some clustering models for yourself, consider using the AI & Analytics Engine.
Not sure where to start with machine learning? Reach out to us with your business problem, and we’ll get in touch with how the Engine can help you specifically.