Unsupervised Machine Learning: K-Means Clustering


It’s time to explore a different domain of machine learning: unsupervised learning.

Having discussed in detail supervised machine learning algorithms such as linear regression, decision trees, XGBoost and neural networks, we now shift our focus to one of the most popular unsupervised machine learning algorithms: K-Means clustering.

What is Unsupervised Learning?

In unsupervised learning, the data is not labeled, so there is no target variable to predict. We have a set of features, and instead of trying to predict something, our goal is to group similar observations.

What is Clustering?

Clustering groups observations that have similar properties or characteristics. This helps us to unearth hidden patterns and structures in the data.

In a good clustering, observations within a cluster are more similar to each other than to observations in other clusters, and the clusters themselves are as different from one another as possible.

K-Means Clustering: The Algorithm

K-Means is an iterative clustering algorithm that assigns each observation in a dataset to exactly one of K clusters, where K is a number we specify before running the algorithm.

The main objective of the K-Means algorithm is to minimize the sum of squared distances between the observations in each cluster and that cluster's centroid. The centroid of a cluster is the mean of all the observations in it, computed feature by feature.
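To make the objective concrete, here is a minimal sketch that computes it for a given assignment of points to clusters. The function name `within_cluster_ss` and the four sample points are made up for illustration and are not taken from the article's notebook.

```python
import numpy as np

def within_cluster_ss(X, labels):
    """Sum of squared distances from each observation to its cluster centroid."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)  # centroid = feature-wise mean of the cluster
        total += ((members - centroid) ** 2).sum()
    return total

# Made-up 2-D data: two points near (1, 2) and two points near (9, 8)
X = np.array([[1.0, 1.0], [1.0, 3.0], [8.0, 8.0], [10.0, 8.0]])

sensible = np.array([0, 0, 1, 1])  # nearby points share a cluster
mixed = np.array([0, 1, 0, 1])     # each cluster spans both groups

print(within_cluster_ss(X, sensible))  # 4.0
print(within_cluster_ss(X, mixed))     # 102.0
```

The sensible grouping gives a much smaller value than the mixed one, which is the quantity K-Means tries to drive down.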

Here we list the steps to demonstrate how the algorithm works behind the scenes:

  1. Choose the number of clusters K.
  2. Select K random points from the data as the initial centroids.
  3. Calculate the squared distance between each data point and every centroid.
  4. Assign each data point to its closest centroid, based on the computation from step 3.
  5. Recompute the centroid of each newly formed cluster as the mean of its points.
  6. Repeat steps 3, 4, and 5 until the cluster assignments no longer change.
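The loop described above can be sketched in a few lines of NumPy. This is an illustrative version under some assumptions, not the code from the article's notebook: it uses the common formulation in which the initial centroids are K randomly chosen observations, and the function name `kmeans`, the `max_iter` cap, and the `seed` parameter are introduced here for the example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means: returns (labels, centroids)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Pick K distinct observations as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every centroid, shape (n, k)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Assign each point to its closest centroid
        new_labels = dists.argmin(axis=1)
        # Stop once the assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recompute each centroid as the mean of its cluster
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Made-up data: two well-separated groups of three points each
points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Which group ends up as cluster 0 depends on the random initial centroids, so the label numbers themselves carry no meaning; only the grouping does.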

The Iris Dataset

We are going to implement the K-Means algorithm on the AI & Analytics Engine using the Iris data set. This data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features are measured for each sample: the length and width of the sepals and of the petals. Although the data set is labeled, so we do know the target variable, we are going to drop the class column and treat it as an unsupervised machine learning problem.
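For readers following along in code, one way to get the same data is the copy of Iris that ships with scikit-learn. This is a sketch that assumes scikit-learn is installed; it is not necessarily how the article's notebook loads the data. Here "dropping the class column" simply means using the feature matrix and ignoring the species labels.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Keep only the four measurements; the species labels in iris.target are left unused
X = iris.data

print(X.shape)             # (150, 4)
print(iris.feature_names)  # sepal and petal length/width, in cm
```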

Python Implementation

You can access the complete Jupyter notebook with the code here.
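As a rough idea of what such an implementation can look like, here is a minimal sketch using scikit-learn's `KMeans`. It is an assumption about a reasonable approach, not a copy of the notebook: the choices of `n_clusters=3` (matching the three species), `n_init=10`, and `random_state=42` are made here for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Features only: the species labels are deliberately not used
X = load_iris().data

# K = 3 clusters; n_init=10 reruns the algorithm from 10 starting points
# and keeps the run with the lowest sum of squared distances
model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(X)

print(labels[:10])             # cluster index assigned to the first 10 flowers
print(model.cluster_centers_)  # one centroid per cluster, four features each
print(model.inertia_)          # the minimized sum of squared distances
```

The `inertia_` attribute is the objective K-Means minimizes, so it can be used to compare runs with different values of K.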

AI & Analytics Engine: No-Code Implementation

Our head of data science has created a short and sweet tutorial to give you a walk-through of how to implement the K-Means Clustering algorithm on the AI & Analytics Engine using the Iris data set.

You do not have to write a single line of code for this. To learn more about getting started with no-code data science, have a look at this article.

 

Wrap-Up

In this blog, we explored K-Means clustering, one of the most intuitive and widely used algorithms in unsupervised machine learning. It is computationally efficient, and its results are easy to visualize. We then implemented the algorithm on the AI & Analytics Engine using the Iris data set, which took only a few minutes and did not require any coding.

 

Ready to give K-Means Clustering a try for yourself? Simply create a trial account with the Engine.
