K-Means Clustering


This might sound like heavy data science jargon, with “K-means” and “Clustering” in it, but there’s no need to panic: it’s just another fancy term from a data scientist’s back pocket. To understand the intuition behind it, let’s “.split()” the term, “.describe()” each token, and answer the questions that come up along the way.

What is Clustering? It is an unsupervised machine learning technique that divides data into distinct groups such that the observations within each group are similar to one another. Each group is called a cluster.
What is Unsupervised Machine Learning? It is a type of machine learning where the output (the label) is not known in advance. Instead, we try to learn relationships, structure and underlying patterns from the input data alone.

What is K? It is the number of clusters you want to form from the data. The ‘-means’ part will be explained in greater detail further ahead.

So what is K-means Clustering, finally? It is an unsupervised machine learning algorithm that attempts to group similar data points together, splitting your data into K clusters.
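To make “similar” a bit more concrete: K-means measures similarity by Euclidean (straight-line) distance, so two observations are similar when they sit close together. Here is a tiny sketch of that idea. The three points and the helper name `distance` are made up purely for illustration.

```python
import math

# Made-up 2-D observations, purely for illustration.
a = (1.0, 1.0)
b = (1.5, 2.0)
c = (8.0, 9.0)

def distance(p, q):
    """Euclidean (straight-line) distance between two points."""
    return math.dist(p, q)

# a and b sit close together, so they count as "similar"
# and would likely end up in the same cluster.
print(distance(a, b))

# c is far from a, so it would likely land in a different cluster.
print(distance(a, c))
```

The first number printed is much smaller than the second, which is all “similar” means here.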

Now that we’ve got the basic idea, let’s get into the nitty-gritty of the topic.

K-means Algorithm:
Research says that graphical information is easier to remember than textual information. So let’s learn with pictures as they have better Recall!

All right, here we’ve got a scatter plot. Say we have two variables from our data set plotted along the X-axis and Y-axis (the “before K-means” graph). The question is: can we identify distinct groups among our data points, and how would we go about doing it? How do we identify the number of groups?

What K-means does for you is take the complexity out of this decision-making process and let you easily identify those clusters of data points in a dataset (the “after K-means” graph). There you go: we’ve got three clusters (red, blue, green). This was a very simplified example with only two variables, and so two dimensions, but note that K-means also works on multi-dimensional data.
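If you’d like to see that “before” to “after” step in code, here is a minimal sketch of Lloyd’s algorithm, the standard textbook way of computing K-means, in plain Python. Treat it as an illustration under assumptions rather than a reference implementation: the function name `k_means`, the `iterations` cap, the `seed` and the sample points are all invented for this example.

```python
import math
import random

def k_means(points, k, iterations=100, seed=0):
    """A sketch of Lloyd's algorithm, a common way to compute K-means."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points picked at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # Centroids stopped moving, so the clusters are stable.
        centroids = new_centroids
    return labels, centroids

# Made-up data: three loose groups in two dimensions.
points = [
    (1.0, 1.2), (1.5, 0.8), (0.8, 1.1),
    (5.0, 5.2), (5.5, 4.8), (4.9, 5.1),
    (9.0, 1.0), (9.4, 1.3), (8.8, 0.7),
]
labels, centroids = k_means(points, k=3)
print(labels)     # one cluster index (0, 1 or 2) per point
print(centroids)  # the centre of each cluster
```

Nothing in the code is specific to two dimensions, so the same function accepts points with three or more coordinates. Keep in mind that the grouping you get can depend on the starting centroids, so a different `seed` may produce a different result.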

To see how it works, please refer to the link below: