Unsupervised Learning Explained Using K-Means Clustering

March 7, 2023

Different learning methods and patterns are generally associated with the human mind. The visual, auditory, kinesthetic, and reading/writing methods of learning are widely recognized as the four primary methods by which humans learn. The utility of these learning methods varies from person to person. While Jack may learn more effectively by reading a book and writing key points from what he has learned, Jill may learn more effectively by doing and putting what she has learned into action, which is the kinesthetic form of learning.

Machine learning models, like humans, can learn patterns in data in a variety of ways. There are two main methods of learning: supervised and unsupervised learning. As with human learning styles, each method may be great for some use cases but less effective when applied to other problems.

In this article, we will look at the various machine learning methods, their differences, and their respective use cases. Then, we’ll take a closer look at how unsupervised learning works by studying the k-means clustering algorithm and implementing it in Python. It is recommended that you are familiar with the Python programming language in order to follow along with this article.

Types of Machine Learning Methods

As previously stated, there are two methods for training a machine learning model: supervised and unsupervised learning.

Supervised Learning

This training method involves feeding labeled data to the machine learning algorithm and allowing it to find patterns in the data. Labeled data is data in which each example carries a tag, or description, of what it represents. In essence, the label tells the algorithm what each piece of data means.

Consider data containing images of animals and their names. A picture of a cat, for example, is labeled with the name: cat, as are pictures of dogs, birds, and all other animals. With this information, the algorithm can use the animal names to find similarities between the images. As a result, when tested with an image of a cat it has never seen before, it can infer from previous data that the image presented is of a cat.

There are two types of supervised learning: classification and regression. Predicting a discrete output, such as whether it will rain today or the name of an animal, is what classification is all about. Regression, on the other hand, predicts continuous values such as the price of a house or an employee’s salary. Linear Regression, Logistic Regression, Support Vector Machines, K-Nearest Neighbors, and Decision Trees are examples of supervised learning algorithms. Supervised learning is frequently used to solve problems like weather forecasting, sales forecasting, and stock price analysis.
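To make the difference between classification and regression concrete, here is a minimal sketch using scikit-learn. The data (hours studied, pass/fail labels, and exam scores) is made up purely for illustration, and the variable names are assumptions rather than part of any real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy, made-up data: hours studied by six hypothetical students.
hours = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Classification: the label is discrete (0 = fail, 1 = pass).
passed = np.array([0, 0, 0, 1, 1, 1])
classifier = LogisticRegression()
classifier.fit(hours, passed)
predicted_class = classifier.predict(np.array([[5.5]]))[0]

# Regression: the label is continuous (an exam score).
scores = np.array([52.0, 58.0, 61.0, 70.0, 74.0, 81.0])
regressor = LinearRegression()
regressor.fit(hours, scores)
predicted_score = regressor.predict(np.array([[5.5]]))[0]

print("Predicted class:", predicted_class)
print("Predicted score:", round(float(predicted_score), 1))
```

Both models see the same input, but the classifier returns one of the known labels while the regressor returns a number on a continuous scale.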

Next, we’ll discuss unsupervised learning.

Unsupervised Learning

Unsupervised learning takes the opposite approach to supervised learning: it trains machine learning algorithms on unlabeled data, that is, data with no tag or description. The goal of unsupervised learning is to find patterns in the data and group it into different sets based on similarities.

Consider another image dataset, but this time without any labels. The unsupervised algorithm finds similarities between these images and categorizes them based on how similar they are. The catch here is that the algorithm is unaware that one group of images shows cats while another shows dogs. It simply determines that these images should be grouped together based on their similarities.

Clustering and association are the two types of unsupervised learning. Clustering involves the algorithm grouping similar data points together, such as grouping cats and dogs together because they are animals and grouping books and pencils together because they are stationery items. Association, on the other hand, measures the likelihood of two things being related.

A good example of association is training an unsupervised algorithm on grocery data. It will most likely associate bread and butter because they are, for the most part, purchased together. Furthermore, clustering is concerned with samples (rows of data), whereas association is concerned with variables.
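One common way to quantify how strongly two items are related is through the support and confidence measures used in association rule mining. The article does not name a specific measure, so treat the sketch below as one possible illustration. The baskets are made up, and the helper functions `support` and `confidence` are hypothetical names written for this example.

```python
# Made-up grocery baskets, purely for illustration.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def support(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    matches = sum(1 for basket in transactions if items <= basket)
    return matches / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


bread_and_butter = support({"bread", "butter"}, baskets)
butter_given_bread = confidence({"bread"}, {"butter"}, baskets)

print("support(bread, butter):", bread_and_butter)
print("confidence(bread -> butter):", butter_given_bread)
```

Here, support measures how often bread and butter appear in the same basket, while confidence measures how often butter is bought in baskets that already contain bread.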

K-Means Clustering, Hierarchical Clustering, DBSCAN, and Principal Component Analysis (PCA) are examples of unsupervised learning algorithms. These algorithms can be used to solve customer segmentation, churn analysis, cross-selling strategy, and image segmentation problems.

K-Means Clustering: What Is It?

Before delving into the concept of K-Means Clustering, it is necessary to first define a cluster. A cluster is a group of data points that have been brought together due to certain similarities. The value of K chosen determines the number of clusters. There are several ways to determine the value of K. The first is trial and error, which can take time and may not produce the required accuracy. The Elbow method and the Silhouette method are two better ways to select the value of K.
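As a rough illustration of the Elbow method, the sketch below fits K-Means for several values of K and records the inertia (the sum of squared distances from each point to its nearest centroid). The synthetic three-blob data, the range of K values, and the `n_init` and `random_state` settings are all assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with three well-separated blobs (an assumption for this sketch).
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
points = np.vstack(
    [center + rng.normal(scale=0.5, size=(100, 2)) for center in blob_centers]
)

# Fit K-Means for several values of K and record the inertia of each fit.
k_values = list(range(1, 7))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(points)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"K={k}: inertia={inertia:.1f}")
```

The "elbow" is the value of K after which the inertia stops dropping sharply. On data like this, the drop should flatten out around K=3, which matches the number of blobs that were generated.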

Now, let’s talk about K-Means clustering. K-Means clustering is an unsupervised machine learning algorithm that groups similar data points into K clusters. It is a form of partitional clustering, which divides a data set into separate, non-overlapping clusters.

K-Means Clustering: How It Works

Next, let’s understand how it works.

This is how K-Means clustering is executed:

  1. First, the algorithm chooses K centroids at random from the data points, so the number of centroids is equal to the value of K.
  2. It then computes the distance between each centroid and all of the data points. This distance can be calculated using a variety of methods, including the Euclidean distance method, the Cosine distance method, the Squared Euclidean distance method, and the Manhattan distance method.
  3. The algorithm then assigns each data point to the centroid that is closest to it. The data points associated with a specific centroid now form a cluster.
  4. Because the centroids were chosen at random at first, they are most likely not the best fit, so the algorithm calculates the mean of the data points in each cluster, and that mean becomes the new location of the cluster’s centroid.
  5. When the centroids’ locations change, the algorithm calculates the distance between each centroid and the data points again. It then reassigns the data points to the centroids that are closest to them. It repeats this process until the centroids’ locations are stable.
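The steps above can be sketched in a few lines of NumPy. This is a simplified illustration that assumes Euclidean distance, not the implementation scikit-learn uses. The function name `kmeans_sketch` and its parameters are hypothetical, and the test data is synthetic.

```python
import numpy as np


def kmeans_sketch(points, k, max_iters=100, seed=0):
    """Plain K-Means with Euclidean distance; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: distance from every point to every centroid, shape (n, k).
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        # Step 3: assign each point to its closest centroid.
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Usage on two made-up, well-separated blobs of 50 points each.
rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
centroids, labels = kmeans_sketch(data, k=2)
print(centroids)
```

Keeping the previous centroid for an empty cluster is just one simple way to handle that edge case; other strategies exist, such as re-seeding the centroid from a random data point.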

Implementing K-Means Clustering In Python

For the Python implementation of K-Means clustering, we’ll use the scikit-learn (sklearn) library to access the K-Means algorithm and the matplotlib library for visualization.

We start by importing the necessary libraries:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

Following that, we generate some random data to train the algorithm. The following code generates 1500 data points in 2D space.

# Generating random data
np.random.seed(0)
n_samples = 1500
X = np.random.randn(n_samples, 2)

The next step will be to apply the K-Means algorithm to the above data. The number of clusters is set to 3 in the code below, but you can experiment with different numbers to see what happens.

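# Applying K-means clustering
# n_init and random_state are set explicitly so the result is reproducible
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)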
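Before plotting, it can help to inspect what the fitted model learned. The sketch below regenerates the same random data so that it runs on its own, then prints the cluster sizes, the centroids, and the inertia. The `n_init` and `random_state` arguments are assumptions added here for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans

# Recreate the same random data so this snippet is self-contained.
np.random.seed(0)
X = np.random.randn(1500, 2)

# fit_predict fits the model and returns the cluster label of each point.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# How many points landed in each cluster.
cluster_sizes = np.bincount(labels, minlength=3)
print("Cluster sizes:", cluster_sizes)
# Coordinates of the three centroids, one row per cluster.
print("Centroids:")
print(kmeans.cluster_centers_)
# Sum of squared distances from each point to its nearest centroid.
print("Inertia:", kmeans.inertia_)
```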

Finally, to see the clustering result, use the code below to visualize it:

# Visualizing the results
# Color each point by its assigned cluster
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
# Overlay the cluster centroids as large, semi-transparent black markers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
plt.title('K-means clustering')
plt.show()

Here’s the result for 3 clusters:

For 4 clusters, here’s the result:

And lastly, for 5 clusters:

At a certain point, increasing the number of clusters splits the data into groups that no longer differ from one another in any meaningful way. Hence, it is advisable to use the Elbow or Silhouette method to find the optimal number of clusters.
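As one possible way to apply the Silhouette method, the sketch below scores several values of K with scikit-learn's `silhouette_score` and keeps the K with the highest score. The synthetic three-blob data and the candidate range of K are assumptions made for this example; real data rarely separates this cleanly.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated blobs (an assumption for this sketch).
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
points = np.vstack(
    [center + rng.normal(scale=0.5, size=(100, 2)) for center in blob_centers]
)

# The silhouette score needs at least 2 clusters, so K starts at 2.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# Pick the K with the highest average silhouette score.
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"K={k}: silhouette={score:.3f}")
print("Best K:", best_k)
```

The silhouette score ranges from -1 to 1, and higher values indicate that points sit closer to their own cluster than to neighboring ones.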

Conclusion

In this article, we gave an overview of the two main types of machine learning methods and used the K-Means clustering algorithm to show how unsupervised learning works. This algorithm is useful for image segmentation, customer segmentation, and anomaly detection in real-world applications.

However, K-means clustering has limitations, including sensitivity to initial centroid placement and the assumption of equal variance within each cluster. As a result, it is critical to evaluate the clustering algorithm’s performance using appropriate metrics and consider alternative approaches if the results are not satisfactory.
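To get a feel for the sensitivity to initial centroid placement, the sketch below runs K-Means several times with a single random initialization each and compares the resulting inertia. The choice of 5 clusters, the seeds, and the use of `init="random"` with `n_init=1` are assumptions made to expose the effect. How much the values differ depends on the data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same style of random data as used earlier in the article.
np.random.seed(0)
X = np.random.randn(1500, 2)

# One random initialization per run, so each seed gives a single attempt.
inertias = []
for seed in range(5):
    model = KMeans(n_clusters=5, init="random", n_init=1, random_state=seed)
    model.fit(X)
    inertias.append(model.inertia_)

# A single model that keeps the best of 10 random initializations.
best_of_ten = KMeans(n_clusters=5, init="random", n_init=10, random_state=0).fit(X)

for seed, inertia in enumerate(inertias):
    print(f"seed={seed}: inertia={inertia:.2f}")
print(f"best of 10 initializations: inertia={best_of_ten.inertia_:.2f}")
```

Running several initializations and keeping the one with the lowest inertia, as the `n_init` parameter does, is a common way to reduce this sensitivity.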

That’s all for now. See you in our next article. ✌