1️⃣K-Means Clustering

K-Means Clustering is an unsupervised learning algorithm that partitions a set of n-dimensional points into k clusters, where k is a user-specified parameter. The goal of the algorithm is to group similar data points together and separate dissimilar ones.

The algorithm works by first initializing k centroids randomly, which are points representing the center of each cluster. The algorithm then assigns each data point to the cluster whose centroid is closest to it based on some distance metric, typically the Euclidean distance. Once all the points have been assigned to clusters, the centroids are recalculated as the mean of all the points in each cluster. This process is repeated until the centroids no longer change or a maximum number of iterations is reached.
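The loop described above can be sketched from scratch in a few lines of NumPy. This is a minimal illustration, not a production implementation (for simplicity it initializes centroids from random data points and does not handle the rare case of a cluster becoming empty):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-Means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize k centroids by picking k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```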

The K-Means algorithm is sensitive to the initial positions of the centroids, so it's commonly run multiple times with different initial positions and the best result is chosen.
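scikit-learn automates these restarts through the `n_init` parameter of `KMeans`. The sketch below makes the idea explicit by running a few single-initialization fits and keeping the one with the lowest inertia (the within-cluster sum of squared distances); the data here is synthetic and only for illustration:

```python
from sklearn.cluster import KMeans
import numpy as np

np.random.seed(0)
X = np.random.rand(100, 2)

# Run K-Means with several different initializations and keep the best result,
# i.e. the fit with the lowest inertia (within-cluster sum of squared distances).
best = None
for seed in range(5):
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
    if best is None or km.inertia_ < best.inertia_:
        best = km

# In practice scikit-learn does this internally: n_init=10 runs 10 restarts
# and returns the best one automatically.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
```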

A key parameter in K-Means is the number of clusters, k, which needs to be specified in advance. A common approach is to run the algorithm for several values of k and select the one that gives the best result according to some evaluation metric, such as the silhouette score, the elbow method, or the gap statistic.
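One of those approaches can be sketched as follows: fit K-Means for a range of candidate k values on synthetic data and compare silhouette scores (higher is better; the `inertia_` attribute could be used the same way for the elbow method). The data and the candidate range are illustrative choices:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

np.random.seed(0)
X = np.random.rand(200, 2)

# Fit K-Means for each candidate k and record the silhouette score,
# which measures how well each point fits its own cluster vs. the next one.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette score = {scores[k]:.3f}")

# Pick the k with the highest silhouette score
best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```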

K-Means is a widely used clustering algorithm: it is simple to understand, easy to implement, and computationally efficient even for large datasets. However, it has some limitations. It assumes that clusters are roughly equal in size and spherical (isotropic) in shape, which is often not the case in real-world data, and it is sensitive to the initial conditions, noise, and outliers.

In order to overcome these limitations, other methods such as Hierarchical Clustering, DBSCAN, and Gaussian Mixture Model can be used.
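For comparison, here is a brief sketch of two of those alternatives from scikit-learn, applied to the same kind of synthetic data. The `eps` and `min_samples` values for DBSCAN are illustrative guesses that would need tuning per dataset:

```python
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture
import numpy as np

np.random.seed(0)
X = np.random.rand(100, 2)

# DBSCAN discovers arbitrarily shaped clusters from density and marks
# outliers with the label -1; no number of clusters is specified up front.
db_labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

# A Gaussian Mixture Model allows elliptical clusters of different sizes,
# relaxing K-Means' spherical, equal-size assumption.
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)
```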

Analogy:

An analogy for K-Means Clustering is to think of it as a way to separate a group of items into different categories based on their similarities. Imagine you have a collection of items, such as books, and you want to classify them into different groups based on their genre. The K-Means algorithm would work by first randomly selecting a few books as representative examples of each genre (the centroids), and then iteratively assigning each book to the genre whose representative example is most similar to it (the closest centroid). After each iteration, the representative examples of each genre are updated to better match the books that have been assigned to them.

Another analogy would be to think of it as a way to divide a group of people into different groups based on their characteristics. For example, you have a group of people with different ages, incomes, and educational levels. You could use K-Means to divide this group into different clusters, such as young adults, middle-aged individuals and seniors, high-income, medium-income and low-income, or high-education, medium-education and low-education.

In both examples, the goal is to group similar items or people together, and separate dissimilar items or people.

Example

In this example, we first generate some synthetic data using the numpy library. Next, we create an instance of the KMeans class with 3 clusters and a fixed random seed, fit the model to the data using the fit method, and obtain the cluster labels for the same data using the predict method.

K-Means is a simple and widely used clustering algorithm: it is fast and easy to understand and implement. However, it is sensitive to initial conditions, the number of clusters needs to be specified in advance, and it assumes that clusters have similar sizes and spherical shapes, which might not be the case in real-world problems.

Python code

from sklearn.cluster import KMeans
import numpy as np

# Generate some synthetic data: 100 random points in 2 dimensions
X = np.random.rand(100, 2)

# Create the model with 3 clusters and fit it to the data
# (n_init=10 runs 10 random initializations and keeps the best result)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Predict the cluster for each data point
y_pred = kmeans.predict(X)
print(y_pred)

Output (the exact labels vary between runs, since the input data is generated randomly)

[2 2 1 1 1 0 1 1 0 0 0 2 1 2 1 0 0 1 2 0 1 2 2 2 0 2 1 1 0 0 0 0 1 2 2 1 1
 1 0 2 1 2 2 1 1 1 0 0 0 1 0 0 0 2 1 1 0 1 1 1 2 2 0 2 1 1 0 1 1 2 2 1 2 1
 0 1 0 1 1 0 2 0 2 2 0 1 2 1 2 2 2 2 2 2 2 0 1 1 1 1]

