1️⃣K-Means Clustering
K-Means clustering is an unsupervised learning algorithm that partitions a set of n-dimensional points into k clusters, where k is a user-specified parameter. The goal is to group similar data points together and keep dissimilar data points apart.
The algorithm starts by randomly initializing k centroids, the points that represent the center of each cluster. It then assigns each data point to the cluster whose centroid is closest according to some distance metric, typically the Euclidean distance. Once all points have been assigned, each centroid is recalculated as the mean of the points in its cluster. The assignment and update steps are repeated until the centroids no longer change or a maximum number of iterations is reached.
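The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function names `assign` and `kmeans`, the choice to pick initial centroids from the data points, and the handling of empty clusters are all assumptions made for the example.

```python
import numpy as np

def assign(X, centroids):
    """Return, for each point, the index of its nearest centroid (Euclidean)."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, max_iter=100, seed=0):
    """Basic k-means loop; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points at random as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest centroid.
        labels = assign(X, centroids)
        # Update step: each centroid becomes the mean of its assigned points.
        # Assumption: an empty cluster simply keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign(X, centroids)

# Toy data: two well-separated Gaussian blobs of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```

On data this well separated, each blob should end up in its own cluster, although which blob receives label 0 depends on the random initialization.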

The K-Means algorithm is sensitive to the initial positions of the centroids, so it is commonly run several times with different initializations, and the run with the lowest within-cluster sum of squared distances is kept.
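One way to see this sensitivity is to run the algorithm with several seeds and compare the results. The sketch below assumes scikit-learn's `KMeans` and uses hypothetical toy data; `init="random"` and `n_init=1` are set so that every run starts from exactly one random initialization.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical toy data: three Gaussian blobs in 2-D.
X = np.vstack([
    rng.normal(loc=center, scale=0.7, size=(60, 2))
    for center in ([0, 0], [4, 0], [2, 4])
])

runs = []
for seed in range(10):
    # One random initialization per run, controlled by the seed.
    model = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    runs.append(model)

# inertia_ is the within-cluster sum of squared distances; lower is better.
inertias = [m.inertia_ for m in runs]
best = min(runs, key=lambda m: m.inertia_)
print("inertia per run:", np.round(inertias, 2))
print("best inertia:", round(best.inertia_, 2))
```

In practice the loop is rarely written by hand: the `n_init` parameter of `KMeans` performs the repeated runs internally and keeps the one with the lowest inertia.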
A key parameter in K-Means is the number of clusters, k, which must be specified in advance. A common way to choose k is to run the algorithm for a range of values and pick the one that looks best under some criterion, such as the silhouette score, the elbow method (looking for the point where the within-cluster sum of squares stops dropping sharply), or the gap statistic.
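As a rough sketch of this selection procedure, the snippet below scans k from 2 to 6 and records both the silhouette score and the inertia (the quantity plotted in the elbow method). It assumes scikit-learn; the blob centers, the spread, and the range of k are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated blobs, so k=3 is the natural answer.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [5, 9]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
inertias = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_                      # for an elbow plot
    scores[k] = silhouette_score(X, model.labels_)    # in [-1, 1], higher is better

best_k = max(scores, key=scores.get)
for k in scores:
    print(k, round(scores[k], 3), round(inertias[k], 1))
print("k with the highest silhouette score:", best_k)
```

The silhouette score gives a single number to maximize, whereas the elbow method relies on visually judging where the inertia curve flattens, so the two do not always agree on real data.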
K-Means is a widely used clustering algorithm: it is simple to understand, easy to implement, and computationally efficient on large datasets. It also has limitations. It implicitly assumes that clusters are roughly spherical (isotropic) and of similar size, which is often not the case in real-world problems. It is also sensitive to initial conditions, noise, and outliers.
To work around these limitations, other methods such as hierarchical clustering, DBSCAN, and Gaussian mixture models can be used.
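The spherical-cluster assumption can be illustrated on the classic two-moons dataset, where the clusters are curved rather than round. The sketch below assumes scikit-learn and compares K-Means with DBSCAN using the adjusted Rand index against the known labels; the `eps` and `min_samples` values are hand-picked for this dataset and would need tuning elsewhere.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: non-spherical clusters.
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# eps and min_samples are assumptions chosen for this toy dataset.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print("K-Means ARI:", round(ari_kmeans, 3))
print("DBSCAN ARI:", round(ari_dbscan, 3))
```

K-Means tends to cut the moons with a straight boundary, while a density-based method can follow the curved shapes.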
Example
In this example, we first generate some synthetic data using the numpy library. Next, we create an instance of the KMeans class with 3 clusters and a random seed of 0. We then fit the model to the data using the fit method and make predictions on the same data using the predict method.
K-Means is a simple and widely used clustering algorithm that is fast and easy to understand and implement. However, it is sensitive to initial conditions, and the number of clusters must be specified in advance. It also assumes that clusters are of similar size and spherical in shape, which might not be the case in real-world problems.
Python code
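A minimal sketch consistent with the description above, assuming the `KMeans` class is the one from scikit-learn. The shape and distribution of the synthetic data (100 uniform random points in 2-D) and the explicit `n_init=10` are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: 100 random 2-D points (shape and distribution are assumed).
np.random.seed(0)
X = np.random.rand(100, 2)

# KMeans with 3 clusters and a random seed of 0.
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)

# Fit the model to the data, then predict cluster labels for the same data.
kmeans.fit(X)
labels = kmeans.predict(X)

print("Cluster centers:")
print(kmeans.cluster_centers_)
print("First 10 labels:", labels[:10])
```

The script prints the three centroid coordinates and the cluster index of the first ten points; exact values can vary with the library version.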
Output
References
https://towardsdatascience.com/k-means-a-complete-introduction-1702af9cd8c