2️⃣ Hierarchical Clustering

Hierarchical Clustering is a type of unsupervised machine learning algorithm used to group similar objects into clusters or classes. It builds a hierarchy of clusters in which each cluster is subdivided into smaller clusters, down to the level of individual data points.

There are two main approaches for Hierarchical Clustering: Agglomerative and Divisive.

  • Agglomerative Hierarchical Clustering: This approach starts with each data point as a separate cluster, and then merges the closest pair of clusters until only a single cluster remains. This process is also known as a "bottom-up" approach (a small sketch of the merging process follows this list).

  • Divisive Hierarchical Clustering: This approach starts with all data points in a single cluster, and then recursively splits the cluster into smaller clusters until each data point forms its own cluster. This process is also known as a "top-down" approach.
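To make the bottom-up merging concrete, here is a minimal sketch that uses SciPy's linkage function on a tiny, made-up set of points and prints the merge performed at each step. The points and the single-linkage choice are purely illustrative, and the sketch assumes SciPy is installed.

import numpy as np
from scipy.cluster.hierarchy import linkage

# a tiny, made-up 2D dataset (illustrative only)
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 5.2], [9.0, 0.0]])

# agglomerative clustering: every point starts as its own cluster,
# and the closest pair of clusters is merged at each step
merges = linkage(points, method='single')

# each row is one merge: [cluster i, cluster j, distance, size of the new cluster]
print(merges)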

The result of hierarchical clustering is a tree-like diagram called a dendrogram, which shows the merging or splitting of clusters at each step. Cutting the dendrogram at a chosen level yields a particular number of clusters, so it can be used to decide how many clusters to keep.
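As a sketch of working with the dendrogram itself (again assuming SciPy), the snippet below builds the hierarchy, draws the dendrogram, and then cuts it at a distance threshold to obtain flat cluster labels; the threshold of 3.0 is an arbitrary value chosen only for illustration.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# small illustrative dataset
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 5.2], [9.0, 0.0]])

# build the hierarchy of merges
Z = linkage(X, method='ward')

# draw the dendrogram: each vertical join is one merge, drawn at the height of its distance
dendrogram(Z)
plt.xlabel('data point index')
plt.ylabel('merge distance')
plt.show()

# "cut" the dendrogram at a distance of 3.0 to get flat cluster labels
labels = fcluster(Z, t=3.0, criterion='distance')
print(labels)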

Hierarchical Clustering is often used when the number of clusters is not known in advance, and it's also useful for exploring the structure of the data. However, it can be sensitive to the scale of the data, and it may be slow for large datasets.
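To address the scale sensitivity, features are often standardized before clustering. Below is a minimal sketch of this preprocessing step, assuming scikit-learn and made-up data in which one feature has a much larger scale than the other.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

# made-up data: the second feature has a much larger scale than the first
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 1000, 100)])

# standardize each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# cluster on the scaled features so no single feature dominates the distances
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_scaled)
print(labels[:10])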

Hierarchical Clustering can be implemented using various linkage methods, such as Single Linkage, Complete Linkage, Average Linkage, and Ward linkage. Each linkage method defines a different criterion for measuring the distance between two clusters, as listed below.

  • Single linkage: based on the minimum distance between any two points in different clusters.

  • Complete linkage: based on the maximum distance between any two points in different clusters.

  • Average linkage: based on the average distance between all pairs of points in different clusters.

  • Ward linkage: merges the pair of clusters that gives the smallest increase in total within-cluster variance.
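In scikit-learn, the linkage criterion is chosen through the linkage parameter of AgglomerativeClustering. The following sketch, with illustrative parameter values, fits the same made-up blob data with each of the four criteria so their label assignments can be compared.

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# made-up data with three well-separated blobs
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# fit the same data with each linkage criterion and compare the resulting labels
for linkage in ['single', 'complete', 'average', 'ward']:
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    print(linkage, model.fit_predict(X)[:10])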

Hierarchical Clustering is a useful method for exploring the structure of a dataset, but it is usually less efficient than other clustering methods such as k-means: standard agglomerative algorithms need at least quadratic time and memory in the number of samples. It is therefore generally used for exploratory data analysis rather than as a final clustering method.

Analogy:

A helpful analogy for hierarchical clustering is building a family tree. Imagine you are trying to create a family tree for a group of people, and you want to group them into families based on their similarities.

You start by looking at all the people individually, and then you find the two people who are most similar to each other and group them together as a family. Then you repeat the process, looking for the next most similar pair of families and grouping them together into a bigger family. This continues until all the people are joined into one big family tree.

This is similar to how agglomerative Hierarchical Clustering works: it starts by treating each data point as a separate cluster, and then it iteratively merges the closest pair of clusters until only a single cluster remains. The result is a tree-like diagram called a dendrogram, which shows the merge performed at each step.

Just like in the case of creating a family tree, hierarchical clustering is useful when the number of clusters is not known in advance, and it's also useful for exploring the structure of the data. However, it can be sensitive to the scale of the data, and it may be slow for large datasets.

Example

In this example, the make_blobs function is used to generate synthetic data with three clusters. Then, the AgglomerativeClustering class is imported and used to perform hierarchical clustering on the data. The number of clusters is set to 3. The fit() method is used to fit the model to the data.

Finally, the results are plotted as a scatter plot where the x and y axes are the two features of the dataset and the color of each point indicates the cluster it belongs to.

It's important to note that this is a simple example. Real-world datasets are usually more complex, the features may be represented by many variables, and the linkage method should be chosen based on the problem, but the idea behind hierarchical clustering remains the same. Hierarchical Clustering is a powerful method for exploring the structure of a dataset, and it's useful when the number of clusters is not known in advance.

Python code

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# generate synthetic dataset
X, y = make_blobs(n_samples=150, n_features=2, centers=3, cluster_std=0.5, shuffle=True, random_state=0)

# perform hierarchical clustering
agg_clustering = AgglomerativeClustering(n_clusters=3)
agg_clustering.fit(X)

# plot the results
plt.scatter(X[:, 0], X[:, 1], c=agg_clustering.labels_, cmap='rainbow')
plt.show()

Output

Running the script displays a scatter plot in which each point is colored according to the cluster assigned to it, so the three clusters found by the model appear as three differently colored groups.
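As an optional follow-up to the script above (not part of the original example), the recovered labels can be compared against the blob assignments returned by make_blobs, for instance with scikit-learn's adjusted_rand_score; a score close to 1.0 means the clustering closely matches them.

from sklearn.metrics import adjusted_rand_score

# y and agg_clustering come from the script above
print(adjusted_rand_score(y, agg_clustering.labels_))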

