Implement Agglomerative Hierarchical Clustering with Python | by Yufeng

Hierarchical clustering is one of the most basic clustering methods in statistical learning. If your dataset is not very big and you want to see not only the cluster label of each point but also some of the internal structure of the data, hierarchical clustering is a good starting point.

There are two types of hierarchical clustering methods: agglomerative and divisive.

The only difference in terms of algorithm design is the direction of the clustering procedure.

Agglomerative clustering is a bottom-up approach: each data point starts as its own cluster, and the clusters are then iteratively merged into larger ones.

Divisive clustering, in contrast, is a top-down approach: the entire dataset starts as a single cluster, which is then split recursively into smaller clusters.
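To make the bottom-up direction concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions, not the implementation developed in this post: the function names (`single_linkage`, `agglomerate`), the toy points, and the choice of single linkage (distance between the closest pair of points) as the merge criterion are all assumptions made for this example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+


def single_linkage(a, b):
    """Distance between two clusters: the distance of their closest pair of points."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    # Bottom-up start: every point is its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters (i < j) with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        # Merge cluster j into cluster i; since j > i, popping j keeps i valid.
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters


# Toy 2-D data, made up for illustration.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerate(points, n_clusters=3)
print(result)
```

This brute-force version recomputes every pairwise cluster distance at each step, so it is only suitable for very small inputs. A divisive method would run the other way, starting from one cluster that holds all points and choosing how to split it.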

Agglomerative clustering is the more commonly used of the two, thanks to its simpler implementation, better handling of noise, and computational efficiency on reasonably sized datasets, so we will primarily discuss it in this post.

Basic idea

Let’s think about how kids form play groups in a community. At the very beginning, they don’t know much about each other, so everyone is their own group.

After a while, once every kid knows everyone else’s characteristics and interests, they tend to gather with the kids who are most similar to themselves. At this step, pairs of individuals start to merge into small groups, and each small group can in turn be merged into a larger group by combining with another individual or small group.
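The play-group picture can be sketched in a few lines of Python. Everything here is hypothetical and chosen only for illustration: the kids' names, their one-dimensional "interest scores", and the use of the average score difference as the measure of how dissimilar two groups are. The loop keeps merging until a single group is left and records each merge along the way.

```python
from itertools import combinations

# Hypothetical interest scores; both the names and the numbers are made up.
kids = {"Ann": 1.0, "Ben": 1.5, "Cat": 4.0, "Dan": 4.2, "Eve": 9.0}


def group_distance(g1, g2):
    """Average absolute score difference between members of two groups."""
    diffs = [abs(kids[a] - kids[b]) for a in g1 for b in g2]
    return sum(diffs) / len(diffs)


# At the very beginning, every kid is their own group.
groups = [(name,) for name in kids]
merge_history = []

while len(groups) > 1:
    # Pick the two groups that are most similar (smallest distance).
    g1, g2 = min(combinations(groups, 2), key=lambda pair: group_distance(*pair))
    groups.remove(g1)
    groups.remove(g2)
    groups.append(g1 + g2)           # the merged group replaces the two originals
    merge_history.append((g1, g2))   # remember who joined whom, in order

for step, (g1, g2) in enumerate(merge_history, start=1):
    print(f"step {step}: {g1} + {g2}")
```

The recorded `merge_history` is the nested structure that hierarchical clustering exposes: the earliest merges pair up the most similar kids, and later merges join whole groups.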

The merging process then repeats, based on how similar two groups of kids are to each other. The process can be stopped at some point when the kids think the number of big groups is reasonably small or no two groups…