Clustering

Clustering is a method that divides data points into groups such that points within the same group are more similar to each other than to points in other groups.

For example, if we know the petal length and sepal length of some flowers, we can divide these flowers into several groups.

*Figure: flowers grouped by petal length and sepal length*
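As a concrete, runnable version of this example, the sketch below loads the classic iris dataset (an assumption — the original figure may be based on different data) and plots sepal length against petal length with scikit-learn and matplotlib:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# iris records sepal/petal measurements for 150 flowers
iris = load_iris()
X = iris.data[:, [0, 2]]  # column 0: sepal length (cm), column 2: petal length (cm)

plt.scatter(X[:, 0], X[:, 1])
plt.xlabel("sepal length (cm)")
plt.ylabel("petal length (cm)")
plt.show()  # the points visibly fall into a few groups
```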

k-means

  1. Specify the number of clusters k
  2. Randomly assign each data point to a cluster
  3. Compute the centroid of each cluster
  4. Re-assign each point to its closest centroid
  5. Re-compute the centroids for each cluster
  6. Repeat steps 4 and 5 until the assignments no longer change (a code sketch follows the animation below)

![k-means](https://image.jiqizhixin.com/uploads/editor/7958a474-21ec-4e94-a7a1-c53b63556d97/640.gif)
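Here is a minimal NumPy sketch of the six steps above (not the author's code; it uses the random-assignment initialization from step 2 and ignores the empty-cluster edge case for brevity):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly assign each data point to one of the k clusters
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # Steps 3 and 5: compute the centroid of each cluster
        # (assumes no cluster ever becomes empty)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: re-assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 6: stop when the assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# usage on toy data: three Gaussian blobs in 2D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in [(0, 0), (3, 3), (0, 3)]])
labels, centroids = kmeans(X, k=3)
```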

Advantage:

  • It converges quickly, even on large datasets

Disadvantage:

  • Hard to choose the value of k
  • It does not work well for clusters with different densities or sizes
  • It depends on the initial centroids; different initializations may produce different clustering outcomes

How to set the value of k?

There is a popular method known as the elbow method, which is used to determine the optimal value of k for the k-means clustering algorithm. The basic idea is to plot the cost for a range of values of k. As k increases, each cluster contains fewer points, and fewer points per cluster means the points sit closer to their centroids, so the average distortion decreases. The point after which the distortion stops declining sharply is the elbow point.
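A short sketch of the elbow method using scikit-learn's KMeans, whose inertia_ attribute is exactly this within-cluster sum of squared distances (the data and the range of k are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in [(0, 0), (3, 3), (0, 3)]])

ks = range(1, 9)
costs = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    costs.append(model.inertia_)  # cost: within-cluster sum of squared distances

plt.plot(list(ks), costs, marker="o")
plt.xlabel("k")
plt.ylabel("cost")
plt.show()  # pick the k where the curve bends (the elbow)
```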

For example:

*Figure: scatter plot of the data points, forming 3 clusters*

In the above figure, it is clearly observed that the points form 3 clusters. Now, let's look at the plot of the squared error (cost) for different values of k.

*Figure: plot of the cost for different values of k; the elbow forms at k = 3*

Clearly the elbow forms at k = 3, so the optimal value of k for k-means is 3.

Hierarchical Clustering

Hierarchical clustering can be divided into 2 types: top-down (divisive) and bottom-up (agglomerative).

For Bottom-up:

  1. Start from the bottom by treating each data point as its own cluster
  2. Merge the two closest clusters into one (using a distance measure between clusters)
  3. The best choice for the number of clusters is the number of vertical lines crossed at the largest vertical distance between merges in the dendrogram (here, the number of blue lines between A and B); see the sketch after the figure below

*Figure: dendrogram of bottom-up clustering, with the largest merge distance between A and B*
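A minimal sketch of bottom-up clustering with SciPy (Ward linkage is one common choice of cluster distance and is an assumption here, not the author's stated choice):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in [(0, 0), (3, 3), (0, 3)]])

# bottom-up: start with each point as a cluster, repeatedly merge the two closest
Z = linkage(X, method="ward")
dendrogram(Z)  # the vertical line lengths show the merge distances
plt.show()

# cut the dendrogram into 3 clusters (where the vertical gap is largest)
labels = fcluster(Z, t=3, criterion="maxclust")
```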

Comparison

  • Hierarchical clustering cannot handle big data, but k-means can: k-means is O(n), while hierarchical clustering is O(n^2)

  • k-means works well when the clusters are circular in 2D or spherical in 3D

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN is a density-based clustering algorithm: it finds neighbors by density, requiring at least MinPts points inside an n-dimensional sphere of radius ɛ. A cluster can be defined as a maximal set of 'density-connected' points in the feature space.
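A short sketch with scikit-learn's DBSCAN, where eps is the radius ɛ and min_samples is MinPts (the values below are illustrative, not recommendations):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in [(0, 0), (3, 3)]])

# eps: neighborhood radius ɛ; min_samples: MinPts
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
# points that belong to no dense region are labelled -1 (noise)
```

Unlike k-means, DBSCAN does not require the number of clusters in advance and marks low-density points as noise rather than forcing them into a cluster.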
