What is K-Means Clustering?
K-means clustering is an unsupervised learning algorithm used to solve clustering problems in machine learning. It takes unlabeled data and groups it into k clusters so that each point belongs to the cluster whose center is nearest to it. The distance between a point and a cluster center is usually measured with either the Manhattan distance or the Euclidean distance.
In the Manhattan distance, the distance between two points is measured along the axes at right angles. For example, the Manhattan distance between (x1, y1) and (x2, y2) in 2-dimensional space is |x1 - x2| + |y1 - y2|.
The Euclidean distance between those two points, on the other hand, is the length of the straight line joining them: sqrt((x1 - x2)^2 + (y1 - y2)^2).
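As a small illustration, both distances can be computed in a few lines of Python. This is only a sketch: the function names manhattan_distance and euclidean_distance and the sample points are made up for this example, and the Euclidean distance is taken as the square root of the sum of squared coordinate differences.

```python
import math

def manhattan_distance(p, q):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean_distance(p, q):
    """Square root of the sum of the squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two example points in 2-dimensional space (values chosen for illustration).
p, q = (1, 2), (4, 6)
print(manhattan_distance(p, q))  # 7
print(euclidean_distance(p, q))  # 5.0
```

Because both functions loop over the coordinates, they also work for points with more than two dimensions.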
How does the K-Means Clustering algorithm work?
The k-means clustering algorithm works in the following way:
1. First, we choose k cluster centers at random.
2. We calculate the distance between each point and every cluster center. Then, we determine which cluster center the point belongs to. Please note that each point belongs to its nearest cluster center.
3. Now, we recalculate the cluster centers. For each cluster, we take all the points belonging to it and compute their mean. The new cluster center is located at that mean, that is, at the average position of the points that belong to the cluster.
4. We repeat steps 2 and 3 until the algorithm converges. In other words, we repeat them until the new locations of the cluster centers are the same as the previous locations.
If a dataset has N observations and D features, then we can represent the dataset as an N x D matrix, with one row per observation and one column per feature. This N x D matrix is the input to the k-means clustering algorithm.
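The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation. It assumes the Euclidean distance, represents the N x D matrix as a list of N rows, picks the initial centers from the data points themselves, and keeps a center where it is if no point gets assigned to it. The function name k_means and its parameters max_iter and seed are choices made for this example.

```python
import math
import random

def k_means(data, k, max_iter=100, seed=0):
    """Cluster an N x D dataset given as a list of N rows of D numbers.

    Returns (centers, labels), where labels[i] is the index of the
    center that row i was assigned to.
    """
    rng = random.Random(seed)
    # Step 1: choose k distinct data points at random as initial centers.
    centers = [list(row) for row in rng.sample(data, k)]
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(row, centers[j]))
            for row in data
        ]
        # Step 3: move each center to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if members:
                new_centers.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[j])
        # Step 4: stop once the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# Example: N = 6 observations, D = 2 features, forming two obvious groups.
data = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
        [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
centers, labels = k_means(data, k=2)
print(centers)
print(labels)
```

Which group ends up with label 0 and which with label 1 depends on the random initial centers, so only the grouping itself is meaningful.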
What are the advantages and disadvantages of K-Means Clustering?
K-means clustering is a simple algorithm that can be applied to large datasets. Its main disadvantage is that the value of k has to be chosen manually, and the clusters the algorithm converges to depend on this value of k.
Moreover, the clustering performance can also be affected by outliers.
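To see why outliers matter, recall that each cluster center is the mean of its points, and a mean is pulled strongly by extreme values. The short sketch below illustrates this with made-up points; the helper name centroid is chosen for this example only.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    return [sum(col) / len(points) for col in zip(*points)]

# A tight cluster of three points (illustrative values).
cluster = [(1, 1), (2, 2), (3, 3)]
print(centroid(cluster))                 # [2.0, 2.0]

# Adding a single outlier drags the center far away from the other points.
print(centroid(cluster + [(100, 100)]))  # [26.5, 26.5]
```

With the outlier included, the center no longer sits near any of the original three points, which can in turn change which points get assigned to the cluster.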