The k-Means clustering is a method for forming groups within a large data set. The algorithm tries to assign each data point to exactly one group in several steps so that the data within a cluster are as close to each other as possible and the clusters are as far apart as possible.
What is Clustering?
Clustering refers to finding subgroups within a data set so that data points within a group are as similar as possible and data points between different groups are as different as possible. Clustering takes place on the basis of properties, so-called features. That is, the groups are formed based on the fact that the data points within a cluster have similar feature characteristics.
Since there are no predefined groups to which the algorithm could orient itself during learning, clustering counts as Unsupervised (Machine) Learning. The assignment to groups happens solely on the structures within the data set, which the model has to learn on its own.
When working with large data sets, clustering helps to get a first impression about the features and the distribution of the data points. There are also many other applications for clustering:
- Market segmentation: An attempt is made to find similar customer groups with comparable purchasing behavior or other characteristics. This information can then be used for marketing activities or new product innovations.
- Image segmentation: An attempt is made to find the locations within an image that belong to a specific object, e.g. all pixels that are part of a car or the road. This can also be used as a pre-processing step for machine learning models.
- Document clustering: Within a document, an attempt is made to find passages with similar content focus.
How does the k-Means Algorithm work?
The k-Means algorithm is a special clustering method that iteratively tries to find a total of k clusters in the data set so that each data point belongs to exactly one group. Roughly speaking, there are a total of five steps, which we will look at in detail:
- We need to determine the number of clusters k.
- Random k data points are determined as cluster centers, the so-called centroids. There are different, so-called initialization methods for choosing the initial cluster centers. The performance of the algorithm can strongly depend on the initial centroids depending on the use case, so different initial scenarios should be tried.
- All other data points are assigned to the cluster whose cluster center has the smallest distance to the data point.
- Now a new cluster center is calculated for each group. It corresponds to the point in the cluster that has the smallest distance to all other data points in the group.
- If the cluster center has not changed at all (or very little), the algorithm stops here. If not, it goes back to step 3 and reassigns the data points to the next cluster.
What are the applications of k-means clustering?
k-Means clustering and clustering in general are popular in the field of unsupervised learning. Especially when the dataset has no labels that can be predicted concretely, and yet a grouping of the data should take place. In this section, we present some of the most common applications of k-means clustering.
- Image segmentation: When people look at images, they consciously or unconsciously create segments that are processed individually. The positioning of the elements also contains information, such as whether an object can be seen in the center of the image or the top right corner. K-means clustering is a pre-processing step in computer vision to identify segments with similar coloring or texture characteristics.
- Customer segmentation: Companies try to adapt their marketing strategies to different customer groups to create the most appealing communication possible. To do this, these groups must be so well defined that the customers within a group are as similar as possible and can be addressed with the same marketing tool. The k-means clustering is characterized by the fact that the homogeneity of a cluster is as high as possible and is therefore often used in customer segmentation.
- Anomaly detection: When detecting fraudulent activities, for example in e-mails or bank transactions, a large number of “benign” actions often encounter a small number of anomalies. Nevertheless, these fraudulent activities can cause a great deal of damage, for example, if a false bank transfer is made. For this reason, companies strive to use the most accurate anomaly detection possible, which often involves the use of k-means clustering.
- Genomic clustering: In biology, attempts are made to find related genes or proteins that can be grouped based on their functional similarities. This can also be done using k-means clustering.
- Analysis of social networks: Similar to customer segmentation, attempts are also made in social networks to find users with similar interests or similar behavior to either connect them or suggest similar content to them. Depending on the structure of the data, k-Means clustering can be used for this purpose.
- Clustering of documents: Within companies, there are often a large number of documents that are either duplicates or revolve around a similar topic. To bring order to these huge collections of data, k-Means clustering can be used.
This selection of examples clearly shows the versatility of k-means clustering and its broad applicability. Due to its simplicity and versatility, this algorithm is a popular choice when data needs to be grouped.
How to find the right Number of Clusters?
A major drawback of the k-Means clustering method is that one must determine the number of final groups before training the model. Of course, this is actually impossible without knowledge of the data set.
The so-called elbow method supports finding the right number of clusters. For different k values, the sum of all squared distances from the cluster center to the data points is calculated. For a single cluster, this sum is maximum and it becomes zero if we have as many clusters as data points, because then each data point is simply a cluster and the distance between cluster center and data point is correspondingly 0. In between, we try to find the optimum, i.e. the point at which the distances within a cluster are as small as possible and the number of clusters is also as small as possible.
From this graph, we then try to find the so-called knee or elbow, i.e. the point in the graph that has the prominent kink. From this point on, the distance of the points in the cluster does not decrease as much as before. Thus, new clusters do not improve the overall result as much as before. So in this case we should choose k = 3 to get an optimal result.
How do you evaluate the performance of a k-Means cluster?
To evaluate k-Means clustering, various metrics can be calculated after training to assess the quality of the clusters. Some of the most common metrics are
- Sum of squares within a cluster: this is the sum of the distances between each point in a cluster and the centroid. This value should be minimized to obtain an appropriately compact cluster with similar data points. However, this value is not sufficient as the only evaluation, as it only measures within a cluster.
- Silhouette value: This key figure is used to calculate how well a data point is assigned to a cluster and whether it does not fit better with another cluster. The values are between -1 and 1, with values around -1 indicating a poor assignment of the data point.
- Davies-Bouldin Index (DBI): This index provides information on how similar a cluster is to its nearest similar cluster. The similarity between two clusters should be as low as possible to achieve a clear separation between groups.
- Calinski-Harabasz Index (CHI): This key figure is used to calculate the ratio of the variance outside a cluster to the variance within the cluster. A high CHI value indicates a good grouping, as the variance within the cluster is low and outside the cluster is high.
These are some of the most important key figures for evaluating a clustering. It is important to note that several key figures should be calculated and included in the evaluation of the model in order to obtain meaningful results.
What are the limitations of k-Means Clustering?
Despite the widespread use of k-Means Clustering, this algorithm also has certain disadvantages or limitations that should be taken into account in practical application.
The biggest disadvantage of k-means clustering is that the number of clusters must be known before starting the training. Although there are ways to find the optimal number based on the data, this number may not match the number required by the application. As the number of clusters is a decisive factor that has a massive influence on the quality of the results, it is advisable to carry out detailed investigations here.
Another disadvantage also occurs at the beginning of the training, namely the sensitivity to the initialization of the cluster centers. Depending on which method is selected for this, different cluster memberships may occur during several training runs. For this reason, several clustering runs with different initializations should be carried out to obtain a stable prediction of group membership.
Furthermore, k-means clustering assumes that the groups are spherical and therefore all have the same variance. The algorithm optimizes that the distances of the points to the cluster center within a group are minimized. However, this implies that the optimal group lies within a sphere around the cluster center. This is not always the case, especially with more complex data structures.
A final disadvantage is the susceptibility to outliers, which deviate very strongly from other data and are therefore also very far away from the centroid. This greatly impairs training and cluster assignment and can unnecessarily prolong the algorithm run. If possible, data should therefore be pre-processed before clustering to remove outliers from the data set.
Despite these disadvantages, k-means clustering is a powerful algorithm within unsupervised learning that can be used for many practical applications. However, it is very important to think about the number of clusters and the initialization method in advance to obtain meaningful results.
This is what you should take with you
- K-Means Clustering is a method for forming groups of large data sets and belongs to the Unsupervised Learning methods.
- If possible, points within a group/cluster are relatively similar, while data points from different clusters are as different as possible.
- The k-Means clustering method is used, for example, to determine customer segments or to find areas in images that show the same thing.
- Using the so-called Elbow method, we can find the optimal number of cluster centers in advance.
What are N-grams?
Unlocking NLP's Power: Explore n-grams in text analysis, language modeling, and more. Understand the significance of n-grams in NLP.
What is the No-Free-Lunch Theorem?
Unlocking No-Free-Lunch Theorem: Implications & Applications in ML & Optimization
What is Automated Data Labeling?
Unlock efficiency in machine learning with automated data labeling. Explore benefits, techniques, and tools for streamlined data preparation.
What is Synthetic Data Generation?
Elevate your data game with synthetic data generation. Uncover insights, bridge data gaps, and revolutionize your approach to AI.
What is Multi-Task Learning?
Boost ML efficiency with Multi-Task Learning. Explore its impact on diverse domains from NLP to healthcare.
What is Federated Learning?
Elevate machine learning with Federated Learning. Collaborate, secure, and innovate while preserving privacy.
Other Articles on the Topic of k-Means Clustering
- On this page, you can observe a k-Means clustering model step by step as it is trained.
Niklas Lang
I have been working as a machine learning engineer and software developer since 2020 and am passionate about the world of data, algorithms and software development. In addition to my work in the field, I teach at several German universities, including the IU International University of Applied Sciences and the Baden-Württemberg Cooperative State University, in the fields of data science, mathematics and business analytics.
My goal is to present complex topics such as statistics and machine learning in a way that makes them not only understandable, but also exciting and tangible. I combine practical experience from industry with sound theoretical foundations to prepare my students in the best possible way for the challenges of the data world.