
What is k-Means Clustering?

k-Means clustering is a method for forming groups within a large data set. Over several iterations, the algorithm assigns each data point to exactly one group so that the data points within a cluster are as close to each other as possible and the clusters are as far apart as possible.

What is Clustering?

Clustering refers to finding subgroups within a data set so that data points within a group are as similar as possible and data points from different groups are as different as possible. Clustering is based on properties of the data points, the so-called features. That is, groups are formed so that the data points within a cluster have similar feature values.

Since there are no predefined groups that the algorithm could use as a reference during learning, clustering counts as unsupervised (machine) learning. The assignment to groups is based solely on the structures within the data set, which the model has to learn on its own.

When working with large data sets, clustering helps to get a first impression of the features and the distribution of the data points. There are also many other applications for clustering:

  • Market segmentation: Finding groups of customers with comparable purchasing behavior or other shared characteristics.
  • Image segmentation: Finding the regions within an image that belong to a specific object, e.g. all pixels that are part of a car or the road.
  • Document clustering: Finding passages within a document that share a similar content focus.

How does the k-Means Algorithm work?

The k-Means algorithm is a specific clustering method that iteratively tries to find a total of k clusters in the data set so that each data point belongs to exactly one group. Roughly speaking, there are five steps, which we will look at in detail:

  1. We need to determine the number of clusters k.
  2. k data points are chosen at random as the initial cluster centers, the so-called centroids.
  3. Every data point is assigned to the cluster whose center is closest to it.
  4. Now a new cluster center is calculated for each group. It is the mean of all data points currently assigned to that group, i.e. the point that minimizes the sum of squared distances to those data points. It does not have to coincide with an actual data point.
  5. If the cluster centers have not changed at all (or only very little), the algorithm stops here. If not, it goes back to step 3 and reassigns the data points to their nearest cluster center.
k-Means Clustering Process
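
The five steps can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function names (`k_means`, `squared_distance`, `mean_point`), the parameters `max_iter`, `tol` and `seed`, and the toy points are all assumptions made for this example. The distance measure is assumed to be the squared Euclidean distance, and an empty cluster simply keeps its previous center.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centroids):
    """Step 3: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def k_means(points, k, max_iter=100, tol=1e-9, seed=0):
    """Return (centroids, labels) for the given points and cluster count k (step 1)."""
    rng = random.Random(seed)
    # Step 2: choose k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        # Step 4: move every centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(mean_point(members) if members else centroids[j])
        # Step 5: stop once no centroid has moved by more than the tolerance.
        shift = max(squared_distance(c, n) for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    # Final assignment so that the labels match the returned centroids.
    return centroids, assign(points, centroids)


# Made-up toy data: two visibly separated groups of 2D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

Because the initial centers are drawn at random, different seeds can lead to different results on less clearly separated data, which is why the procedure is often run several times with different starting points.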

How to find the right Number of Clusters?

A major drawback of the k-Means clustering method is that one must determine the number of final groups before training the model. Without prior knowledge of the data set, this is hardly possible.

The so-called elbow method helps to find a suitable number of clusters. For different k values, the sum of the squared distances from each data point to its cluster center is calculated. For a single cluster, this sum is at its maximum, and it becomes zero if we have as many clusters as data points, because then each data point is its own cluster and the distance between cluster center and data point is 0. In between, we try to find the optimum, i.e. the point at which the distances within a cluster are as small as possible while the number of clusters is also as small as possible.
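
To make the two extremes concrete, the following sketch computes the sum of squared distances for one cluster, for two clusters and for as many clusters as data points. The function name `within_cluster_sum_of_squares`, the toy points and the hand-picked split into two groups are assumptions for illustration, and no clustering algorithm is run here.

```python
def within_cluster_sum_of_squares(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centroids[label]
        total += sum((a - b) ** 2 for a, b in zip(point, center))
    return total


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


# Made-up toy data: two visibly separated groups of 2D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
n = len(points)

# k = 1: a single cluster whose center is the mean of all points.
wcss_one = within_cluster_sum_of_squares(points, [mean_point(points)], [0] * n)

# k = 2: the first three and the last three points form one cluster each
# (a hand-picked split, assumed here instead of running k-Means).
two_centers = [mean_point(points[:3]), mean_point(points[3:])]
wcss_two = within_cluster_sum_of_squares(points, two_centers, [0, 0, 0, 1, 1, 1])

# k = n: every point is its own cluster and therefore its own center.
wcss_n = within_cluster_sum_of_squares(points, points, list(range(n)))

print(wcss_one, wcss_two, wcss_n)
```

The value for k = 2 lies between the two extremes, which is the behavior the elbow method relies on.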

Elbow Method | Photo: O’Reilly Media

From this graph, we then try to find the so-called knee or elbow, i.e. the point at which the curve has a prominent kink. From this point on, the sum of squared distances within the clusters no longer decreases as much as before, so additional clusters improve the overall result only slightly. In the graph shown here, the kink is at k = 3, so this is the value we should choose.
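
Reading the kink off a plot is a judgment call. One simple heuristic, sketched here as an assumption and not as the method used for the graph above, is to pick the k at which the curve bends most sharply, measured by the largest second difference of the values. The function name `elbow_by_second_difference` and the numbers in `example_values` are hypothetical and chosen only to illustrate the idea.

```python
def elbow_by_second_difference(values):
    """Return the k with the sharpest bend; values[i] belongs to k = i + 1."""
    if len(values) < 3:
        raise ValueError("need values for at least three different k")
    best_k, best_bend = None, float("-inf")
    # The first and last k have no neighbor on one side, so they are skipped.
    for i in range(1, len(values) - 1):
        # Discrete second difference: large when the drop before k is much
        # steeper than the drop after k.
        bend = values[i - 1] - 2 * values[i] + values[i + 1]
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k


# Hypothetical sums of squared distances for k = 1 .. 6 (illustrative only).
example_values = [1000.0, 600.0, 200.0, 170.0, 150.0, 140.0]
suggested_k = elbow_by_second_difference(example_values)
print(suggested_k)
```

Such a heuristic only suggests a candidate. On real data the curve is often smooth, so the suggestion should be checked against the plot and against what the clusters are meant to be used for.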

What you should take away

  • k-Means clustering is a method for forming groups within large data sets and belongs to the unsupervised learning methods.
  • Ideally, data points within a group/cluster are relatively similar, while data points from different clusters are as different as possible.
  • The k-Means clustering method is used, for example, to determine customer segments or to find areas in images that show the same thing.
  • Using the so-called elbow method, we can estimate a suitable number of clusters before training the final model.
  • On this page, you can observe a k-Means clustering model step by step as it is trained.