The curse of dimensionality is a problem that arises when you have a data set with many attributes/features. At first, it may seem positive to have many attributes and thus a lot of information, but this results in some problems when training Machine Learning models.
What is the Curse of Dimensionality?
The Curse of Dimensionality occurs with high dimensional data sets, those that have a large number of attributes or features. At first, having many attributes is a good thing because they contain a lot of information and circumscribe the data points well. For example, if we have a dataset about people, the attributes can be information such as hair color, height, weight, eye color, etc.
In a mathematical sense, each additional attribute means a new dimension in space and thus a significant increase in possibilities. Suppose we want to find out which customers buy certain products. In the first step, we only consider the age of the prospective customers and whether they have bought the product. We can still map this relatively easily in a two-dimensional diagram.
Once we add more information about the customer, things get a bit more complex. The information about the customer’s income would mean a new axis on which the numerical income is mapped. So the two-dimensional chart becomes a three-dimensional one. The additional attribute “gender” would lead to a fourth dimension and so on.
What problems arise with many dimensions?
When working with data, it is actually desirable to have a lot of attributes and information in the data set to give the model many possibilities to recognize structures in the data. However, it can also lead to serious problems, as the name Curse of Dimensionality already suggests.
The example shown illustrates a problem that occurs with many attributes. Due to a large number of dimensions, the so-called data space, i.e. the number of values that a data set can assume, also grows. This can lead to so-called data sparsity. This means that the training data set used to train the model does not contain certain characteristics at all or only very rarely. As a result, the model delivers only poor results for these edge cases.
Let’s assume that we examine 1,000 customers in our example, as it would be too time-consuming to survey even more customers or this data is simply not available. It is possible that all age groups from young to old are relatively well represented among these customers. However, if the additional dimension of income is added, it is rather unlikely that all possibilities of age and income combinations are simultaneously represented among the 1,000 customers.
In Machine Learning, when one wants to evaluate the similarity of different datasets, distance functions are often used for this purpose. The most common clustering algorithms, such as k-means clustering, rely on calculating the distance between points and assigning them to a cluster or not depending on the size. In multidimensional spaces, however, it can quickly come to the point that all points are similarly far apart, so that separation seems almost impossible.
We know this phenomenon to some extent from everyday life as well. If you take a photo of two objects, such as two trees, then they can look very close to each other in the picture, since it is only a two-dimensional image. In real life, however, it can happen that the trees are several meters apart, which only becomes clear in three dimensions.
These problems, which can arise in connection with many dimensions, are summarized under the term Curse of Dimensionality.
What can you do about the Curse of Dimensionality?
As you could probably already read out up to this point, the problem lies in the many dimensions. Accordingly, the solution is also obvious: We must manage to reduce the number of dimensions, preferably in such a way that as little information as possible is lost from the original data set.
There are basically two different approaches to this:
The first solution to combat the Curse of Dimensionality may be very single-minded: we select only those features that contribute particularly well to solving the problem. This means we limit ourselves only to dimensions that have some importance to the problem. There are several ways to do this as well, among the most common are these:
- Correlation analysis: To select suitable attributes, all correlations between two attributes are calculated. If two attributes have a high correlation, it may be possible to remove one of the attributes without losing much information content.
- Multicollinearity: This procedure is similar to the correlation analysis with the difference that not only pairwise attributes are examined, but also whether individual attributes can be represented as a sum, so to speak as a linear regression of other attributes. If this is possible, all attributes that are part of the regression line can be omitted.
The advantage of feature selection is that the attributes are preserved in their original form, which also makes them easier to interpret.
The second approach to combat the curse of dimensionality takes a different approach and attempts to form new, summarized dimensions from the many dimensions so that the number of dimensions is reduced while still retaining as much information content as possible.
A frequently used algorithm for this is the so-called Principal Component Analysis. The core idea of Principal Component Analysis is that several variables in a data set may measure the same thing, i.e. they are correlated. Thus, the different dimensions can be combined into fewer so-called principal components without compromising the validity of the data set. Body size, for example, has a high correlation with shoe size, since in many cases tall people also have a larger shoe size and vice versa. So if we remove shoe size as a variable from our data set, the information content does not really decrease.
In statistics, the information content of a data set is determined by the variance. This indicates how far the data points are from the center. The smaller the variance, the closer the data points are to their mean value and vice versa. A small variance thus indicates that the mean value is already a good estimate for the data set.
In the first step, PCA tries to find the variable that maximizes the explained variance of the data set. Then, step by step, more variables are added to explain the remaining part of the variance, because the variance, i.e. the deviation from the mean, contains the most information. This should be preserved if we want to train a model based on it.
This approach can in many cases lead to better results than a feature selection, but it creates new, artificial attributes that are not so easy to interpret. The first main component, for example, is a combination of various attributes. In individual cases, these can then be combined to form a larger supergroup. For example, attributes such as height, weight, and shoe size can be combined and interpreted as the group “physical characteristics”.
This is what you should take with you
- The Curse of Dimensionality describes the fact that data sets with many attributes, i.e. dimensions, can lead to problems during model training.
- Specifically, the curse of dimensionality can lead to data sparsity, i.e., the border areas of a given data set are not sufficiently represented.
- Furthermore, the so-called Distance Concentration occurs as the Curse of Dimensionality. This means that data points in high-dimensional spaces often have a similar distance, so clustering algorithms do not work at all or only very poorly.
ResNet: Residual Neural Networks -easily explained!
Explanation of a residual neural network with example in TensorFlow.
What is Batch Normalization?
Explanation of Batch Normalization and its advantages.
What is XGBoost?
Explanation of the XGBoost library including the gradient boosting method.
What is a Perceptron?
Explanation of perceptrons with example and how neural networks arise from them.
What is Overfitting?
Overfitting explained and strategies for avoiding it listed.
Cross Validation – easily explained!
Cross Validation explained with examples and concrete Python code snippets.
What is the Confusion Matrix?
Confusion Matrix explained with a detailed example.
How does the Apriori Algorithm work?
Explanation of the Apriori algorithm with an illustrative example.
What is Elasticsearch?
Explanation of the Elasticsearch search algorithm and its applications.
Long Short-Term Memory Networks (LSTM)- simply explained!
Explanation of Recurrent Neural Networks and LSTM models with example.
Other Articles on the Topic of Curse of Dimensionality
Tony Yiu illustrates the Curse of Dimensionality with a very nice example.