Principal Component Analysis (PCA) is used when you want to reduce the number of variables in a large data set. It does this by combining variables that correlate strongly with each other into a smaller number of new variables, the so-called principal components, which together still explain a large part of the variance in the data set.
When do we use Principal Component Analysis?
Various algorithms, such as linear regression, have problems if the data set contains variables that are correlated with each other, i.e. depend on each other. To avoid this problem, it can make sense to get rid of the redundancy caused by variables that correlate highly with another variable. At the same time, however, the data should retain as much of its original information content as possible. Principal Component Analysis addresses exactly this: it condenses correlated variables into a few uncorrelated components without a large loss of information.
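Before applying PCA, it can help to check how strongly the variables are actually correlated. The following is a minimal sketch with NumPy; the toy data, the variable names x1 to x3, the helper highly_correlated_pairs and the threshold of 0.9 are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: x2 is almost a copy of x1, x3 is independent.
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
names = ["x1", "x2", "x3"]

# Pearson correlation matrix; rowvar=False means columns are variables.
corr = np.corrcoef(X, rowvar=False)


def highly_correlated_pairs(corr, names, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


pairs = highly_correlated_pairs(corr, names)
print(pairs)
```

In this constructed example only the pair x1 and x2 exceeds the threshold, which is the kind of redundancy that PCA can condense.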
Another application of PCA is in cluster analysis, such as k-means clustering, where we need to define the number of clusters in advance. Reducing the dimensionality of the data set helps us to get a first impression of the information and to estimate, for example, which variables are the most important and how many clusters the data set could have. For example, if we manage to reduce the data set to three dimensions, i.e. three principal components, we can visualize the data points in a three-dimensional diagram. From this, we may be able to see the number of clusters.
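The reduction to three dimensions could look roughly like the following sketch. It uses a singular value decomposition from NumPy; the function name project_to_k_dims and the generated data (three hypothetical groups in ten dimensions) are assumptions for illustration.

```python
import numpy as np


def project_to_k_dims(X, k=3):
    """Project the rows of X onto the first k principal directions."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T


rng = np.random.default_rng(11)

# Hypothetical data: three groups of 100 points each in 10 dimensions.
centers = rng.normal(scale=5.0, size=(3, 10))
X = np.vstack([c + rng.normal(size=(100, 10)) for c in centers])

X_3d = project_to_k_dims(X, k=3)
print(X_3d.shape)  # (300, 3)
```

The three columns of X_3d can then be passed to any 3D scatter plot to look for visible groups.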
In addition, large data sets with many variables carry the risk that the model overfits. Simply put, this means that the model adapts too closely to the data during training and thus delivers poor results for new, unseen data. Therefore, for neural networks, for example, it can make sense to first train the model with the most important variables and then add new variables piece by piece, which may further increase the performance of the model without overfitting. Here, too, Principal Component Analysis is an important tool in the field of machine learning.
How does PCA work?
The core idea of Principal Component Analysis is that several variables in a data set may measure the same thing, i.e. are correlated. Thus, the different dimensions can be combined into fewer so-called principal components without compromising the validity of the data set. Body size, for example, has a high correlation with shoe size, since in many cases tall people also have a larger shoe size and vice versa. So if we remove shoe size as a variable from our data set, the information content hardly decreases.
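The body size example can be sketched with a few lines of NumPy. All numbers below (mean height, spread, the relation between height and shoe size) are made up for illustration, and the variance of each principal component is taken from the eigenvalues of the correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sample: shoe size (EU) roughly follows height (cm) plus noise.
height = rng.normal(loc=175.0, scale=10.0, size=1000)
shoe = 0.15 * height + 16.0 + rng.normal(scale=0.5, size=1000)

# Standardize both variables so that neither dominates because of its unit.
X = np.column_stack([height, shoe])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# The covariance matrix of standardized data is the correlation matrix.
# Its eigenvalues are the variances of the principal components.
eigenvalues = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]  # largest first
explained = eigenvalues / eigenvalues.sum()

print(np.round(explained, 3))
```

With these assumed parameters the first component carries well over 90% of the total variance, so describing each person by a single combined "size" value loses very little information.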
In PCA, the information content of a data set is measured by its variance. The variance indicates how widely the data points are spread around their center. The smaller the variance, the closer the data points are to their mean value and vice versa. A small variance thus indicates that the mean value is already a good estimate for the data set.
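For reference, the usual sample variance of a single variable can be written as follows. The symbols are assumptions chosen here: x_i is the i-th observation, n the number of observations, and \bar{x} the mean value.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
```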
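In the first step, Principal Component Analysis tries to find the line through the data that minimizes the perpendicular distance to the data points. This is equivalent to finding the direction along which the projected data points have the largest variance. The procedure resembles linear regression, but it is not the same: regression minimizes the vertical distance to a single target variable, whereas PCA treats all variables equally. The resulting line is a weighted sum, i.e. a linear combination, of all individual features of the data set and forms the first principal component.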
An attempt is then made to find a second line that is orthogonal, i.e. perpendicular, to the first principal component and again minimizes the distance to the data points. The lines must be orthogonal to each other because the principal components should not be correlated with each other, and because a perpendicular line can only explain variance that is not already contained in the first component.
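One common way to compute the components is an eigendecomposition of the covariance matrix. The sketch below follows that route with NumPy; the function name pca and the generated two-dimensional data are assumptions for illustration, not a reference implementation.

```python
import numpy as np


def pca(X):
    """Return (eigenvalues, components, scores) sorted by decreasing variance.

    X has shape (n_samples, n_features). The columns of `components` are the
    principal directions; `scores` are the data expressed in those directions.
    """
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # ascending order
    order = np.argsort(eigenvalues)[::-1]            # largest variance first
    eigenvalues = eigenvalues[order]
    components = eigenvectors[:, order]
    scores = X_centered @ components
    return eigenvalues, components, scores


rng = np.random.default_rng(1)

# Hypothetical 2-D data with two correlated columns.
a = rng.normal(size=300)
X = np.column_stack([a, 0.8 * a + 0.3 * rng.normal(size=300)])

eigenvalues, components, scores = pca(X)

# The two directions are perpendicular: their dot product is numerically zero.
print(components[:, 0] @ components[:, 1])
# The scores on the two components are uncorrelated.
print(np.corrcoef(scores, rowvar=False)[0, 1])
```

Both printed values are zero up to floating-point error, which reflects the two properties described above: the directions are orthogonal and the resulting components are uncorrelated.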
How many Principal Components should be kept?
Basically, there is a trade-off between the number of principal components and the remaining information content. With more components you explain more variance and thus retain more of the information contained in the data set. Very few components, on the other hand, mean that the dimensions have been greatly reduced, which is the purpose of Principal Component Analysis.
According to Kaiser (1960), however, there is a quite good reference point for selecting the components: only the principal components that have a variance (eigenvalue) greater than 1 should be kept. If the variables are standardized, each original variable has a variance of exactly 1, so only these components explain more variance than a single variable in the data set can and really lead to a dimension reduction.
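This relationship can be made concrete by looking at the cumulative share of variance. The following sketch picks the smallest number of components that reaches a chosen share; the function name components_for_variance, the target of 90% and the generated data (five variables driven by two hidden factors) are assumptions for illustration.

```python
import numpy as np


def components_for_variance(X, target=0.9):
    """Smallest number of components whose cumulative variance share >= target."""
    cov = np.cov(X, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(cov)[::-1]      # largest first
    eigenvalues = np.clip(eigenvalues, 0.0, None)    # guard against tiny negatives
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # Index of the first entry that reaches the target, converted to a count.
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(cumulative)), cumulative


rng = np.random.default_rng(7)

# Hypothetical data: 5 observed variables driven by 2 hidden factors plus noise.
latent = rng.normal(size=(400, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(400, 5))

k, cumulative = components_for_variance(X, target=0.9)
print(k, np.round(cumulative, 3))
```

Because the data were built from only two hidden factors, at most two components are needed here to pass the 90% mark, while the remaining ones add almost nothing.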
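Kaiser's rule is easy to apply once the eigenvalues are known. The sketch below works on the correlation matrix, which corresponds to standardized variables; the function name kaiser_components and the generated data (six variables in two correlated groups) are assumptions for illustration.

```python
import numpy as np


def kaiser_components(X):
    """Count components with eigenvalue > 1 on the correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]     # largest first
    return int(np.sum(eigenvalues > 1.0)), eigenvalues


rng = np.random.default_rng(3)

# Hypothetical data: two groups of three variables, each driven by one factor.
f1 = rng.normal(size=(500, 1))
f2 = rng.normal(size=(500, 1))
X = np.hstack([
    f1 + 0.5 * rng.normal(size=(500, 3)),
    f2 + 0.5 * rng.normal(size=(500, 3)),
])

n_keep, eigenvalues = kaiser_components(X)
print(n_keep, np.round(eigenvalues, 2))
```

The eigenvalues of a correlation matrix always sum to the number of variables, here 6, so an eigenvalue above 1 means the component carries more than the share of one average variable.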
How can the Principal Components be interpreted?
The principal components themselves are often difficult to interpret because they arise as a linear combination of the original dimensions. Thus, they represent a weighted mixture of several variables. In practical applications, however, these combinations of variables can frequently still be given a concrete meaning.
For example, consider a data set with various information about individuals, such as age, weight, height, creditworthiness, income, savings, and debt. In this dataset, for example, two principal components might emerge. The first principal component would presumably be composed of the dimensions of creditworthiness, income, savings, and debt, and would have high coefficients for these variables. This principal component could then be interpreted, for example, as the person’s financial stability.
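Such an interpretation usually starts from the coefficients (loadings) of a component. The sketch below generates purely hypothetical data for the seven variables named above, with one hidden "financial" and one hidden "physical" factor, and then lists the variables with the largest coefficients in the first component.

```python
import numpy as np

names = ["age", "weight", "height", "creditworthiness", "income", "savings", "debt"]

rng = np.random.default_rng(5)
n = 1000

# Hypothetical hidden factors; none of these numbers come from real data.
financial = rng.normal(size=n)
physical = rng.normal(size=n)


def noise():
    return 0.5 * rng.normal(size=n)


X = np.column_stack([
    rng.normal(size=n),      # age: unrelated to both factors in this toy setup
    physical + noise(),      # weight
    physical + noise(),      # height
    financial + noise(),     # creditworthiness
    financial + noise(),     # income
    financial + noise(),     # savings
    -financial + noise(),    # debt falls as the financial factor rises
])

# PCA on the correlation matrix, i.e. on standardized variables.
corr = np.corrcoef(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)     # ascending order
first = eigenvectors[:, -1]                          # largest eigenvalue

# The sign of an eigenvector is arbitrary; orient it so income is positive.
if first[names.index("income")] < 0:
    first = -first

loadings = {name: round(float(w), 2) for name, w in zip(names, first)}
top = sorted(names, key=lambda name: abs(first[names.index(name)]), reverse=True)[:4]

print(loadings)
print(top)
```

In this constructed example the four financial variables dominate the first component, with debt entering with a negative sign, which is what would justify a label such as "financial stability".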
What are the requirements?
Compared to similar statistical analyses, Principal Component Analysis has only a few requirements that must be met in order to obtain meaningful results. The basic properties that the data set should have are:
- The correlation between the features should be linear.
- The data set should be free of outliers, i.e. individual data points that deviate strongly from the mass.
- If possible, the variables should be continuous.
- The larger the sample, the more reliable the result of the PCA.
Not every data set is suitable for Principal Component Analysis as it stands. The data should be interval-scaled, i.e. an interval between two numerical values always has the same spacing, and ideally approximately normally distributed. Dates, for example, are interval-scaled, because from 01.01.1980 to 01.01.1981 the time interval is the same as from 01.01.2020 to 01.01.2021 (leap years excluded). Interval scaling in particular must be judged by the user and cannot be detected by standardized statistical tests.
This is what you should take with you
- Principal Component Analysis is used for dimension reduction in large data sets.
- It helps to preprocess data for downstream machine learning models, such as cluster analyses or linear regressions.
- Certain prerequisites must be met in the data set for PCA to be possible at all. For example, the correlation between the features should be linear.
Other Articles on the Topic of Principal Component Analysis
- You can find a detailed explanation of principal component analysis, including an illustrative video, from our colleagues at Studyflix.
- Schimmelpfennig, H: Known, current and new requirements for driver analyses. In: Keller, B. et al. (Eds.): Marktforschung der Zukunft – Mensch oder Maschine?, Wiesbaden, 2016, pp. 231-243.
- Kaiser, H. F.: The Application of Electronic Computers to Factor Analysis. In: Educational and Psychological Measurement, No. 1/1960, pp. 141-151.