The t-distributed stochastic neighbor embedding (tSNE) is an unsupervised algorithm for dimension reduction in large data sets. Traditionally, either Principal Component Analysis (PCA) is used for linear structures or neural networks for non-linear ones. The tSNE algorithm is a much simpler alternative to both.
Why do you need dimension reduction?
Various algorithms, such as linear regression, run into problems when the data set contains variables that are correlated with one another, i.e. that depend on each other. To avoid this, it can make sense to remove variables that correlate strongly with other variables. At the same time, the data should retain as much of its original information content as possible.
Another application is cluster analysis, for example k-means clustering, where the number of clusters has to be defined in advance. Reducing the dimensionality of the data set helps to get a first impression of the information and to estimate, for example, which variables are the most important and how many clusters the data set might contain. If we manage to reduce the data set to three dimensions, for example, we can visualize the data points in a diagram, from which the number of clusters can often already be read off.
In addition, large data sets with many variables carry the risk that the model overfits. Simply put, this means that the model adapts too closely to the training data and therefore delivers poor results for new, unseen data. For neural networks, for example, it can thus make sense to first train the model with the most important variables and then add further variables bit by bit, which may increase the performance of the model without overfitting.
How does tSNE work?
The approach of tSNE is relatively simple in theory. Given a high-dimensional data set, we define a distance measure between the data points. This can be a standard measure, such as the Euclidean distance, but also a custom function. In many cases, the distances are normalized so that the relative differences between data points matter rather than their absolute positions in space.
The tSNE algorithm then tries to find a low-dimensional space in which these distances are preserved as well as possible. To do so, it measures how well the neighborhood structure is preserved (in the original algorithm via the Kullback-Leibler divergence between neighborhood probabilities) and uses gradient descent to improve the result step by step, as sketched below.
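To make this concrete, here is a minimal NumPy sketch of the core idea, deliberately simplified: the real algorithm chooses a per-point bandwidth from the perplexity, symmetrizes the probabilities, and adds momentum and early exaggeration to the gradient descent, all of which are omitted here.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distances between all pairs of rows."""
    sq = np.sum(X ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)

def high_dim_affinities(X, sigma=1.0):
    """Gaussian similarities in the original space, normalized to sum to 1.
    Simplification: real tSNE picks one sigma per point from the perplexity."""
    P = np.exp(-pairwise_sq_dists(X) / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)  # a point is not its own neighbor
    return P / P.sum()

def low_dim_affinities(Y):
    """Student-t similarities in the embedding space."""
    K = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(K, 0.0)
    return K / K.sum(), K  # normalized probabilities and raw kernel

def tsne_step(Y, P, learning_rate=100.0):
    """One gradient-descent step on the KL divergence between P and Q."""
    Q, K = low_dim_affinities(Y)
    grad = np.zeros_like(Y)
    for i in range(Y.shape[0]):
        w = ((P[i] - Q[i]) * K[i])[:, None]  # neighborhood mismatch, reweighted
        grad[i] = 4.0 * np.sum(w * (Y[i] - Y), axis=0)
    return Y - learning_rate * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))             # toy high-dimensional data
P = high_dim_affinities(X)
Y = rng.normal(scale=1e-2, size=(50, 2))  # small random start
for _ in range(200):
    Y = tsne_step(Y, P)                   # Y is the 2D embedding
```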
What factors improve the result?
As we have already learned, this algorithm approaches its result step by step, an approach we know from the field of Machine Learning. Accordingly, there are also so-called hyperparameters whose values can have a great influence on the quality of the result. The following parameters must be taken into account (the sketch after this list shows how they appear in Scikit-Learn):
- Number of Iterations: In general, the algorithm approaches a better result with more iterations. However, the gain shrinks with each iteration, so the result improves only very slowly after a while. A trade-off must therefore be made between the quality of the result and the training time.
- Learning Rate: The learning rate determines the size of the changes in each iteration. A low learning rate makes the result converge only very slowly, while a high learning rate can prevent the algorithm from converging at all, i.e. it never settles on a stable result.
- Perplexity: Simply put, perplexity determines how local and global dependencies are weighted against each other; it can be read as a rough estimate of the number of effective neighbors each point has. In many cases, a higher perplexity leads to clearly separated clusters, while a low perplexity means that the data points remain relatively close together.
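In Scikit-Learn, these three knobs correspond directly to parameters of the TSNE class. The values below are purely illustrative; note that, depending on the library version, the iteration count is called n_iter (older releases) or max_iter (1.5 and later):

```python
from sklearn.manifold import TSNE

tsne = TSNE(
    n_components=2,        # target dimensionality
    n_iter=1000,           # number of gradient-descent iterations
    learning_rate="auto",  # step size of each update
    perplexity=30.0,       # balance between local and global structure
    random_state=42,       # tSNE is stochastic; fix the seed for repeatability
)
```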
tSNE vs. Principal Component Analysis
Although PCA and tSNE share the same goal, namely dimension reduction, the algorithms differ in several ways. First, tSNE works very well on a given data set but cannot be applied to new data points, since adding points changes the pairwise distances and the whole result has to be recomputed. PCA, on the other hand, produces an explicit projection rule that can also be applied to new data points that were not part of the data set during training.
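This difference is visible directly in the Scikit-Learn API: PCA offers a transform method for unseen points, while TSNE only offers fit_transform. A small sketch with random toy data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
X_new = rng.normal(size=(10, 5))

# PCA learns a reusable projection rule: fit once, transform new points later.
pca = PCA(n_components=2).fit(X_train)
X_new_2d = pca.transform(X_new)  # works on unseen data

# TSNE has no transform() method; embedding new points
# requires re-running the algorithm on the enlarged data set.
X_train_2d = TSNE(n_components=2, perplexity=30).fit_transform(X_train)
```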
The t-distributed stochastic neighbor embedding algorithm can also be used when the relationships between data points are non-linear. Principal Component Analysis, on the other hand, can only detect linear dependencies and take them into account in the projection. Neural networks can also handle non-linear dependencies, but they are laborious and time-consuming to train. Although tSNE also has a relatively long training phase compared to PCA, it is usually still shorter than that of neural networks and thus represents a good compromise.
Another important difference between PCA and tSNE is their focus on the data distribution. Principal Component Analysis tries to preserve the global arrangement of data points even in fewer dimensions. tSNE, on the other hand, focuses on local distances and correlations, which it tries to preserve in the lower-dimensional space. As a result, data reduced with tSNE often already looks as if it had been divided into clusters.
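A quick way to see this behavior is to embed the same data set with both methods, for example Scikit-Learn's handwritten-digits data; the parameter values here are illustrative, not tuned:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# 1,797 handwritten digits with 64 features (8x8 pixel images).
X, y = load_digits(return_X_y=True)

# PCA keeps the global variance structure of the data ...
X_pca = PCA(n_components=2).fit_transform(X)

# ... while tSNE preserves local neighborhoods, so the ten digit
# classes typically show up as clearly separated clusters.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
```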
How to implement tSNE in Python?
tSNE can be implemented in just a few lines of Python. For this example, we define four random NumPy arrays with four dimensions each, which we want to reduce to two. To do this, we import the TSNE class from Scikit-Learn.
In this class we can define the desired number of components, i.e. the final dimensions. We let the learning rate be determined automatically and set a perplexity of 3. After a short wait, we get four NumPy arrays with only two dimensions each, as desired. It is noticeable that the numerical values have grown considerably compared to the initial values. This is to be expected, since, as already mentioned, the algorithm only tries to preserve the distances between the data points, not their global positioning.
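Put together, the example might look like this (the input values are arbitrary and chosen only for illustration; the snippet stays close to the example in the Scikit-Learn documentation):

```python
import numpy as np
from sklearn.manifold import TSNE

# Four data points with four dimensions each.
X = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
])

# Reduce to two components; the perplexity must stay below the
# number of samples, so 3 is the maximum sensible value here.
X_embedded = TSNE(
    n_components=2,
    learning_rate="auto",
    init="random",
    perplexity=3,
).fit_transform(X)

print(X_embedded.shape)  # (4, 2)
```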
The example was largely taken from Scikit-Learn’s documentation on tSNE.
This is what you should take with you
- The t-distributed stochastic neighbor embedding (tSNE) is an unsupervised algorithm for dimension reduction in large data sets.
- It is used to reduce the dimensionality of data sets and thus helps prevent models from overfitting.
- The main difference from Principal Component Analysis is that tSNE can also be used for non-linear relationships between data points.
Other Articles on the Topic of tSNE
- Scikit-Learn provides extensive documentation on how to use tSNE in Python.