# tSNE: t-distributed stochastic neighbor embedding

The t-distributed stochastic neighbor embedding (short: tSNE) is an unsupervised algorithm for dimension reduction in large data sets. Traditionally, either Principal Component Analysis (PCA) is used for linear contexts or neural networks for non-linear contexts. The tSNE algorithm is an alternative that is much simpler compared to these algorithms.

### Why do you need the dimension reduction?

Various algorithms, such as linear regression, have problems if the data set has variables that are correlated with each other, i.e. depend on each other. To avoid this problem, it can make sense to remove the variables from the data set that correlate with another variable. At the same time, however, the data should not lose its original information content or should retain as much information as possible.

Another application we have is in cluster analysis, such as k-means clustering, where we need to define the number of clusters in advance. Reducing the dimensionality of the data set helps us to get a first impression of the information and to be able to estimate, for example, which are the most important variables and how many clusters the data set could have. For example, if we manage to reduce the data set to three dimensions, we can visualize the data points in a diagram. From this, it may then already be possible to read off the number of clusters.

In addition, large data sets with many variables also offer the danger that the model overfits. Simply explained, this means that the model adapts too much to the training data during training and thus only delivers poor results for new, unseen data. Therefore, for neural networks, for example, it can make sense to first train the model with the most important variables and then add new variables bit by bit, which may further increase the performance of the model without overfitting.

### How does tSNE work?

The idea of tSNE is to find a probability distribution in high-dimensional space that indicates how likely it is that two randomly selected points in space are direct neighbors. We then try to map this distribution from the high-dimensional space to a t-distribution (hence the t in tSNE) in the low-dimensional space in such a way that the divergence between the two distributions is minimized. Let’s take a closer look at the individual steps:

In the first step, a Gaussian normal distribution is formed, which gives the conditional probability that two randomly chosen points are direct neighbors. To do this, a data point is selected and the distances to all other points in the data set are calculated. The further apart the two points are, the further out one of the points lies on the normal curve. The starting point from which the measurement is taken is always the mean value of the Gaussian normal distribution.

In low-dimensional space, however, a t-distribution is used, which is very similar to a Gaussian normal distribution with the difference that the outer areas of the t-distribution are significantly higher. This means that even large distances can be mapped in a more differentiated way and do not appear too close together.

An iterative optimization procedure is used to transfer the existing distribution in high-dimensional space to the low-dimensional distribution as well as possible. The difference between the distributions is calculated using the Kullback-Leibler divergence and an attempt is made to minimize this step by step until an optimum is reached or the maximum number of iterations has been run through.

### What factors improve the result?

As we have already learned, this algorithm uses an approach in which the result is approached step by step. We already know this from the field of Machine Learning. Accordingly, there are also so-called hyperparameters, whose value can have a great influence on the quality of the result. The following parameters must be taken into account:

• Number of Iterations: In general, the algorithm will approach a better and better result with more iterations. However, the improvement decreases with each iteration, which means that the result improves only very slowly. Thus, a trade-off must be made between the quality of the result and the training time.
• Learning Rate: The learning rate influences the size of the changes in each iteration. A low learning rate leads to the result converging only very slowly, while a high learning rate can lead to the algorithm not converging at all, i.e. not approaching a unique result.
• Perplexity: Simply put, perplexity determines how the weighting between local and global dependencies should be. In many cases, a higher perplexity means separated clusters, while a low perplexity means that the data remain relatively close together.

### tSNE vs. Principal Component Analysis

Although the goal of PCA and tSNE is initially the same, namely dimension reduction, there are some differences in the algorithms. First, tSNE works very well for one data set, but cannot be applied to new data points, since this changes the distances between the data points and a new result must be calculated. PCA, on the other hand, produces a rule as a result that can also be applied to new data points that were not yet part of the data set during training.

The t-distributed stochastic neighbor embedding algorithm can also be used when the relationships between data points are non-linear. Principal Component Analysis, on the other hand, can only detect linear dependencies and include them in the separation. For non-linear dependencies, neural networks can also be used, but their effort and training are time-consuming. Although tSNE also has a relatively long training phase compared to PCA, it is usually still shorter than for neural networks and thus represents a good compromise.

Another important difference between PCA and tSNE is the focus on data distribution. Principal Component Analysis tries to maintain the global arrangement of data points even in fewer dimensions. tSNE, on the other hand, focuses more on local distances and correlations, which should be maintained even in lower dimensions. Therefore, it may appear that after a dimension reduction by tSNE, the data looks as if it has already been divided into clusters as well.

### What are the limitations of tSNE?

tSNE is a powerful tool for effectively transforming high-dimensional data into a low-dimensional space. This is particularly useful for visualizing such data sets to get a first impression of the location of the data points and to find possible clusters. However, the use of tSNE also has some disadvantages or limitations:

• Hyperparameters: The method uses some hyperparameters that have a strong influence on the quality of the result. Therefore, it often takes several training runs until an optimal set of hyperparameters is found. A run can take a lot of time and computing power, especially with large data sets.
• Local minima: tSNE is an optimization method that has a non-convex loss function. This means that in addition to the one global minimum, i.e. the lowest point in the entire definition range, there can also be so-called local minima, which look like a minimum in their immediate vicinity and have similar properties but are not the global lowest point. During the optimization process, the algorithm may get stuck in a local minimum and end the training without having found the optimal result.
• Handling new data: Compared to PCA, the tSNE method does not provide an embedding function that enables new data points to be mapped into the low-dimensional space. This means that no data points can be converted that were not part of the original data set. For this reason, tSNE is not suitable as a means of data compression or for converting data sets.
• Determinism: In computer science, we speak of a deterministic system if an algorithm always returns the same result when run several times with the same input parameters. This happens because a predefined path with fixed parameters is always followed. tSNE is not a deterministic system, which means that different results can occur with repeated training (even with identical hyperparameters).
• Global structures: The tSNE method aims to learn a conditional probability that indicates how likely it is that two data points are direct neighbors. Thus, great emphasis is placed on correctly mapping local structures, although it can also happen that large distances between data points are not optimally reproduced, which slightly distorts the final result.

In conclusion, tSNE is a useful method for visualizing high-dimensional data sets. However, the disadvantages mentioned should be borne in mind before using the method.

### How to implement tSNE in Python?

tSNE can be implemented using a few lines of code in Python. To do this, we define four random NumPy arrays that have four dimensions. We want to reduce these dimensions to two. To do this, we import the TSNE function from Scikit-Learn.

In this function, we can define the desired number of components, i.e. the final dimensions. The learning rate is to be determined automatically, and we also set a perplexity of 3. After a short wait, we now get four NumPy arrays with only two dimensions each, as desired. At the same time, it is noticeable that the numerical values have increased by quite a bit compared to the initial values. However, this was to be expected, since as already mentioned, only the distance between the data points is tried to be kept the same and not the global positioning.

The example was largely taken from Scikit-Learn’s documentation on tSNE.

### This is what you should take with you

• The t-distributed stochastic neighbor embedding (short: tSNE) is an unsupervised algorithm for dimension reduction in large data sets.
• It is needed to reduce datasets in dimension and thus prevent possible overfitting of models.
• The main difference to Principal Component Analysis is that it can also be used for non-linear relationships between data points.

## What is the Variance?

Explore variance's role in statistics and data analysis. Understand how it measures data dispersion.

## What is the Kullback-Leibler Divergence?

Explore Kullback-Leibler Divergence, a vital metric in information theory and machine learning, and its applications.

## What is the Maximum Likelihood Estimation?

Unlocking insights: Understand Maximum Likelihood Estimation (MLE), a potent statistical tool for parameter estimation and data modeling.

## What is the Variance Inflation Factor (VIF)?

Learn how Variance Inflation Factor (VIF) detects multicollinearity in regression models for better data analysis.

## What is the Dummy Variable Trap?

Escape the Dummy Variable Trap: Learn About Dummy Variables, Their Purpose, the Trap's Consequences, and how to detect it.

## What is the R-squared?

Introduction to R-Squared: Learn its Significance, Calculation, Limitations, and Practical Use in Regression Analysis.