The t-distributed stochastic neighbor embedding (short: tSNE) is an unsupervised algorithm for dimension reduction in large data sets. Traditionally, either Principal Component Analysis (PCA) is used for linear contexts or neural networks for non-linear contexts. The tSNE algorithm is an alternative that is much simpler compared to these algorithms.
Why do you need the dimension reduction?
Various algorithms, such as linear regression, have problems if the data set has variables that are correlated with each other, i.e. depend on each other. To avoid this problem, it can make sense to remove the variables from the data set that correlate with another variable. At the same time, however, the data should not lose its original information content or should retain as much information as possible.
Another application we have is in cluster analysis, such as k-means clustering, where we need to define the number of clusters in advance. Reducing the dimensionality of the data set helps us to get a first impression of the information and to be able to estimate, for example, which are the most important variables and how many clusters the data set could have. For example, if we manage to reduce the data set to three dimensions, we can visualize the data points in a diagram. From this, it may then already be possible to read off the number of clusters.
In addition, large data sets with many variables also offer the danger that the model overfits. Simply explained, this means that the model adapts too much to the training data during training and thus only delivers poor results for new, unseen data. Therefore, for neural networks, for example, it can make sense to first train the model with the most important variables and then add new variables bit by bit, which may further increase the performance of the model without overfitting.
How does tSNE work?
The approach of tSNE is relatively simple in theory. Assuming we have a high-dimensional data set, we define a distance measure between the data points. This can be known distance measures, but it can also be custom functions that are defined. In many cases, this involves normalizing the distance so that it is the differences in the data points that matter, rather than the actual distance in space.
The tSNE algorithm then tries to find a low-dimensional space in which these distances are preserved as well as possible. For this purpose, it uses the so-called gradient method to improve the results step by step.
What applications use the process?
tSNE has several applications in data analysis and visualization, including:
- Visualization of high-dimensional data: The technique can be used to reduce the dimensionality of high-dimensional data to a two- or three-dimensional space, allowing for easier visualization and interpretation.
- Clustering: tSNE can be used to identify clusters in data by mapping similar data points closer together in reduced dimensional space.
- Anomaly detection: Dimensional reduction can be used to identify outliers or anomalies in data by mapping them far away from the main cluster.
- Image analysis: tSNE has been used in image analysis tasks such as object detection, classification, and segmentation.
- Natural Language Processing: tSNE has been used for word embedding visualization and topic modeling in natural language processing.
- Genomics: tSNE has been used for visualization and analysis of high-dimensional gene expression data in genomics research.
- Neuroscience: Dimensionality reduction has been used for the analysis and visualization of large neural recordings in neuroscience research.
- Recommender systems: tSNE has been used in recommender systems for clustering similar products and visualizing user preferences.
Overall, tSNE is a versatile tool that can be used for a wide range of data analysis tasks.
What factors improve the result?
As we have already learned, this algorithm uses an approach in which the result is approached step by step. We already know this from the field of Machine Learning. Accordingly, there are also so-called hyperparameters, whose value can have a great influence on the quality of the result. The following parameters must be taken into account:
- Number of Iterations: In general, the algorithm will approach a better and better result with more iterations. However, the improvement decreases with each iteration, which means that the result improves only very slowly. Thus, a trade-off must be made between the quality of the result and the training time.
- Learning Rate: The learning rate influences the size of the changes in each iteration. A low learning rate leads to the result converging only very slowly, while a high learning rate can lead to the algorithm not converging at all, i.e. not approaching a unique result.
- Perplexity: Simply put, perplexity determines how the weighting between local and global dependencies should be. In many cases, a higher perplexity means separated clusters, while a low perplexity means that the data remain relatively close together.
tSNE vs. Principal Component Analysis
Although the goal of PCA and tSNE is initially the same, namely dimension reduction, there are some differences in the algorithms. First, tSNE works very well for one data set, but cannot be applied to new data points, since this changes the distances between the data points and a new result must be calculated. PCA, on the other hand, produces a rule as a result that can also be applied to new data points that were not yet part of the data set during training.
The t-distributed stochastic neighbor embedding algorithm can also be used when the relationships between data points are non-linear. Principal Component Analysis, on the other hand, can only detect linear dependencies and include them in the separation. For non-linear dependencies, neural networks can also be used, but their effort and training are time-consuming. Although tSNE also has a relatively long training phase compared to PCA, it is usually still shorter than for neural networks and thus represents a good compromise.
Another important difference between PCA and tSNE is the focus on data distribution. Principal Component Analysis tries to maintain the global arrangement of data points even in fewer dimensions. tSNE, on the other hand, focuses more on local distances and correlations, which should be maintained even in lower dimensions. Therefore, it may appear that after a dimension reduction by tSNE, the data looks as if it has already been divided into clusters as well.
What are the limitations of tSNE?
Although tSNE has proven to be a powerful tool for visualizing high-dimensional data, it is not without limitations. Some of the main limitations of tSNE are:
- Sensitivity to hyperparameters: The performance of tSNE is highly dependent on the choice of hyperparameters, especially the perplexity value. Choosing the wrong perplexity value can lead to poor results, and there is no generally accepted method for selecting an appropriate value.
- Difficulty in interpreting results: Although tSNE can provide visually compelling results, it can be difficult to interpret the meaning of the resulting visualization, especially when dealing with complex or noisy data.
- Susceptibility to overfitting: Like many Machine Learning techniques, tSNE is susceptible to overfitting, especially for small datasets or datasets with low signal-to-noise ratios.
- Scalability: Although tSNE is capable of visualizing high-dimensional data, it can be computationally intensive and time consuming when run on large datasets.
- Potential misleading results: In some cases, tSNE can produce misleading results, especially on data that have complex or non-linear relationships between variables.
Despite these limitations, tSNE remains a valuable tool for visualizing and analyzing high-dimensional data. To avoid potential pitfalls and obtain the best results, it is important to choose hyperparameters carefully, validate results with other techniques, and interpret results with caution.
How to implement tSNE in Python?
tSNE can be implemented using a few lines of code in Python. To do this, we define four random NumPy arrays that have four dimensions. We want to reduce these dimensions to two. To do this, we import the TSNE function from Scikit-Learn.
In this function, we can define the desired number of components, i.e. the final dimensions. The learning rate is to be determined automatically, and we also set a perplexity of 3. After a short wait, we now get four NumPy arrays with only two dimensions each, as desired. At the same time, it is noticeable that the numerical values have increased by quite a bit compared to the initial values. However, this was to be expected, since as already mentioned, only the distance between the data points is tried to be kept the same and not the global positioning.
The example was largely taken from Scikit-Learn’s documentation on tSNE.
This is what you should take with you
- The t-distributed stochastic neighbor embedding (short: tSNE) is an unsupervised algorithm for dimension reduction in large data sets.
- It is needed to reduce datasets in dimension and thus prevent possible overfitting of models.
- The main difference to Principal Component Analysis is that it can also be used for non-linear relationships between data points.
Unlocking the Mean Squared Error (MSE): Learn its Math, Applications, and Practical Insights in Data Analysis & ML.
Other Articles on the Topic of tSNE
- Scikit-Learn provides extensive documentation on how to use tSNE in Python.