Overfitting is a term from the field of data science and describes a model that adapts too strongly to its training data set. As a result, the model performs poorly on new, unseen data. The goal of a Machine Learning model, however, is good generalization, so that it can make reliable predictions on new data.
What is Overfitting?
The term overfitting is used for predictive models that are fitted too specifically to the training data set and thus learn the random scatter (noise) of the data along with the underlying pattern. This often happens when the structure of the model is too complex for the underlying data. The problem is that the trained model generalizes poorly, i.e., it provides inadequate predictions for new, unseen data. Its performance on the training data set, on the other hand, is excellent, which is why one could wrongly assume a high model quality.
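The gap between training and test performance can be illustrated with a small sketch. It assumes a hypothetical ground truth y = 2x with Gaussian noise and compares a model that simply memorizes the training points with a plain straight-line fit; all names and numbers here are illustrative and not taken from a real data set.

```python
import random

def make_data(n, rng):
    """Noisy samples of the assumed ground truth y = 2x."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, 3.0)) for x in xs]

def mse(predict, data):
    """Mean squared error of a prediction function on (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
train = make_data(30, rng)
test = make_data(200, rng)

def memorizer(x):
    """Overly flexible model: return the target of the nearest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

# Simple model: least-squares line through the origin, y = w * x.
w = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

def line(x):
    return w * x

print("memorizer train/test MSE:", mse(memorizer, train), mse(memorizer, test))
print("line      train/test MSE:", mse(line, train), mse(line, test))
```

The memorizer reproduces every training target exactly, so its training error is zero, yet that says nothing about how well it predicts the unseen test points.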
Some factors may indicate an impending overfitting early on:
- Small Data Set: If the training set contains only a few samples, the probability is very high that the model simply memorizes them, because far too little information is available to learn an underlying structure. The more trainable parameters the model has, the more problematic this becomes. A neural network, for example, has a large number of parameters in each hidden layer. Therefore, the more complex the model, the larger the data set should be.
- Selection of the Training Dataset: If the selection of training data is already unbalanced, there is a high probability that the model will learn this imbalance and thus generalize poorly. The sample from a population should always be chosen randomly so that selection bias does not occur. For an election projection, for example, it is not enough to survey only the voters at one polling place, since they represent the opinion in that constituency but not that of the whole country.
- Many Training Epochs: A model trains for several epochs, and in each epoch it aims to further minimize the loss function and thereby improve the quality of the model. Beyond a certain point, however, the training loss can only be reduced further by adapting ever more closely to the training dataset.
How do you recognize Overfitting?
Unfortunately, there is no single analysis that can determine with certainty whether a model is overfitted or not. However, there are some metrics and analyses that can provide indications of impending overfitting. The best and simplest method is to look at the error curve of the model over the training iterations.
If the error in the training data set continues to decrease, but the error in the validation data set begins to increase again, this indicates that the model fits the training data too closely and thus generalizes poorly. The same evaluation can be done with the loss function.
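This check can also be written down as a small helper. The following is a minimal sketch that assumes the per-epoch errors are already available as two lists; the function name and the example values are made up for illustration.

```python
def divergence_epoch(train_loss, val_loss):
    """Return the 0-based epoch with the lowest validation loss if, after it,
    the validation loss ends higher while the training loss ends lower.
    Return None if no such divergence is visible."""
    best = min(range(len(val_loss)), key=val_loss.__getitem__)
    if best == len(val_loss) - 1:
        return None  # validation loss was still improving at the end
    val_rose = val_loss[-1] > val_loss[best]
    train_fell = train_loss[-1] < train_loss[best]
    return best if (val_rose and train_fell) else None

# Hypothetical loss curves recorded over seven epochs.
train_loss = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18]
val_loss = [0.95, 0.70, 0.58, 0.55, 0.57, 0.62, 0.70]

print(divergence_epoch(train_loss, val_loss))  # prints 3
```

In this example the validation loss is lowest in epoch 3 and rises afterwards, while the training loss keeps falling, which is the pattern described above.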
To build such a graph, you need a so-called validation or test set, i.e. data that the model has not seen during training. If the data set is large enough, you can usually split off 20-30% of it and use it as a test data set. Otherwise, you can use so-called k-fold cross-validation, which is somewhat more complex but also works for smaller data sets.
The data set is divided into k blocks of equal size. One of the blocks is randomly selected and serves as the test data set, while the remaining blocks together form the training data. In the next training step, another block is used as the test data, and the process repeats until every block has served as the test set once.
The number of blocks k can be chosen freely, and in most cases a value between 5 and 10 is used. A very large k gives a less biased estimate of the model's performance, but the estimate varies more and the computational cost rises, because more models have to be trained. A very small k leads to a more biased estimate, since the procedure then approaches the simple hold-out method.
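The splitting procedure described above can be sketched in a few lines. This is an illustrative implementation using only the standard library; the function name `k_fold_indices` and its parameters are assumptions for this example, and libraries such as scikit-learn provide ready-made equivalents.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs so that every sample lands in the
    test block exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Block sizes differ by at most one if n_samples is not divisible by k.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(10, k=5):
    print(len(train_idx), len(test_idx))  # prints "8 2" for each of the 5 folds
```

A model would be trained and evaluated once per pair, and the k test errors are then averaged to obtain the overall estimate.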
How to prevent Overfitting?
There are many different ways to prevent overfitting, or at least reduce its likelihood. In many cases, two of the following suggestions are already sufficient to keep the risk of overfitting low:
- Data Set: The data set plays a very important role in avoiding overfitting. It should be as large as possible and contain varied data. Furthermore, enough time should be spent on the data preparation process. If incorrect or missing values occur too frequently, the complexity increases and the risk of overfitting increases accordingly. A clean data set, on the other hand, makes it easier for the model to recognize the underlying structure.
- Data Augmentation: In many applications, such as image recognition, individual samples are reused and given to the model with slight modifications for training. These changes can be, for example, a black and white copy of an image or the same text with some typos in it. This makes the model more stable and helps it learn to deal with data variations and become more independent of the original training data set.
- Stopping Rule: When starting the training, you specify a maximum number of epochs after which it is finished. In addition, it can make sense to stop the training at an early stage, for example, if the validation loss has not improved over several epochs and there is thus a risk that the model runs into overfitting. In TensorFlow, a separate callback can be defined for this purpose and passed to model.fit together with validation data:
import tensorflow as tf
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)
- Feature Selection: A dataset often contains a variety of features that are given as input to the model. However, not all of them are really needed to predict the correct result. Some features may even be correlated, i.e. interdependent. If a large number of features is available, a preselection should be made with the help of suitable algorithms. Otherwise, the complexity of the model increases and the risk of overfitting is high.
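One simple way to preselect features is to drop those that are strongly correlated with a feature that has already been kept. The sketch below is only one possible approach; the helper names, the example columns, and the threshold of 0.95 are assumptions chosen for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant column: treated as uncorrelated here
    return cov / math.sqrt(var_a * var_b)

def drop_correlated(features, threshold=0.95):
    """Keep a feature only if its absolute correlation with every feature
    kept so far stays below the (assumed) threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical columns: height_in carries the same information as height_cm.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [41, 23, 35, 29, 52],
}
print(drop_correlated(features))  # prints ['height_cm', 'age']
```

Which of two correlated features is kept depends on the column order here; in practice, domain knowledge or a dedicated feature selection method may be a better guide.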
Overfitting vs. Underfitting
However, care must be taken when avoiding overfitting, as the model can also drift to the other extreme, known as underfitting. We speak of underfitting when the model is not able to recognize an underlying structure from the given data and thus also delivers poor results in generalization. Underfitting can be recognized by the fact that the training error stagnates at a high level and does not decrease any further.
Such underfitting can occur when the model is not sufficiently complex to represent the underlying structure. Another problem may be that important features needed to model the relationship are missing from the data set.
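The difference between the two extremes can be summarized as a rough rule of thumb. The following sketch assumes error values between 0 and 1; the two thresholds are arbitrary placeholders that would need to be adapted to the task at hand.

```python
def diagnose(train_error, val_error, high_error=0.30, max_gap=0.10):
    """Rough classification based on final training and validation error.
    Both thresholds are illustrative assumptions, not fixed rules."""
    if train_error >= high_error:
        return "underfitting"  # the model does not even fit the training data
    if val_error - train_error > max_gap:
        return "overfitting"   # good on training data, much worse on unseen data
    return "ok"

print(diagnose(0.45, 0.47))  # prints underfitting
print(diagnose(0.05, 0.30))  # prints overfitting
print(diagnose(0.10, 0.13))  # prints ok
```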
This is what you should take with you
- Overfitting is a term from the field of data science and describes a model that adapts too strongly to its training data set.
- It is usually recognizable by the fact that the error in the test data set increases again, while the error in the training data set continues to decrease.
- Among other things, overfitting can be prevented by adding an early stopping rule or by removing features from the data set in advance that are correlated with other features.
Other Articles on the Topic of Overfitting
- IBM offers an interesting article on overfitting, which was also used as a source for this post.