Overfitting is a term from the field of data science and describes a model's tendency to adapt too closely to the training data set. As a result, the model performs poorly on new, unseen data. However, the goal of a Machine Learning model is good generalization, so that predictions on new data become possible.
What is Overfitting?
The term overfitting is used for predictive models that are too specific to the training data set and thus learn the noise in the data along with the actual pattern. This often happens when the model's structure is too complex for the underlying data. The problem is that the trained model generalizes poorly, i.e., it provides inadequate predictions for new, unseen data. Its performance on the training data set, on the other hand, is excellent, which is why one could wrongly assume a high model quality.
Several factors can signal a risk of overfitting early on:
- Small Data Set: If the training set contains only a few data points, the model is very likely to simply memorize them, since there is far too little information to learn an underlying structure. The more trainable parameters the model has, the worse this becomes. A neural network, for example, has a large number of parameters in each hidden layer. Therefore, the more complex the model, the larger the data set should be.
- Selection of the Training Dataset: If the selected training data is already unbalanced, the model is likely to learn that imbalance and therefore generalize poorly. The sample from a population should always be chosen randomly so that selection bias does not occur. To make a projection during an election, for example, it is not enough to survey the voters at a single polling place: they only represent the opinion in that constituency, not that of the whole country.
- Many Training Epochs: A model trains for several epochs, and in each epoch it aims to further minimize the loss function and thereby improve the quality of the model. Beyond a certain point, however, the training loss can only be reduced further by adapting more closely to the training dataset.
How do you recognize Overfitting?
Unfortunately, there is no single analysis that can determine with certainty whether a model is overfitted or not. However, there are some metrics and analyses that can indicate impending overfitting. The best and simplest method is to look at the error curve of the model over the training iterations.
If the error in the training data set continues to decrease, but the error in the validation data set begins to increase again, this indicates that the model fits the training data too closely and thus generalizes poorly. The same evaluation can be done with the loss function.
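The check described above can also be sketched in code. The following is a minimal illustration, not a standard library routine: the function name `find_divergence_epoch`, the `patience` parameter, and the error values are all hypothetical and chosen only to show the idea.

```python
def find_divergence_epoch(train_errors, val_errors, patience=3):
    """Return the epoch with the best validation error if the validation
    error then failed to improve for `patience` epochs while the training
    error kept falling. Return None if no such point is found."""
    best_epoch = 0
    for epoch in range(1, len(val_errors)):
        if val_errors[epoch] < val_errors[best_epoch]:
            # Validation error still improving: remember this epoch.
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            # Validation stalled; overfitting is suspected only if the
            # training error has nevertheless continued to decrease.
            if train_errors[epoch] < train_errors[best_epoch]:
                return best_epoch
    return None


# Hypothetical error curves, one value per epoch (not measured data).
train_errors = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18, 0.15]
val_errors = [0.95, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.67]

divergence_epoch = find_divergence_epoch(train_errors, val_errors)
print(divergence_epoch)
```

With these made-up curves, the validation error is lowest at epoch index 3 and rises afterwards while the training error keeps shrinking, which is the pattern to look for in a real error plot.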
To build such a graph, you need a so-called validation or test set, i.e. data the model has not seen during training. If the data set is large enough, you can usually split off 20-30% of it and use it as a test data set. Otherwise, you can use so-called k-fold cross-validation, which is somewhat more complex but also works for smaller data sets.
The data set is divided into k blocks of equal size. One of the blocks is randomly selected to serve as the test data set, and the remaining blocks form the training data. In the second training step, a different block is defined as the test data, and the process repeats until every block has served as the test set once.
The number of blocks k can be chosen freely; in most cases, a value between 5 and 10 is used. Too large a value leads to a less biased error estimate, but the variance of that estimate tends to increase. Too small a value leads to a more biased estimate, because the procedure then approaches the hold-out method.
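Both splitting strategies can be sketched with NumPy. This is a minimal illustration that only produces index arrays; the helper names `holdout_split` and `k_fold_indices` are hypothetical, and in practice a library such as scikit-learn offers ready-made equivalents.

```python
import numpy as np


def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the sample indices and split off a test portion."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]  # (train, test)


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index arrays; each block is the test set once."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    # array_split tolerates sample counts that are not divisible by k.
    folds = np.array_split(indices, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx


# Hold-out: 20% of 100 samples are reserved for testing.
holdout_train, holdout_test = holdout_split(n_samples=100, test_fraction=0.2)

# 5-fold cross-validation on a small data set of 23 samples.
splits = list(k_fold_indices(n_samples=23, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

A model would be trained once per split and the k test errors averaged, so every data point contributes to both training and evaluation.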
How to prevent Overfitting?
There are many different ways to prevent overfitting, or at least to reduce its likelihood. In many cases, two of the following suggestions are already enough to keep the risk of overfitting low:
- Data Set: The data set plays a very important role in avoiding overfitting. It should be as large as possible and contain varied data. Furthermore, enough time should be spent on the data preparation process. If incorrect or missing data occurs too frequently, the complexity increases and the risk of overfitting increases accordingly. A clean data set, on the other hand, makes it easier for the model to recognize the underlying structure.
- Data Augmentation: In many applications, such as image recognition, individual data points are reused and passed to the model with slight modifications for training. These changes can be, for example, a black and white copy of an image or the same text with some typos in it. This makes the model more stable and helps it learn to deal with data variations and become more independent of the original training data set.
- Stopping Rule: When starting the training, you specify a maximum number of epochs after which the training is finished. In addition, it can make sense to stop the training early, for example if it makes no real progress over several epochs (the validation loss no longer decreases) and there is thus a risk that the model runs into overfitting. In TensorFlow, a separate callback can be defined for this purpose, which is passed to model.fit together with validation data:
import tensorflow as tf
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)
- Feature Selection: A dataset often contains a variety of features that are given as input to the model. However, not all of them are really needed to predict the correct result. Some features may even be correlated, i.e. interdependent. If a large number of features are available, a preselection should be made with the help of suitable algorithms. Otherwise, the complexity of the model increases and the risk of overfitting is high.
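As one possible illustration of the feature preselection mentioned in the last point, the sketch below drops features that are almost perfectly correlated with a feature that has already been kept. The function name `drop_correlated_features`, the threshold of 0.95, and the synthetic data are assumptions made for this example; other selection algorithms exist and may suit a given problem better.

```python
import numpy as np


def drop_correlated_features(X, threshold=0.95):
    """Return the column indices to keep, skipping any column whose
    absolute correlation with an already kept column reaches the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, i] < threshold for i in keep):
            keep.append(j)
    return keep


# Synthetic data: the third column is almost a multiple of the first one.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, 2.0 * a + 0.01 * rng.normal(size=200)])

kept_columns = drop_correlated_features(X)
print(kept_columns)
```

The redundant third column is removed, so the model receives fewer inputs without losing information.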
Overfitting vs. Underfitting
Overfitting and underfitting are two common Machine Learning problems that occur when a model does not generalize well to new, unseen data. In the case of overfitting, the model becomes too complex and overfits the training data, resulting in poor performance on the test data. In contrast, underfitting occurs when the model is too simple to capture the underlying patterns in the data and results in poor performance on both training and test data.
However, care must be taken when avoiding overfitting, as the model can also drift to the other extreme, known as underfitting. We speak of underfitting when the model is not able to recognize an underlying structure in the given data and thus also delivers poor results in generalization. Underfitting can be recognized by the fact that the training error stagnates at a high level and does not decrease further.
Such underfitting can occur when the model is not sufficiently complex to represent the underlying structure. Another cause may be that important features needed to capture the relationship are missing from the data set.
To understand the difference between overfitting and underfitting, it is helpful to look at the tradeoff between bias and variance. Bias refers to the error introduced by the simplifying assumptions the model makes about the data, while variance refers to its sensitivity to noise in the training data. A model with high bias and low variance is likely to underfit the data, while a model with low bias and high variance is likely to overfit the data.
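The tradeoff can be made tangible with a small experiment. The sketch below fits polynomials of different degrees to noisy samples of a sine curve; the target function, the noise level, the sample sizes, and the chosen degrees are all assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Draw n noisy samples of sin(pi * x) on the interval [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.1, size=n)
    return x, y


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the given data."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


x_train, y_train = make_data(20)   # small training set
x_val, y_val = make_data(200)      # larger unseen validation set

errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))
    print(degree, errors[degree])
```

The straight line (degree 1) is too simple and shows a high error on both sets, which corresponds to high bias. The training error shrinks as the degree grows, but for very flexible models the validation error typically stops following it, which is the high-variance side of the tradeoff.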
Finding the right balance between overfitting and underfitting is essential for developing accurate and reliable Machine Learning models. Techniques such as regularization, cross-validation, feature selection, early stopping, and ensemble methods help to develop models that generalize well to new, unseen data.
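Of the techniques listed, regularization is the only one not described elsewhere in this article, so here is a brief sketch of one common variant, L2 (ridge) regularization for linear regression. The helper `ridge_fit`, the penalty strength `alpha`, and the synthetic data are assumptions for illustration only.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression: solve (X^T X + alpha * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Synthetic data: only the first of ten features influences the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
y = X[:, 0] + rng.normal(scale=0.5, size=30)

weights_unregularized = ridge_fit(X, y, alpha=0.0)
weights_regularized = ridge_fit(X, y, alpha=10.0)

print(np.linalg.norm(weights_unregularized))
print(np.linalg.norm(weights_regularized))
```

The penalty shrinks the weights toward zero, which limits how strongly the model can adapt to noise in the irrelevant features. A suitable value for `alpha` is usually found with a validation set or cross-validation.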
What are the consequences of overfitting in the application?
Overfitting can have significant consequences in the real world, especially in applications where the model’s predictions are used to inform decisions. Some examples of the consequences of overfitting are:
- Poor generalization: Overfitted models are highly tuned to the training data and therefore often do not generalize well to unseen data. This means that the model performs well on the training data but poorly on the new data, making it useless in real-world applications.
- Decision-making biases: Overfitted models can lead to biased results that favor certain outcomes or groups of data. This can be particularly problematic in decision-making applications, such as credit scoring or medical diagnosis, where even small deviations can have significant consequences.
- Wasted resources: Overfitting often leads to the development of complex models that require significant computational resources to train and deploy. This can result in wasted time, money, and computing power, especially if the resulting model is not useful in practice.
- Increased risk: Overfitted models can give the impression of accuracy, leading to overconfidence in their predictions. This can increase the risk of making incorrect decisions or taking inappropriate actions based on flawed model results.
In summary, overfitting can have serious practical consequences, which is why it is important to develop models that balance complexity and generalization.
This is what you should take with you
- Overfitting is a term from the field of data science and describes a model's tendency to adapt too closely to the training data set.
- It is usually recognizable by the fact that the error in the test data set increases again, while the error in the training data set continues to decrease.
- Among other things, overfitting can be prevented by inserting an early stopping rule or by removing features from the data set in advance that are correlated with other features.
Other Articles on the Topic of Overfitting
- IBM offers an interesting article on overfitting, which was also used as a source for this post.