Every time a machine learning model is trained, the question arises as to how good the model’s predictions are. This is where the so-called loss functions come into play, which measure how far the predictions differ from the actual values in the data set. Each model is set up in such a way that it attempts to minimize the loss function during the training process. The Mean Squared Error (MSE) is a standard loss function used for regression tasks.
In this article, we will examine the MSE in more detail and take a closer look at the mathematical calculation, the applications, advantages, and disadvantages, as well as the implementation in Python. This will give you a basic understanding of this loss function and enable you to use it confidently in your own projects.
What is the purpose of MSE?
In machine learning, evaluating model performance is essential, both to ensure that good predictions are made on new, unseen data and to drive convergence during the learning process. In supervised learning, the difference between the prediction and the actual label in the data set is calculated. The greater this difference, the poorer the performance of the model. In practice, there are various so-called loss functions, which differ in how exactly this difference is calculated. The mean squared error is a standard loss function for many machine learning models, such as regressions or neural networks.
In general, the MSE and other loss functions are commonly used for the following reasons:
- To quantify error: “You can’t manage what you can’t measure.” This principle also applies to machine learning, as the model can only improve if the errors can be measured. A loss function ensures that poor predictions can be quantified and thus provides a mathematical basis for improvement.
- A common benchmark for evaluation: Shared loss functions or metrics allow several models to be compared directly with each other. The same applies to comparing several training runs, for example after hyperparameters have been adjusted. This makes it possible to assess objectively whether the changes have led to an improvement or a deterioration.
- Improving model performance: Loss functions are not only used for the final evaluation of the model but also help to make the predictions more accurate during training. The aim is to minimize the error and bring the predictions closer to the actual values. Algorithms such as gradient descent help with this by using the loss function to adjust the model parameters so that performance improves.
- Decision support: Loss functions are also used in practice to decide whether or not to deploy a model. In portfolio management, for example, a stock price prediction is evaluated for accuracy to determine whether the financial risk of incorrect predictions is acceptable. The same logic is used to select algorithms in healthcare or manufacturing.
- Diagnostic tool: By analyzing the loss function in more detail, it is also possible to identify specific data points with which the model had problems. This allows weaknesses to be identified and the model can be adjusted accordingly.
Loss functions such as the mean squared error are therefore not just a pure calculation of the deviation between the prediction and the actual values, but also serve as a central instrument in machine learning and data analysis. Their tasks range from quantifying errors and providing a common point of comparison to diagnosing weak points in the model. In the course of this article, we will therefore try to understand the MSE in more detail in order to use it effectively.
How do you calculate the MSE (including an example)?
In mathematical terms, the mean squared error calculates the difference between the prediction and the actual label of the data point. This difference is then squared so that the sign of the deviation is irrelevant and large deviations (>1) carry more weight than small deviations (<1). These squared differences are then added up for all data points and the mean value is calculated by dividing the total sum by the number of data points.
Mathematical Formula
The formula for the Mean Squared Error is as follows:
\[MSE = \frac{1}{n} \sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]
Where:
- MSE is the abbreviation for mean squared error.
- n stands for the total number of data points in the data set.
- \(y_i\) denotes the actual value of the i-th data point in the data set.
- \(\hat{y}_i\) is the prediction made by the model for the i-th data point.
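As a brief worked example with invented numbers: suppose a model produces the predictions \(\hat{y} = (2.5, 5.5, 4)\) for the actual values \(y = (3, 5, 4)\). The mean squared error is then:

\[MSE = \frac{1}{3}\left((3 - 2.5)^2 + (5 - 5.5)^2 + (4 - 4)^2\right) = \frac{0.25 + 0.25 + 0}{3} \approx 0.17\]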
Example in Python
The calculation of the mean squared error is required in many machine learning tasks and can be performed easily in Python, as it is available as a function in the Scikit-Learn library. In this example, we train an example model for predicting real estate prices in California and then visualize the mean squared error of our model.
Step 1: Importing the required libraries
First, all libraries used for this example are imported. In addition to Scikit-Learn, this also includes Pandas and NumPy.
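A minimal sketch of these imports could look like this (Matplotlib is additionally assumed here, as it is needed for the visualization in step 5):

```python
# Libraries for data handling, modeling, evaluation, and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
```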
Step 2: Loading and preparing the dataset
We use the California housing dataset, which can be loaded via Scikit-Learn. We will train a linear regression on it and therefore split the dataset into the target variable and the input variables.
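Assuming the dataset is loaded as a DataFrame, this step could look as follows:

```python
# Load the California housing dataset as a Pandas DataFrame
housing = fetch_california_housing(as_frame=True)

# Separate the input features from the target variable (median house value)
X = housing.data
y = housing.target
```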
Step 3: Splitting the data into training and test sets
To be able to independently evaluate the quality of the model, the data set is split into training and test sets so that predictions for unseen data can be generated afterward. A basic model can also be trained after this split.
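A possible sketch of this step (the 80/20 split and the fixed random seed are illustrative choices):

```python
# Split the data into training and test sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a basic linear regression model on the training data
model = LinearRegression()
model.fit(X_train, y_train)
```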
Step 4: Calculating the mean squared error
After training the model, the mean squared error for the test set can now be calculated to assess how well the model makes predictions for unseen data.
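Using the mean_squared_error function from Scikit-Learn, this could look like this:

```python
# Generate predictions for the unseen test data
y_pred = model.predict(X_test)

# Calculate the mean squared error between predictions and actual values
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")
```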
This value can now be used, for example, to compare several training runs with different input parameters or with a different model architecture.
Step 5: Creating a visualization
The visualization can be used to clarify whether the mean squared error was driven by a few severely incorrect predictions or whether all predictions deviated only slightly from the actual values. A scatter plot can be used for this purpose, showing the actual values against the predictions and illustrating the differences. In a perfect model, all points would lie on a 45° line from bottom left to top right. The deviations are marked with red lines between each actual value and the corresponding prediction:
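One possible Matplotlib implementation of this plot is sketched below:

```python
# Scatter plot of actual values (x-axis) against predicted values (y-axis)
plt.figure(figsize=(8, 8))
plt.scatter(y_test, y_pred, color="blue", alpha=0.5, label="Predictions")

# Red vertical line between each actual value and its prediction
plt.vlines(y_test, y_test, y_pred, color="red", alpha=0.3, label="Errors")

# 45° reference line on which a perfect model would place all points
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
plt.plot(lims, lims, color="black", linestyle="--", label="Perfect prediction")

plt.xlabel("Actual values (y_test)")
plt.ylabel("Predicted values (y_pred)")
plt.legend()
plt.show()
```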
Here are the individual steps that the code executes:
- First, a scatter plot is created with the actual values (y_test) on the x-axis and the predicted values (y_pred) on the y-axis as blue dots.
- The difference between the prediction and the actual value is then calculated for each data point.
- This difference is then marked with a vertical red line between the points.
This visualization helps to identify outliers and provides insight into the magnitude and direction of the errors. It is usually a good complement to the mean squared error for gaining a better understanding of the errors, and targeted measures to reduce the error can often be derived from it.
How do you interpret the MSE?
To use the mean squared error correctly, it is important to interpret the value correctly. Some important points should be taken into account to draw the right conclusions from the MSE:
- Magnitude of the value: By squaring the differences, the mean squared error is always positive. Lower values, closer to zero, indicate better prediction quality, as the predictions and the actual values in the data set lie closer together. A high error value, on the other hand, indicates poor predictive performance of the model. However, it is important to note that the mean squared error is not normalized, meaning that the order of magnitude of the target variable influences the order of magnitude of the MSE. For example, if house prices are predicted, which are usually six figures, the mean squared error of even a good model will be higher than that of a regression predicting food prices.
- Squared units: It is important to note that squaring also squares the unit of the target variable. In our example, in which house prices are predicted in dollars, the MSE accordingly has the unit of dollars squared.
- Model comparison: If different models were trained with the same data set, the model with the lowest mean squared error should be prioritized. A significantly lower error compared to the other algorithms also indicates a significantly better forecasting performance.
- Sensitivity to outliers: When interpreting the mean squared error, it should be noted that outliers with an exceptionally large error disproportionately influence the overall value. Such outliers should therefore be investigated before conclusions are drawn.
- Value of the residual analysis: Residuals are the differences between the actual and predicted values. In a good model, the residuals should be approximately normally distributed around a mean of zero. If this is not the case, it indicates that there were problems fitting the model to the data. The mean squared error can be used for an initial assessment of the residuals (a short sketch of such an analysis follows this list).
- Consideration of scaling: As already mentioned, it is important to note that the scaling of the data also influences the scaling of the mean squared error. Therefore, the MSEs should not be compared between two models with different scales of the target variable, as this is not meaningful.
- Use of complementary metrics: The MSE, just like other evaluation metrics, should not be used as the only metric for the quality of a model. It is best to always look at a model from different angles and use other metrics in addition.
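As a minimal sketch of such a residual analysis, reusing y_test and y_pred from the Python example above:

```python
# Residuals: difference between actual and predicted values
residuals = y_test - y_pred

# In a well-fitted model, the residuals scatter around a mean of zero
plt.hist(residuals, bins=50, color="steelblue", edgecolor="black")
plt.axvline(0, color="red", linestyle="--")
plt.xlabel("Residual (actual - predicted)")
plt.ylabel("Frequency")
plt.show()

print(f"Mean residual: {residuals.mean():.4f}")
```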
Overall, the mean squared error is a valuable tool for assessing the predictive quality of a regression model. A low value is an indicator of good predictive performance, but must always be interpreted in the context of the problem and should be used in conjunction with other metrics.
What are the advantages and disadvantages of the mean squared error?
The MSE is a widely used loss function for a wide variety of regression models. However, like any measurement tool, it has its advantages and disadvantages:
Advantages of Mean Squared Error:
- Differentiability: Mathematically, the mean squared error is a continuous and differentiable function, making it compatible with gradient-based optimization techniques. This property is particularly useful when working with neural networks, for example, as these are trained with gradient descent, which requires a differentiable loss function.
- Sensitivity to deviations: By squaring the differences between the prediction and the actual value, large deviations are given a higher weight. This encourages models that avoid predictions with large errors.
- Mathematical properties: In addition to continuity and differentiability, the mean squared error has other advantageous mathematical properties. For example, minimizing the MSE is equivalent to maximum likelihood estimation when the errors are assumed to be normally distributed.
- Well-defined optimization: The MSE leads to stable convergence of the model, as the mean squared difference between predictions and actual values is minimized and the function has no jumps or discontinuities.
Disadvantages of the mean squared error:
- Sensitivity to outliers: If data sets contain outliers that the model cannot predict well, the mean squared error may be a poor loss function, as it then does not accurately reflect the predictive quality of the model and can be heavily biased by individual incorrect predictions (a small demonstration follows this list).
- Unit mismatch: As a result of squaring, the MSE does not have the same unit as the prediction, which makes the value much harder to interpret. To interpret the loss on the scale of the target variable, metrics of the same order of magnitude as the target variable, such as the RMSE, must be used instead.
- No insight into the direction: Depending on the application, it can also be interesting to know whether the model tends to overestimate or underestimate the target variable. The MSE obscures this information, as it treats positive and negative errors identically.
- Not robust to model assumptions: This loss function implicitly assumes that the errors follow a normal distribution with constant variance. If this is not the case in an application, the MSE may not accurately measure model performance.
- Potential overfitting: If the model is too complex, overfitting can occur even with this loss function, causing the algorithm to adapt too closely to the training data and deliver poor predictions for new data.
- Limited applicability: The MSE is a standard loss function and should only be used when its assumptions fit the problem at hand. For example, if different errors carry different costs, other loss functions should be used.
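The following minimal sketch with invented numbers demonstrates the outlier sensitivity: a single large error dominates the MSE even though all other predictions are accurate.

```python
import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred_good = np.array([10.5, 11.5, 11.0, 13.5, 12.0])     # small errors everywhere
y_pred_outlier = np.array([10.5, 11.5, 11.0, 13.5, 22.0])  # one large error

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

print(f"MSE with only small errors: {mse(y_true, y_pred_good):.3f}")
print(f"MSE with one outlier error: {mse(y_true, y_pred_outlier):.3f}")
```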
The Mean Squared Error is a preferred loss function due to its mathematical properties and sensitivity to large deviations. It can accurately measure model performance, especially in regression applications, but should be used with caution if the data set contains outliers that can lead to high deviations. In general, it makes sense to consult several evaluation criteria for the quality of a model.
Which applications use the mean squared error?
In the field of machine learning, the mean squared error is generally used in three phases of model development:
- Model evaluation: After training, this error can be used to evaluate the trained model and its predictive performance. The simple calculation can be used to quickly assess whether good predictions can be made.
- Algorithm comparison: During training, the mean squared error can be used to compare different model architectures or even different algorithms on a data set. This makes it possible to evaluate the experiments against each other and determine which model architecture is most suitable for the application.
- Model optimization during training: Many models can use the mean squared error as a loss function, so that it drives model optimization and convergence during training. Due to its good mathematical properties, it is particularly suitable for gradient descent, as it is differentiable and continuous, which is a prerequisite for this algorithm; a minimal sketch of this idea follows below.
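As a minimal, self-contained sketch (with invented toy data), the following code fits a straight line by repeatedly stepping the parameters in the direction that reduces the MSE:

```python
import numpy as np

# Toy data: a noisy linear relationship y = 2x + 1
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(0, 1, size=100)

# Initialize slope w and intercept b
w, b = 0.0, 0.0
learning_rate = 0.01

for _ in range(1000):
    y_pred = w * x + b
    error = y_pred - y
    # Gradients of the MSE with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Update the parameters along the negative gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"Learned parameters: w = {w:.2f}, b = {b:.2f}")
```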
This error is therefore a central loss function and a key figure in the field of machine learning, which is used in a wide variety of models and different stages of training.
What are extensions and alternatives to the mean squared error?
As the mean squared error does not only have advantages, slightly modified metrics have emerged over time that either extend it for certain scenarios or are tailored to other applications. These adaptations address specific disadvantages of the mean squared error and provide a more comprehensive view of model performance. The most widely used alternatives are listed below, followed by a short code sketch showing how to compute them:
- Root Mean Squared Error (RMSE): This error is the square root of the Mean Squared Error. Taking the square root gives the error the same unit as the target variable, making it easier to interpret. It is often preferred when the error is also to be interpreted directly and not just used as a pure evaluation criterion for model performance.
- Mean Absolute Error (MAE): The Mean Absolute Error calculates the absolute differences between the predicted and actual values. This makes it less sensitive to outliers, as extreme errors are not penalized disproportionately more than smaller errors. It is often used in regression analyses in which outliers must be tolerated and should therefore not influence the evaluation of the model performance too strongly.
- Mean Absolute Percentage Error (MAPE): This error calculates the average percentage deviation between the predicted and actual values. It is widely used in business contexts, where it is common to measure deviations in percentage terms.
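All three metrics are available in Scikit-Learn; a short sketch with invented values (the RMSE is computed here as the square root of the MSE) could look like this:

```python
import numpy as np
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
)

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                                    # same unit as the target
mae = mean_absolute_error(y_true, y_pred)              # less sensitive to outliers
mape = mean_absolute_percentage_error(y_true, y_pred)  # relative (fractional) error

print(f"MSE: {mse:.3f}, RMSE: {rmse:.3f}, MAE: {mae:.3f}, MAPE: {mape:.1%}")
```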
Depending on the application, different metrics from this portfolio can be used. They are useful extensions of the widely used mean squared error.
This is what you should take with you
- The mean squared error is a widely used loss function that measures the average squared deviation between the prediction and the actual value.
- One advantage of this metric is that extreme deviations carry more weight than small ones. However, this has the disadvantage that the unit of error is the squared unit of the target variable, making it difficult to interpret.
- In addition, the MSE is sensitive to outliers and can be distorted by them so that the model performance is not evaluated correctly.
- The MSE is used in a wide variety of applications, such as finance or healthcare.
- Over the years, various extensions of the MSE have been developed that optimize the error for other applications and eliminate certain disadvantages.
Other Articles on the Topic of Mean Squared Error
IBM provides an interesting article on the topic that you can find here.
Niklas Lang
I have been working as a machine learning engineer and software developer since 2020 and am passionate about the world of data, algorithms and software development. In addition to my work in the field, I teach at several German universities, including the IU International University of Applied Sciences and the Baden-Württemberg Cooperative State University, in the fields of data science, mathematics and business analytics.
My goal is to present complex topics such as statistics and machine learning in a way that makes them not only understandable, but also exciting and tangible. I combine practical experience from industry with sound theoretical foundations to prepare my students in the best possible way for the challenges of the data world.