Overfitting must always be considered when training meaningful machine learning models. When it occurs, the model adapts too closely to the training data and therefore provides only poor predictions for new, unseen data. Ridge regression, also known as L2 regularization, offers an effective solution to this problem when training a linear regression. By including an additional penalty controlled by the so-called regularization parameter, it prevents the regression coefficients from becoming too large and thus reduces the risk of overfitting.
In the following article, we look at ridge regression and its mathematical principles. We also examine in detail how the results can be interpreted and highlight the differences from other regularization methods. Finally, we explain step by step, using a simple example, how to implement ridge regression in Python.
What is the Ridge Regression?
Ridge regression is a modification of linear regression that has been extended by an additional regularization term to avoid overfitting. In contrast to classic linear regression, which is trained solely to minimize the residuals between the predictions and the actual data, ridge regression also takes the size of the regression coefficients into account and attempts to prevent individual coefficients from becoming very large.
Avoiding very large coefficients makes it less likely that the model learns excessively complex relationships from the training data and thus overfits. Instead, the regularization parameter helps the regression to generalize better, i.e. to recognize the underlying structure in the data and thus also deliver better results on new, unseen data.
What does the Ridge Regression look like mathematically?
Ridge regression is based on classical linear regression, which attempts to learn the parameters of a linear regression equation. In its general form, this regression equation looks like this:
\[y = \beta \cdot X + \epsilon\]
Here are:
- \(y\) is the target variable that the model should predict, such as the value of a property or the average fuel consumption of a car.
- \(X\) contains values from the data set that could explain the size of the target variable, for example, the number of rooms in the property or the number of cylinders in an engine.
- \(\beta\) is the so-called regression coefficient, which is learned during training and mathematically describes the relationship between \(X\) and \(y\). For example, a coefficient of 60,000 could mean that the value of a property increases by €60,000 with each additional room.
- \(\epsilon\) is an error term that measures the deviations between the model’s prediction and the actual value. This can occur, for example, if the data set does not include all the relevant properties required for the prediction. In the real estate example, the age of the residential unit could be a decisive factor in determining the price. If this is not present in the data set, the model will probably not be able to predict the price accurately.
The aim of linear regression is now to learn the values of \(\beta\) in such a way that the residual sum of squares (RSS), i.e. the squared difference between the prediction \(\widehat{y_i}\) and the actual value \(y_i\), is minimized:
\[RSS = \sum_{i=1}^{n}{(y_i - \widehat{y_i})}^2\]
The L2 regularization now adds a further penalty term to this residual sum of squares, which is intended to ensure that the individual regression coefficients do not become too large. Mathematically, the ridge regularization then looks as follows:
\[\text{Cost Function} = RSS + \lambda \cdot \sum_{j=1}^{p}\beta_j^2\]
Here are:
- RSS is the residual sum of squares, as already known from linear regression.
- \(\lambda\) is the regularization parameter, which determines how strong the influence of the regularization is. We will take a closer look at its meaning in the next section.
- \(\sum_{j=1}^{p}\beta_j^2\) is the sum of all squared regression coefficients. The squaring ensures, on the one hand, that every coefficient contributes with a positive sign and therefore always adds to the cost function; on the other hand, very large values carry disproportionate weight, as they are amplified even further by squaring.
Due to this structure, the regularization term always increases the cost function, and it does so much more strongly when individual coefficients become very large than when all parameters stay small. The model therefore has an incentive to keep the coefficients as small as possible during training, as this is the only way it can minimize the cost function.
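To make this concrete, the following minimal NumPy sketch evaluates the cost function for a few made-up predictions and coefficients; all numbers are purely illustrative:

```python
import numpy as np

# Purely illustrative values: true targets, model predictions and coefficients
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4])
beta = np.array([1.2, -0.5, 2.0])
lam = 0.5  # regularization parameter lambda

rss = np.sum((y_true - y_pred) ** 2)   # residual sum of squares
penalty = lam * np.sum(beta ** 2)      # L2 penalty on the coefficients
cost = rss + penalty                   # ridge cost function

print(f"RSS = {rss:.2f}, penalty = {penalty:.2f}, cost = {cost:.2f}")
```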
What does the Regularization Parameter \(\lambda\) do?
The regularization parameter \(\lambda\) is a decisive factor in ridge regression, which controls the strength of the penalty that a model receives if the regression coefficients increase too much. Since the parameter must always be positive, the following two cases can be distinguished:
- \(\lambda = 0\): In this special case, the regularization term is eliminated and the ridge regression corresponds to the classic linear regression. Thus, the model behaves accordingly and only focuses on minimizing the residual sum of squares without taking the size of the coefficients into account. This increases the risk of overfitting, as the model may learn overly complex relationships in the training data.
- \(\lambda \rightarrow \infty\): As \(\lambda\) increases, the strength of the penalty increases. As a result, the regression focuses more on keeping the coefficients small than on reducing the deviation between the prediction and the actual value. In the limiting case, the model can only achieve this by keeping the coefficients very close to 0. This leads to an overly simple model that is unlikely to capture the correlations in the data and therefore delivers poor results.
The choice of \(\lambda\) is a critical decision that significantly affects the performance of the ridge regression. As a hyperparameter, it must be set before training and cannot be adjusted during training. A \(\lambda\) that is too small can lead to overfitting, while one that is too large leads to a model so simple that it may not capture the underlying structures sufficiently.
Introducing the penalty term gives the model a systematic error, i.e. a bias, which leads to a smoothing of the model. At the same time, the variance of the model is reduced, making it less sensitive to strongly fluctuating training data, as the coefficients are kept smaller. Together, these properties increase the generalization ability of the model and reduce the risk of overfitting.
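The effect of \(\lambda\) can also be seen directly in the closed-form solution that minimizes the ridge cost function (ignoring the intercept), \(\widehat{\beta} = (X^TX + \lambda I)^{-1}X^Ty\). The following NumPy sketch applies this formula to synthetic, strongly correlated features; the data and the chosen values of \(\lambda\) are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, illustrative data: two almost identical (strongly correlated) features
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=n)

# Closed-form ridge solution: beta = (X^T X + lambda * I)^(-1) X^T y
for lam in [0.0, 1.0, 100.0]:
    beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    print(f"lambda = {lam:>5}: beta = {np.round(beta, 3)}")
```

As \(\lambda\) grows, the printed coefficients shrink further and further towards zero.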
How can the Ridge Regression be interpreted?
The interpretation of ridge regression follows that of linear regression in many areas, with the difference that the size of the coefficients is significantly influenced by the regularization parameter. The following principles of linear regression still apply:
- The size of the coefficients \(\beta\) remains decisive: larger coefficients indicate a stronger relationship between the independent and dependent variables.
- The coefficients can also be used for a relative comparison by sorting them according to their absolute size to decide which independent variable has the greatest influence on the dependent variable.
- In addition, the signs retain the same meaning as in linear regression. Accordingly, a negative sign in front of a coefficient means that an increase in the independent variable leads to a decrease in the dependent variable and vice versa.
In addition to these similarities, however, it should be noted that the regularization parameter of the ridge regression has a decisive influence on the size of the coefficients. In general, it can be stated that the coefficients of the ridge regression are usually smaller than those of a comparable linear regression due to the regularization. This behavior is particularly pronounced when there is strong multicollinearity between the variables. Due to the structure of the model, the regularization parameter forces the coefficients to be closer to 0 without setting them completely to 0, as can happen with lasso regression, for example.
Overall, it can be said that ridge regression provides better results with multicollinear data, as it distributes the influence more evenly across correlated variables instead of assigning it largely to a single one. This often makes the models easier to interpret than comparable linear regressions trained on the same data. In addition, the smaller the regularization parameter \(\lambda\), the more the results and their interpretation resemble those of a comparable linear regression.
What are the Differences to Other Regularization Methods?
Ridge regression is one of the regularization methods that aims to prevent overfitting. Although it has similarities with other methods, such as L1 regularization or Elastic Net, it also differs significantly in its effects.
Ridge Regression vs L1 Regularization (Lasso)
The distinction mainly concerns the handling of the regression coefficients and the resulting properties of the models. In this section, we therefore take a closer look at lasso regression, which is very similar in structure to ridge regression.
Lasso regression, also known as L1 regularization, adds up the absolute value of the coefficients in the regularization term:
\[\text{Cost Function} = RSS + \lambda \sum_{j=1}^{p}\left|\beta_j\right|\]
The regularization parameter \(\lambda\) has the same function here and determines the strength of the regularization. Because the absolute values of the coefficients are summed instead of their squares, the penalty grows only linearly with the size of a coefficient, so very large coefficients are not punished disproportionately as they are in ridge regression.
Due to this difference in structure, lasso regression can set the coefficients of individual variables completely to zero, which results in variable selection. Ridge regression, on the other hand, only reduces the size of all coefficients without setting them completely to zero. For this reason, ridge regression should primarily be selected if all predictors are to be retained and there is a problem with multicollinearity. Lasso regression, on the other hand, is suitable if you want to perform additional variable selection, as it can remove individual variables from the model by setting their coefficients to zero.
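This difference can be illustrated with a small, purely synthetic example in which only a few of the features actually carry information; the data set and the alpha value are chosen for illustration only:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data in which only 3 of the 10 features are truly informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=42)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
# Lasso typically sets the uninformative coefficients exactly to zero,
# while ridge only shrinks them towards zero.
print("Exact zeros in Lasso:", int(np.sum(lasso.coef_ == 0.0)))
```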
Ridge Regression vs Elastic Net
Elastic net combines the properties of lasso and ridge regression by including both L1 and L2 regularization. This results in the following cost function:
\[\text{Cost Function} = RSS + \lambda_1 \sum_{j=1}^{p}\left|\beta_j\right| + \lambda_2 \sum_{j=1}^{p}\beta_j^2\]
This structure combines the advantages of both architectures, which makes the Elastic Net particularly suitable for high-dimensional data where both variable selection and strongly correlated predictors need to be handled. The parameters \(\lambda_1\) and \(\lambda_2\) can be used to adjust the relative strength of the two penalties.
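In scikit-learn, the Elastic Net is parameterized slightly differently: instead of two separate \(\lambda\) values, it uses an overall strength alpha and a mixing parameter l1_ratio that controls the balance between the L1 and L2 penalties. A minimal sketch on synthetic data could look like this:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=42)

# alpha controls the overall regularization strength,
# l1_ratio the mix between the L1 and L2 penalty (0.5 weights them equally)
model = ElasticNet(alpha=1.0, l1_ratio=0.5)
model.fit(X, y)
print(model.coef_)
```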
How can you implement the Ridge Regression in Python?
With the help of scikit-learn, ridge regression can be used very easily in Python. In this tutorial, we will look at a data set on Californian real estate prices and try to train a model that can predict these prices very accurately.
First, we import the necessary libraries. We use NumPy to handle the data set and several modules from scikit-learn for the ridge regression itself, the train-test split, and the subsequent, independent evaluation of the model's performance.
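These imports could look as follows (Matplotlib, which is only needed for the visualization at the end, is imported in that step):

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
```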

We can simply import the data set from the scikit-learn data sets and then save the target variables and the data set with the features separately. We then split the data into training and test data sets in order to finally evaluate the generalization ability of the model by making predictions for new data.
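A sketch of this step; the 80/20 split and the random seed are chosen for illustration:

```python
# Load the California housing data set and separate features and target
data = fetch_california_housing()
X, y = data.data, data.target

# Hold back 20 % of the data as an independent test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```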

Once we have prepared the data, we can now prepare the model that we have already imported. To do this, we set the regularization parameter to 1.0 and initialize the model. This can then be optimized using the training data set and the training labels.
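In scikit-learn, the regularization parameter \(\lambda\) is passed as alpha:

```python
# Initialize the ridge regression with a regularization strength of 1.0
ridge = Ridge(alpha=1.0)

# Fit the model to the training data and training labels
ridge.fit(X_train, y_train)
```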

We now use the trained model to calculate predictions for the test data set.
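This requires only a single call to the trained model:

```python
# Predict the target values for the unseen test data
y_pred = ridge.predict(X_test)
```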

We use the mean squared error to calculate the deviation between the predictions for the test data set and the actual labels. This provides information on how good the prediction quality of the model really is.
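Using the mean_squared_error function imported above, this could look as follows:

```python
# Mean squared error between the predictions and the actual test labels
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")
```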

The regularization parameter \(\lambda\) is a hyperparameter that must be defined before training. It therefore makes sense to train the model with different parameter values and compare their results. In our example, the results for different alphas are identical, so we had already selected the correct alpha value.
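Such a comparison could look like this; the grid of alpha values is chosen purely for illustration:

```python
# Train and evaluate the model for several regularization strengths
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"alpha = {alpha:>6}: MSE = {mean_squared_error(y_test, preds):.4f}")
```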

Finally, as an additional evaluation, the individual coefficients and their size can be visualized using Matplotlib. This provides an insight into which characteristics have the greatest influence on the predictions.
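A sketch of this visualization, assuming the model and data from the previous steps:

```python
import matplotlib.pyplot as plt

# Bar chart of the learned coefficients per feature
plt.figure(figsize=(8, 4))
plt.bar(data.feature_names, ridge.coef_)
plt.xticks(rotation=45, ha="right")
plt.ylabel("Coefficient value")
plt.title("Ridge regression coefficients")
plt.tight_layout()
plt.show()
```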

As you can see, scikit-learn can be used to train a functional ridge regression in just a few lines of code. The performance can be further improved using additional parameters as required.
This is what you should take with you
- Ridge regression is a regularization method that is used to reduce the risk of overfitting when training linear regressions.
- A so-called regularization term is added to the cost function, which penalizes the model if the size of the regression coefficients increases too much. This procedure is intended to prevent the model from adapting too strongly to the training data.
- In general, the coefficients of a ridge regression are smaller than those of a comparable linear regression. Apart from this, the models and their coefficients can be compared similarly.
- Ridge regression differs from lasso regression in that it cannot set the coefficients to zero and therefore does not select variables.
- Ridge regression can be imported into Python using scikit-learn and is easy to use.
Other Articles on the Topic of Ridge Regression
This link will get you to my Deepnote App where you can find all the code that I used in this article and can run it yourself.
