Gradient Boosting is a Machine Learning method that combines several so-called “weak learners” into a powerful model for classification or regression. In this article, we look at the concepts of boosting and ensemble learning and explain how gradient boosting works.
What are Ensemble Learning and Boosting in Machine Learning?
In Machine Learning, it is not always just individual models that are used. To improve the performance of the overall system, several individual models are sometimes combined into a so-called ensemble. A random forest, for example, consists of many individual decision trees whose results are combined into one overall result. The basic idea behind this is the so-called “Wisdom of Crowds”, which states that the aggregate of several independent estimates is usually better than any single estimate. The idea became famous when, at a country fair in 1906, no single visitor estimated the weight of an ox as accurately as the average of all the individual estimates.
Boosting describes the procedure of sequentially combining multiple models into an ensemble. Using decision trees as an example, a first tree is trained on the training data. For all the data points for which this first tree gives poor or wrong results, a second decision tree is formed, which is trained with a focus on the examples the first one misclassified. This chain is continued, and each subsequent tree in turn concentrates on the cases that led to bad results in the previous trees.

The ensemble of all these decision trees can then provide good results for the entire data set since each model compensates for the weaknesses of the others. It is also said that many “weak learners” are combined into one “strong learner”.
We speak of weak learners because, on their own, they often deliver rather poor results: their accuracy is better than random guessing, but not by much. However, they have the advantage of being cheap to compute in many cases and can therefore be combined easily and inexpensively.
What is Gradient Boosting?
Gradient boosting, in turn, is one of many different boosting algorithms. The basic idea behind it is that each new model should be built in such a way that it further minimizes the loss function of the ensemble.
In the simplest cases, the loss function simply describes the difference between the model’s prediction and the actual value. Suppose we train an AI to predict a house price. The loss function could then simply be the Mean Squared Error between the actual price of the house and the predicted price of the house. Ideally, the function approaches zero over time and our model can predict correct prices.
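To make this loss concrete, the Mean Squared Error can be computed in a few lines. The prices below are made-up numbers purely for illustration:

```python
import numpy as np

# Hypothetical actual and predicted house prices in euros (illustrative only)
actual = np.array([200_000, 350_000, 150_000])
predicted = np.array([170_000, 360_000, 140_000])

# Mean Squared Error: the average of the squared differences
mse = np.mean((actual - predicted) ** 2)
print(mse)
```

The closer this value gets to zero over the course of training, the better the ensemble's predictions match the actual prices.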
New models are added until the prediction and reality hardly differ anymore, i.e. until the loss function has reached a minimum. Each new model tries to predict the error of the previous model.
Let’s go back to our example with house prices. Assume a property has a living area of 100 m², four rooms, and a garage, and costs 200,000 €. The gradient boosting procedure would then look like this:
- Training a regression to predict the purchase price with the features living area, number of rooms, and garage. This model predicts a purchase price of 170,000 € instead of the actual 200,000 €, so the error is 30,000 €.
- Training another regression that predicts the error of the previous model with the features living area, number of rooms, and garage. This model predicts a deviation of 23,000 € instead of the actual 30,000 €. The remaining error is therefore 7,000 €.
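These two steps can be sketched with scikit-learn. The feature values and prices below are illustrative assumptions, and depth-1 trees stand in for the two regressions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative training data: [living area in m², rooms, garage (0/1)]
X = np.array([[100, 4, 1], [80, 3, 0], [120, 5, 1], [60, 2, 0]], dtype=float)
y = np.array([200_000, 150_000, 260_000, 110_000], dtype=float)

# Step 1: the first model predicts the purchase price itself
model1 = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)
residuals = y - model1.predict(X)

# Step 2: the second model predicts the remaining error of the first
model2 = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, residuals)

# The ensemble prediction is the sum of both models' outputs
pred = model1.predict(X) + model2.predict(X)
```

In a full gradient boosting implementation this chain continues for many rounds, usually with each correction scaled by a learning rate.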
What are the advantages and disadvantages of boosting in general?
The general advantage of boosting is that many weak learners are combined into one strong model. Despite consisting of a large number of small models, boosting algorithms are usually cheaper to compute than comparable neural networks, and this does not necessarily come at the cost of worse results. In some cases, ensemble models can even beat more complex networks in terms of accuracy, which makes them interesting candidates for text or image classification as well.
Furthermore, boosting algorithms such as AdaBoost are also less prone to overfitting. This means that they not only perform well on the training dataset but also classify new data with high accuracy. One explanation is that the stepwise model computation of boosting algorithms does not create the same dependencies as the layers of a neural network, since the individual models are not optimized jointly, as happens during backpropagation.
Due to the stepwise training of the individual models, boosting models often converge slowly and thus require many iterations to deliver good results. Furthermore, they need clean datasets, since the models are sensitive to noise, which should therefore be removed during data preprocessing.
Gradient Boosting vs. AdaBoost
With AdaBoost, many decision trees with only one decision level, so-called decision stumps, are trained sequentially, with each new tree weighted towards the errors of the previous models. Gradient Boosting, on the other hand, tries to minimize the loss function further and further by training each subsequent model on the residual errors of the ensemble so far.

Thus, Gradient Boosting can be used for regression, i.e. the prediction of continuous values, as well as for classification, i.e. the assignment of data points to groups. The classic AdaBoost algorithm, on the other hand, is designed for classification only (although regression variants such as AdaBoost.R2 exist). This is the main difference between the two boosting algorithms; at their core, both try to combine weak learners into a strong model through sequential learning and a higher weighting of incorrect predictions.
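As a sketch of the AdaBoost side, scikit-learn's `AdaBoostClassifier` uses a decision stump (a tree of depth 1) as its default base learner. The dataset here is synthetic, purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic binary classification data standing in for a real problem
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The default base learner is a decision stump (a tree with a single split);
# 50 stumps are trained sequentially, reweighting misclassified samples
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X, y)

print(ada.score(X, y))  # training accuracy of the boosted ensemble
```

Each individual stump barely beats guessing, but the weighted vote of all 50 yields a much stronger classifier.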
Which boosting algorithm to choose?
Choosing the right boosting algorithm depends on several factors, such as the size and complexity of the dataset, the level of interpretability required, and the computational resources available.
Below is a brief overview of three common boosting algorithms:
- AdaBoost (Adaptive Boosting) is a widely used boosting algorithm that combines several weak classifiers into one strong classifier. It assigns weights to the training samples and adjusts these weights in each iteration to focus on the misclassified samples. AdaBoost is a good choice for simple classification tasks with medium-sized datasets.
- XGBoost (Extreme Gradient Boosting) is a popular and powerful boosting algorithm that uses decision trees as base learners. It uses a regularized approach to prevent overfitting and can handle large datasets with high-dimensional features. XGBoost is computationally efficient and can be used for both regression and classification problems.
- Gradient Boosting is a general boosting algorithm that can be used with various loss functions and base learners. It works by iteratively adding weak learners to form a strong learner that minimizes the loss function. Gradient boosting is flexible and can handle different types of data, including categorical features.
In summary, AdaBoost can be a good choice if you have a simple classification task with medium-sized datasets. If you have a large dataset with high-dimensional features and want to avoid overfitting, XGBoost might be a better choice. Gradient boosting is a versatile algorithm that can be used for different types of data and loss functions.
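For the regression use case discussed above, a minimal gradient boosting sketch with scikit-learn might look like this (synthetic data stands in for real house prices):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for house features and prices
X, y = make_regression(n_samples=300, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new tree is fitted to the residual errors of the current ensemble,
# and the learning_rate scales each tree's correction
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=0)
gbr.fit(X_train, y_train)

print(gbr.score(X_test, y_test))  # R² on held-out data
```

The `n_estimators`, `learning_rate`, and `max_depth` values shown are common starting points, not tuned settings.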
This is what you should take with you
- Gradient Boosting is a Machine Learning method from the field of boosting algorithms.
- The goal is to combine several so-called “weak learners” into one powerful model.
- In gradient boosting, several models are trained in succession, each of which tries to predict the errors of the previous models.
Other Articles on the Topic of Gradient Boosting
A tutorial for gradient boosting in Scikit-Learn can be found here.