The Machine Learning “Advent Calendar” Day 21: Gradient Boosted Decision Tree Regressor in Excel


In the previous article, we introduced the core mechanism of Gradient Boosting through Gradient Boosted Linear Regression.

That example was deliberately simple. Its goal was not performance, but understanding.

Using a linear model allowed us to make every step explicit: residuals, updates, and the additive nature of the model. It also made the link with Gradient Descent very clear.

In this article, we move to the setting where Gradient Boosting truly becomes useful in practice: Decision Tree Regressors.

We will reuse the same conceptual framework as before, but the behavior of the algorithm changes in an important way. Unlike linear models, decision trees are non-linear and piecewise constant. When they are combined through Gradient Boosting, they no longer collapse into a single model. Instead, each new tree adds structure and refines the predictions of the previous ones.

For this reason, we will only briefly recap the general Gradient Boosting mechanism and focus instead on what is specific to Gradient Boosted Decision Trees: how trees are trained on residuals, how the ensemble evolves, and why this approach is so powerful.

1. Machine Learning in Three Steps

We will again use the same three-step framework to keep the explanation consistent and intuitive.

Three learning steps in Machine Learning – Image by author

1. Base model

Here, we will use decision tree regressors as our base model.

A decision tree is non-linear by construction. It splits the feature space into regions and assigns a constant prediction to each region.

An important point is that when trees are added together, they do not collapse into a single tree.

Each new tree introduces additional structure to the model.

This is where Gradient Boosting becomes particularly powerful.

1 bis. Ensemble model

Gradient Boosting is the mechanism used to aggregate these base models into a single predictive model.

2. Model fitting

For clarity, we will use decision stumps, meaning trees with a depth of one and a single split.

Each tree is trained to predict the residuals of the previous model.
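
As a minimal sketch of this step in Python (the small dataset and the use of scikit-learn's DecisionTreeRegressor are illustrative assumptions; in the article itself the split is computed directly in Excel):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 1-D dataset
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.0, 2.5, 3.0, 8.0, 9.0])

f0 = y.mean()                                # constant initial model
residuals = y - f0                           # targets for the first stump

stump = DecisionTreeRegressor(max_depth=1)   # depth one: a single split
stump.fit(X, residuals)
print(stump.predict(X))                      # piecewise-constant correction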

2 bis. Ensemble learning

The ensemble itself is built using gradient descent in function space.

Here, the objects being optimized are not parameters but functions, and those functions are decision trees.
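
This is easiest to see with the squared loss. Writing the loss for one observation as

L(y, f) = 1/2 * (y - f)^2

its derivative with respect to the prediction f is f - y, so the negative gradient is y - f: exactly the residual. Fitting a tree to the residuals is therefore taking one gradient descent step, except that the step is a function (a tree) rather than a change in parameters.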

3. Model tuning

Decision trees have several hyperparameters, such as:

  • maximum depth
  • minimum number of samples required to split
  • minimum number of samples per leaf

In this article, we fix the tree depth to one.

At the ensemble level, two additional hyperparameters are essential:

  • the learning rate
  • the number of boosting iterations

These parameters control how fast the model learns and how complex it becomes.
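
As a point of reference, these hyperparameters map directly to scikit-learn's GradientBoostingRegressor (the values below are illustrative, not recommendations):

from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    max_depth=1,           # decision stumps, as in this article
    min_samples_split=2,   # minimum number of samples required to split
    min_samples_leaf=1,    # minimum number of samples per leaf
    learning_rate=0.1,     # how much of each tree's correction is applied
    n_estimators=100,      # number of boosting iterations
)

With its default squared error loss, this is essentially the library counterpart of the model we will build step by step below.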

2. Gradient Boosting Algorithm

The Gradient Boosting algorithm follows a simple and repetitive structure.

2.1 Algorithm overview

Here are the main steps of the Gradient Boosting algorithm (a short code sketch of these steps follows the list):

  1. Initialization
    Start with a constant model. For regression with squared loss, this is the average value of the target.
  2. Residual computation
    Compute the residuals between the current predictions and the observed values.
  3. Fit a weak learner
    Train a decision tree regressor to predict these residuals.
  4. Model update
    Add the new tree to the existing model, scaled by a learning rate.
  5. Repeat
    Iterate until the chosen number of boosting steps is reached or the error stabilizes.
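
Here is the code sketch announced above, a from-scratch version of these five steps in Python (the dataset, the learning rate of 0.1, and the 50 iterations are illustrative choices, not recommendations):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.0, 2.5, 3.0, 8.0, 9.0])

learning_rate = 0.1
n_iterations = 50

f = np.full_like(y, y.mean())                    # 1. start with the average
trees = []

for _ in range(n_iterations):
    residuals = y - f                            # 2. residual computation
    tree = DecisionTreeRegressor(max_depth=1)    # 3. fit a weak learner (stump)
    tree.fit(X, residuals)
    f = f + learning_rate * tree.predict(X)      # 4. scaled model update
    trees.append(tree)                           # 5. repeat

# Final model: initial average plus the scaled sum of all tree corrections
def predict(X_new):
    return y.mean() + learning_rate * sum(t.predict(X_new) for t in trees)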

2.2 Dataset

To illustrate the behavior of Gradient Boosted Trees, we will use several types of datasets that I generated:

  • Piecewise linear data, where the relationship changes by segments
  • Non-linear data, such as curved patterns
  • Binary targets, for classification tasks

For classification, we will start with the squared loss for simplicity. This allows us to reuse the same mechanics as in regression. The loss function can later be replaced by alternatives better suited to classification, such as logistic or exponential loss.

These different datasets help highlight how Gradient Boosting adapts to various data structures and loss functions while relying on the same underlying algorithm.

Datasets for Gradient Boosted Decision Tree Regressor – all images by author

2.3 Initialization

The Gradient Boosting process starts with a constant model.
For regression with squared loss, this initial prediction is simply the average value of the target variable.

This average value represents the best initial prediction before any structure is learned from the features.
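
For instance, with a hypothetical target of 2, 4, and 9, the initial prediction is f0 = (2 + 4 + 9) / 3 = 5 for every observation, whatever its features.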

It is also a good opportunity to recall: almost every regression model can be seen as an improvement over the global average.

  • k-NN looks for similar observations and predicts with the average value of these neighbors.
  • Decision Tree Regressors split the dataset into regions and predict, for a new observation, the average value of the leaf it falls into.
  • Weight-based models adjust feature weights that shift the global average up or down for a given observation.

Here, for Gradient Boosting, we also start with the average value, and we will then see how it is progressively corrected.

2.4 First Tree

The first decision tree is then trained on the residuals of this initial model.

After the initialization, the residuals are just the differences between the observed values and the average.
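
Continuing the small numerical example from the initialization step: with target values 2, 4, and 9 and f0 = 5, the residuals are 2 - 5 = -3, 4 - 5 = -1, and 9 - 5 = 4.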

To build this first tree, we use exactly the same procedure as in the article on Decision Tree Regressors.

The only difference is the target: instead of predicting the original values, the tree predicts the residuals.

This first tree provides the initial correction to the constant model and sets the direction for the boosting process.

2.5 Model update

Once the first tree has been trained on the residuals, we can compute the first improved prediction.

The updated model is obtained by combining the initial prediction and the first tree’s correction:

f1(x) = f0 + learning_rate * h1(x)

where:

  • f0 is the initial prediction, equal to the average value of the target
  • h1(x) is the prediction of the first tree trained on the residuals
  • learning_rate controls how much of this correction is applied

This update step is the core mechanism of Gradient Boosting.
Each tree slightly adjusts the current predictions instead of replacing them, allowing the model to improve progressively and remain stable.
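
To continue the illustrative example: if the stump groups the first two observations on one side of its split, it predicts -2 on that side (the average of the residuals -3 and -1) and 4 on the other. With a learning rate of 0.1, the updated predictions are 5 + 0.1 * (-2) = 4.8, 4.8, and 5 + 0.1 * 4 = 5.4. Each correction is small by design; many trees are needed to close the gap.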

2.6 Repeating the Process

Once the first update has been applied, the same procedure is repeated.

At each iteration, new residuals are computed using the current predictions, and a new decision tree is trained to predict these residuals. This tree is then added to the model using the learning rate.

To make this process easier to follow in Excel, the formulas can be written so that they update automatically. Once this is done, the formulas for the second tree and all subsequent trees can simply be copied to the right.

As the iterations progress, all the predictions of the residual models are grouped together. This makes the structure of the final model very clear.

At the end, the prediction can be written in a compact form:

f(x) = f0 + learning_rate * (h1(x) + h2(x) + h3(x) + …)

This representation highlights an important idea: the final model is simply the initial prediction plus a weighted sum of residual predictions.

It also opens the door to possible extensions. For example, the learning rate does not have to be constant: it can decrease over the iterations, following a decay schedule.

This is the same idea as the learning-rate decay used in gradient descent or stochastic gradient descent.
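
As a sketch, here is one possible decay schedule (the functional form and the values are illustrative assumptions, not part of the standard algorithm):

# Inverse decay: the step size shrinks as the iterations progress
eta_0, decay = 0.3, 0.05
etas = [eta_0 / (1 + decay * m) for m in range(50)]
# In the boosting loop, the update then becomes:
# f = f + etas[m] * tree.predict(X)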

3. Understanding the Final Model

3.1 How the model evolves across iterations

We start with the piecewise linear dataset. In the visualization below, we can see all the intermediate models produced during the Gradient Boosting process.

First, we see the initial constant prediction, equal to the average value of the target.

Then comes f1, obtained after adding the first tree with a single split.

Next, f2, after adding a second tree, and so on.

Each new tree introduces a local correction. As more trees are added, the model progressively adapts to the structure of the data.

The same behavior appears with a curved dataset. Even though each individual tree is piecewise constant, their additive combination results in a smooth curve that follows the underlying pattern.

When applied to a binary target, the algorithm still works, but some predictions can become negative or greater than one. This is expected when using squared error loss, which treats the problem as regression and does not constrain the output range.

If probability-like outputs are required, a classification-oriented loss function, such as logistic loss, should be used instead.
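
As a sketch of how the mechanics would change with the logistic loss (assuming labels in {0, 1} and keeping the boosted sum f as a raw score), the pseudo-residual becomes y - sigmoid(f) instead of y - f, and probabilities are read off as sigmoid(f):

import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

y = np.array([0.0, 1.0, 1.0])        # hypothetical binary labels
f = np.array([0.0, 0.0, 0.0])        # illustrative raw scores
pseudo_residuals = y - sigmoid(f)    # what the next tree would be fitted to
print(pseudo_residuals)              # [-0.5  0.5  0.5]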

In conclusion, Gradient Boosting can be applied to different types of datasets, including piecewise, non-linear, and binary cases. Regardless of the dataset, the final model remains piecewise constant by construction, since it is built as a sum of decision trees.

However, the accumulation of many small corrections allows the overall prediction to closely approximate complex patterns.

3.2 Comparison with a single decision tree

When showing these plots, a natural question often arises:
Does Gradient Boosting not end up creating a tree, just like a Decision Tree Regressor?

This impression is understandable, especially when working with a small dataset. Visually, the final prediction can look similar, which makes the two approaches harder to distinguish at first glance.

However, the difference becomes clear when we look at how the splits are computed.

A single Decision Tree Regressor is built through a sequence of splits. At each split, the available data is divided into smaller subsets. As the tree grows, each new decision is based on fewer and fewer observations, which can make the model sensitive to noise.

Once a split is made, data points that fall into different regions are no longer related. Each region is treated independently, and early decisions cannot be revised.

Gradient Boosted Trees work in a completely different way.

Each tree in the boosting process is trained using the entire dataset. No observation is ever removed from the learning process. At every iteration, all data points contribute through their residuals.

This changes the behavior of the model fundamentally.

A single tree makes hard, irreversible decisions. Gradient Boosting, on the other hand, allows later trees to correct the mistakes made by earlier ones.

Instead of committing to one rigid partition of the feature space, the model progressively refines its predictions through a sequence of small adjustments.

This ability to revise and improve earlier decisions is one of the key reasons why Gradient Boosted Trees are both robust and powerful in practice.

3.3 General comparison with other models

Compared to a single decision tree, Gradient Boosted Trees produce smoother predictions, reduce overfitting, and improve generalization.

Compared to linear models, they naturally capture non-linear patterns, automatically model feature interactions, and require no manual feature engineering.

Compared to non-linear weight-based models, such as kernel methods or neural networks, Gradient Boosted Trees offer a different set of trade-offs. They rely on simple, interpretable building blocks, are less sensitive to feature scaling, and require fewer assumptions about the structure of the data. In many practical situations, they also train faster and require less tuning.

These combined properties explain why Gradient Boosted Decision Tree Regressors perform so well across a wide range of real-world applications.

Conclusion

In this article, we showed how Gradient Boosting builds powerful models by combining simple decision trees trained on residuals. Starting from a constant prediction, the model is refined step by step through small, local corrections.

We saw that this approach adapts naturally to different types of datasets and that the choice of the loss function is essential, especially for classification tasks.

By combining the flexibility of trees with the stability of boosting, Gradient Boosted Decision Trees achieve strong performance in practice while remaining conceptually simple and interpretable.
