In the previous articles, we covered ensemble learning with voting, bagging, and Random Forest.
Voting itself is only an aggregation mechanism. It does not create diversity; it combines predictions from models that are already different.
Bagging, on the other hand, explicitly creates diversity by training the same base model on multiple bootstrapped versions of the training dataset.
Random Forest extends bagging by additionally restricting the set of features considered at each split.
From a statistical point of view, the idea is simple and intuitive: diversity is created through randomness, without introducing any fundamentally new modeling concept.
But ensemble learning does not stop there.
There exists another family of ensemble methods that does not rely on randomness at all, but on optimization. Gradient Boosting belongs to this family. And to truly understand it, we will start with a deliberately strange idea:
We will apply Gradient Boosting to Linear Regression.
Yes, I know. This is probably the first time you have heard of Gradient Boosted Linear Regression.
(We will see Gradient Boosted Decision Trees tomorrow.)
In this article, here is the plan:
- First, we will step back and revisit the three fundamental steps of machine learning.
- Then, we will introduce the Gradient Boosting algorithm.
- Next, we will apply Gradient Boosting to linear regression.
- Finally, we will reflect on the relationship between Gradient Boosting and Gradient Descent.
1. Machine Learning in Three Steps
To make machine learning easier to learn, I always separate it into three clear steps. Let us apply this framework to Gradient Boosted Linear Regression.
Unlike with bagging, each step here reveals something interesting.
1. Model
A model is something that takes input features and produces an output prediction.
In this article, the base model will be Linear Regression.
1 bis. Ensemble Method Model
Gradient Boosting is not a model itself. It is an ensemble method that aggregates several base models into a single meta-model. On its own, it does not map inputs to outputs. It must be applied to a base model.
Here, Gradient Boosting will be used to aggregate linear regression models.
2. Model fitting
Each base model must be fitted to the training data.
For Linear Regression, fitting means estimating the coefficients. This can be done numerically using Gradient Descent, but also analytically. In Google Sheets or Excel, we can directly use the LINEST function to estimate these coefficients.
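For readers who prefer code, here is a minimal Python (NumPy) sketch of the same analytical fit that LINEST performs for a single feature; the data values below are placeholders, not the article's dataset.

```python
import numpy as np

# Placeholder data, just to illustrate the analytical fit (not the article's dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form OLS for a single feature, which is what LINEST returns:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
intercept = y.mean() - slope * x.mean()
print(slope, intercept)
```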
2 bis. Ensemble model learning
At first, Gradient Boosting may look like a simple aggregation of models. But it is still a learning process. As we will see, it relies on a loss function, exactly like classical models that learn weights.
3. Model tuning
Model tuning consists of optimizing hyperparameters.
In our case, the base model Linear Regression itself has no hyperparameters (unless we use regularized variants such as Ridge or Lasso).
Gradient Boosting, however, introduces two important hyperparameters: the number of boosting steps and the learning rate. We will see this in the next section.
In a nutshell, that is machine learning, made easy, in three steps!
2. Gradient Boosting Regressor algorithm
2.1 Algorithm principle
Here are the main steps of the Gradient Boosting algorithm, applied to regression.
- Initialization: We start with a very simple model. For regression, this is usually the average value of the target variable.
- Residual Errors Calculation: We compute residuals, defined as the difference between the actual values and the current predictions.
- Fitting Linear Regression to Residuals: We fit a new base model (here, a linear regression) to these residuals.
- Update the ensemble: We add this new model to the ensemble, scaled by a learning rate (also called shrinkage).
- Repeating the process: We repeat steps 2 to 4 until we reach the desired number of boosting iterations or until the error converges.
That’s it! This is the basic procedure for Gradient Boosting applied to Linear Regression.
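Before writing the formulas, here is a minimal Python (NumPy) sketch of this loop, assuming a single-feature linear regression as the base model; the dataset and the hyperparameter values are made up for illustration.

```python
import numpy as np

def fit_linear(x, y):
    """Analytical OLS fit of y ≈ a * x + b (the equivalent of LINEST)."""
    a = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    b = y.mean() - a * x.mean()
    return a, b

def gradient_boosted_linear_regression(x, y, n_iterations=2, learning_rate=0.5):
    # Step 1 - Initialization: a constant model, the average of the target
    predictions = np.full_like(y, y.mean(), dtype=float)
    for _ in range(n_iterations):
        # Step 2 - Residuals: actual values minus current predictions
        residuals = y - predictions
        # Step 3 - Fit a new linear regression to the residuals
        a, b = fit_linear(x, residuals)
        # Step 4 - Update the ensemble, scaled by the learning rate
        predictions = predictions + learning_rate * (a * x + b)
    return predictions

# Toy usage with made-up data (the article uses its own 10-point dataset)
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 10)
y = 3.0 * x + 2.0 + rng.normal(0, 1, 10)
print(gradient_boosted_linear_regression(x, y))
```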
2.2 Algorithm expressed with formulas
Now we can write the formulas explicitly; this helps make each step concrete.
Step 1 – Initialization
We start with a constant model equal to the average of the target variable:
f0 = average(y)
Step 2 – Residual computation
We compute the residuals, defined as the difference between the actual values and the current predictions:
r1 = y − f0
Step 3 – Fit a base model to the residuals
We fit a linear regression model to these residuals:
r̂1 = a0 · x + b0
Step 4 – Update the ensemble
We update the model by adding the fitted regression, scaled by the learning rate:
f1 = f0 + learning_rate · (a0 · x + b0)
Next iteration
We repeat the same procedure:
r2 = y − f1
r̂2 = a1 · x + b1
f2 = f1 + learning_rate · (a1 · x + b1)
By expanding this expression, we obtain:
f2 = f0 + learning_rate · (a0 · x + b0) + learning_rate · (a1 · x + b1)
The same process continues at each iteration. Residuals are recomputed, a new model is fitted, and the ensemble is updated by adding this model with a learning rate.
This formulation makes it clear that Gradient Boosting builds the final model as a sum of successive correction models.
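In the same notation, after M boosting iterations the ensemble is the initial average plus the sum of the successive scaled corrections:

```latex
f_M(x) = f_0 + \text{learning\_rate} \cdot \sum_{m=0}^{M-1} \left( a_m \cdot x + b_m \right)
```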
3. Gradient Boosted Linear Regression
3.1 Base model training
We start with a simple linear regression as our base model, using a small dataset of ten observations that I generated.
For fitting the base model, we will use the LINEST function in Google Sheets (it also works in Excel) to estimate the coefficients of the linear regression.

3.2 Gradient Boosting algorithm
The implementation of these formulas is straightforward in Google Sheets or Excel.
The table below shows the training dataset along with the successive steps of the gradient boosting procedure:

For each fitting step, we use the Excel function LINEST:

We will only run 2 iterations, but it is easy to guess how the process continues with more of them. The graphic below shows the model obtained at each iteration. The different shades of red illustrate the convergence of the ensemble, and we also show the final model, found by applying gradient descent directly to y.

3.3 Why Boosting Linear Regression is purely pedagogical
If you look carefully at the algorithm, two important observations emerge.
First, at each iteration we fit a linear regression to the residuals, which takes extra time and algorithmic steps. Instead of fitting a linear regression to the residuals, we could directly fit a linear regression to the actual values of y and immediately obtain the final optimal model!
Second, the sum of two linear regressions is still a linear regression.
For example, we can rewrite f2 as:
f2 = f0 + learning_rate · (b0 + b1) + learning_rate · (a0 + a1) · x
This is still a linear function of x.
This explains why Gradient Boosted Linear Regression does not bring any practical benefit. Its value is purely pedagogical: it helps us understand how the Gradient Boosting algorithm works, but it does not improve predictive performance.
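A tiny numerical check makes this collapse concrete (the learning rate and coefficient values below are placeholders, not the ones from the spreadsheet):

```python
import numpy as np

# Notation from Section 2.2: f2(x) = f0 + lr*(a0*x + b0) + lr*(a1*x + b1)
# which collapses to one line: intercept = f0 + lr*(b0 + b1), slope = lr*(a0 + a1)
lr, f0 = 0.5, 10.0                      # placeholder values
a0, b0, a1, b1 = 2.0, -3.0, 1.0, -1.5   # placeholder coefficients
x = np.linspace(0.0, 5.0, 6)

boosted = f0 + lr * (a0 * x + b0) + lr * (a1 * x + b1)
collapsed = (f0 + lr * (b0 + b1)) + lr * (a0 + a1) * x
print(np.allclose(boosted, collapsed))  # True: the ensemble is still one linear model
```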
In fact, it is even less useful than bagging applied to linear regression. With bagging, the variability between bootstrapped models allows us to estimate prediction uncertainty and construct confidence intervals. Gradient Boosted Linear Regression, on the other hand, collapses back to a single linear model and provides no additional information about uncertainty.
As we will see tomorrow, the situation is very different when the base model is a decision tree.
3.4 Tuning hyperparameters
There are two hyperparameters we can tune: the number of iterations and the learning rate.
For the number of iterations, we only implemented two, but it is easy to imagine more, and we can decide when to stop by examining the magnitude of the residuals.
For the learning rate, we can change its value in Google Sheets and see what happens. When the learning rate is small, the “learning process” is slow. And if the learning rate is exactly 1, convergence is achieved at iteration 1.

And the residuals after iteration 1 are already zero.

If the learning rate is larger than 1, the corrections overshoot the optimal line; in this linear setting, the process still converges (while oscillating) as long as the learning rate stays below 2, and it diverges beyond that.
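A short derivation explains these observations (a sketch, valid in this linear setting: every f_k is an affine function of x, so the regression fitted to the residuals y − f_k is exactly f̂_OLS − f_k, where f̂_OLS is the linear regression fitted directly to y and η is the learning rate):

```latex
f_{k+1} = f_k + \eta \left( \hat{f}_{\mathrm{OLS}} - f_k \right)
\quad\Longrightarrow\quad
f_{k+1} - \hat{f}_{\mathrm{OLS}} = (1 - \eta)\left( f_k - \hat{f}_{\mathrm{OLS}} \right)
```

The gap to the optimal line is multiplied by (1 − η) at every iteration: η = 1 converges in a single step, 0 < η < 2 converges, and beyond 2 the gap grows.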

4. Boosting as Gradient Descent in Function Space
4.1 Comparison with Gradient Descent Algorithm
At first glance, the role of the learning rate and the number of iterations in Gradient Boosting looks very similar to what we see in Gradient Descent. This naturally leads to confusion.
- Beginners often notice that both algorithms contain the word “gradient” and follow an iterative procedure. It is therefore tempting to assume that Gradient Descent and Gradient Boosting are closely related, without really knowing why.
- Experienced practitioners usually react differently. From their perspective, the two methods appear unrelated. Gradient Descent is used to fit weight-based models by optimizing their parameters, while Gradient Boosting is an ensemble method that combines multiple models fitted to the residuals. The use cases, the implementations, and the intuition seem completely different.
- At a deeper level, however, experts will say that these two algorithms are in fact the same optimization idea. The difference does not lie in the learning rule, but in the space where this rule is applied. In other words, the variable of interest is different.
Gradient Descent performs gradient-based updates in parameter space. Gradient Boosting performs gradient-based updates in function space.
That is the only difference between these two numerical optimization procedures. Let’s look at the equations for the regression case and for the general case below.
4.2 The Mean Squared Error Case: Same Algorithm, Different Space
With the Mean Squared Error, Gradient Descent and Gradient Boosting minimize the same objective and are driven by the same quantity: the residual.
In Gradient Descent, residuals influence the updates of the model parameters.
In Gradient Boosting, residuals directly update the prediction function.
In both cases, the learning rate and the number of iterations play the same role. The difference lies only in where the update is applied: parameter space versus function space.
Once this distinction is clear, it becomes evident that Gradient Boosting with MSE is simply Gradient Descent expressed at the level of functions.
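Written side by side for the squared error loss L = ½ Σᵢ (yᵢ − ŷᵢ)², the two update rules look like this (a sketch of the correspondence, with η the learning rate, θ the parameters of a weight-based model, and F the prediction function):

```latex
\text{Gradient Descent (parameter space):}\quad
\theta \leftarrow \theta - \eta \, \nabla_{\theta} L
= \theta + \eta \sum_i (y_i - \hat{y}_i)\, \nabla_{\theta} \hat{y}_i

\text{Gradient Boosting (function space):}\quad
F \leftarrow F - \eta \, \nabla_{F} L
\approx F + \eta \, \hat{h}, \qquad \hat{h} \text{ fitted to the residuals } y - F
```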

4.3 Gradient Boosting with any loss function
The comparison above is not limited to the Mean Squared Error. Both Gradient Descent and Gradient Boosting can be defined with respect to different loss functions.
In Gradient Descent, the loss is defined in parameter space. This requires the model to be differentiable with respect to its parameters, which naturally restricts the method to weight-based models.
In Gradient Boosting, the loss is defined in prediction space. Only the loss must be differentiable with respect to the predictions. The base model itself does not need to be differentiable, and of course, it does not need to have its own loss function.
This explains why Gradient Boosting can combine arbitrary loss functions with non–weight-based models such as decision trees.
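In formula form, the standard Gradient Boosting recipe fits each new base model hₘ to the pseudo-residuals, the negative gradient of the loss with respect to the current predictions, and then adds it to the ensemble scaled by the learning rate η:

```latex
r_i^{(m)} = -\left.\frac{\partial \, L\big(y_i,\, F(x_i)\big)}{\partial \, F(x_i)}\right|_{F = F_{m-1}},
\qquad
F_m(x) = F_{m-1}(x) + \eta \, h_m(x)
```

With the squared error loss (and a ½ factor), the pseudo-residual is exactly the ordinary residual yᵢ − F_{m−1}(xᵢ), which recovers the algorithm of Section 2.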

Conclusion
Gradient Boosting is not just a naive ensemble technique but an optimization algorithm. It follows the same learning logic as Gradient Descent, differing only in the space where the optimization is performed: parameters versus functions. Using linear regression allowed us to isolate this mechanism in its simplest form.
In the next article, we will see how this framework becomes truly powerful when the base model is a decision tree, leading to Gradient Boosted Decision Tree Regressors.
All the Excel files are available through this Kofi link. Your support means a lot to me. The price will increase during the month, so early supporters get the best value.
