In this article, I’m going to demonstrate the entire logic implemented under the hood of a decision tree regressor (aka regression tree) through a simple example, a flowchart, and code. After reading it, you will have a clear idea of how regression trees work, and you will be more thoughtful and confident when using and tuning them in your next regression challenge.
We will cover the following:
- An awesome introduction to decision trees
- Generating a toy data set for training our regression tree (see the quick sketch just after this list)
- Outlining the logic of the regression tree in the form of a flowchart
- Referencing the flowchart to write the code using NumPy and Pandas and make the first split
- Visualizing the decision tree after our first split, using Plotly
- Generalizing the code using recursion to build the entire regression tree
- Using scikit-learn to perform the same task and compare the results (spoiler: you’ll be so proud that you produced the same output as scikit-learn did!)
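As a teaser for the kind of toy data we'll work with, here is a minimal sketch (illustrative only; the seed, column names, and underlying function are my own choices, and the data set we build later in the article may differ):

```python
import numpy as np
import pandas as pd

# A toy 1-D regression problem: a noisy sine wave.
# Seed, sample size, and noise level are arbitrary illustrative choices.
rng = np.random.default_rng(42)
x = np.sort(rng.uniform(0, 5, size=80))
y = np.sin(x) + rng.normal(scale=0.15, size=x.size)

df = pd.DataFrame({"x": x, "y": y})
print(df.head())
```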
Decision trees are machine learning algorithms that can be used to solve both classification and regression problems. Even though classification and regression are inherently different from each other, decision trees approach both problems in an elegant way, where the ultimate goal at any given node is to find the best split. How that best split is determined is what makes a classification tree and a regression tree different from each other.
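To make that difference concrete, here is a minimal sketch of my own (not the implementation we'll build later in this article) of how a regression tree might score candidate splits on a single feature, using the weighted mean squared error of the two resulting child nodes:

```python
import numpy as np

def split_mse(y_left, y_right):
    """Weighted mean squared error of a candidate split.

    A regression tree prefers the split whose two children have the
    lowest weighted MSE around their own mean predictions.
    """
    n = len(y_left) + len(y_right)
    mse_left = np.mean((y_left - y_left.mean()) ** 2)
    mse_right = np.mean((y_right - y_right.mean()) ** 2)
    return (len(y_left) * mse_left + len(y_right) * mse_right) / n

# Score every threshold halfway between consecutive x values, keep the best.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 0.9, 1.0, 3.9, 4.1])

thresholds = (x[:-1] + x[1:]) / 2
best = min(thresholds, key=lambda t: split_mse(y[x <= t], y[x > t]))
print(best)  # 3.5, the threshold separating the two plateaus in y
```

A classification tree would run the same search over candidate thresholds, but score each one with an impurity measure such as entropy or Gini instead of MSE.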
In my previous article, I touched upon the basics of how a decision tree solves a classification problem. I used a two-class dataset to demonstrate, step by step, how a decision rule is generated at each node using data impurity measures such as entropy, and then implemented a recursive algorithm in Python to output the final decision tree. Not sure if you should add that article to your reading list? Let’s use a decision tree to decide!
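(And if entropy rings only a faint bell, here is a quick refresher sketch; my own illustration, not code from that article.)

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of an array of class labels: -sum(p * log2(p))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()  # class proportions; all strictly positive
    return -np.sum(p * np.log2(p))

print(entropy(np.array([0, 0, 1, 1])))  # 1.0: a maximally impure two-class node
print(entropy(np.array([0, 0, 0, 0])))  # -0.0 (i.e. zero): a pure node
```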