Simulated Data, Real Learnings: Part 1 | by Jarom Hulet

Testing machine learning approaches with simulation

Distributions of model-estimated coefficients on simulated data (image by author)

Simulation is a powerful tool in the data science toolbox. This is the first part of a multi-part series on the ways simulation can be useful in data science and machine learning. In this article, we will cover how simulation can be used to test machine learning approaches.

Specifically, we’ll go over how simulation can be used in the three ways below:

Testing machine learning approaches
Comparing the performance of different machine learning models
Evaluating model behavior in various circumstances

Before diving into this specific application of data simulation, let’s define simulation.

WHAT IS DATA SIMULATION?

Data simulation is simple to define: it is the creation of fictitious data that mimics the properties of real-world data.

When do we want to simulate data?

When we want to know the ‘answer’ to questions that are not observable in the real world. With real-world data, we can only infer the relationship between X and y, but with simulated data we create that relationship ourselves. Knowing the ‘answer’, we can test whether our machine learning and analytical approaches recover the relationship we simulated.
When we don’t have real data, or have very limited data.
When we want to simulate things that have never happened before.
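
The first case can be made concrete with a small sketch. It is not taken from the article: the linear form, the intercept of 2, the slope of 3, the noise level, and the sample size are all assumptions chosen purely for illustration. Because we create the relationship between X and y ourselves, we can check how close a fitted model gets to it.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

# The 'answer': a relationship we choose ourselves (hypothetical values)
true_intercept = 2.0
true_slope = 3.0

# Simulate X, then build y from the known relationship plus random noise
n = 1_000
X = rng.uniform(0, 10, size=n)
noise = rng.normal(loc=0.0, scale=1.0, size=n)
y = true_intercept + true_slope * X + noise

# Fit ordinary least squares on a design matrix with an intercept column
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
est_intercept, est_slope = coef

# Compare what the model estimated with what we simulated
print(f"intercept: true={true_intercept}, estimated={est_intercept:.3f}")
print(f"slope:     true={true_slope}, estimated={est_slope:.3f}")
```

With real data there would be nothing to compare the estimates against. Here, a large gap between the true and estimated values would point to a problem with the modeling approach rather than with the data.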

Simulated data is usually created with some amount of randomness, which we typically draw from probability distributions based on observed data or domain knowledge. For example, if we want to simulate the productivity of orange trees, we could randomly draw from a distribution of orange tree productivity. We could build that distribution empirically (if we have a dataset of many orange trees’ productivity), or we could draw from a statistical distribution that represents it, e.g. orange tree productivity is normally distributed with a mean of 150 lbs and a standard deviation of 24 lbs (I totally made this up, don’t fact check me!).
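
The two options for the orange tree example might look like the sketch below. The mean of 150 lbs and standard deviation of 24 lbs are the made-up figures from the text. The `observed_lbs` array is a hypothetical stand-in for a real dataset, and resampling with replacement is just one simple way to draw from an observed distribution.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the sketch is reproducible
n_trees = 10_000

# Option 1: draw from a statistical distribution (made-up parameters from the text)
mean_lbs, sd_lbs = 150, 24
parametric_trees = rng.normal(loc=mean_lbs, scale=sd_lbs, size=n_trees)

# Option 2: draw from observed data by resampling with replacement.
# `observed_lbs` is invented for illustration; in practice it would be real measurements.
observed_lbs = np.array([131.0, 148.5, 162.0, 155.2, 140.8, 171.3, 149.9, 158.4])
empirical_trees = rng.choice(observed_lbs, size=n_trees, replace=True)

print(f"parametric: mean={parametric_trees.mean():.1f}, sd={parametric_trees.std():.1f}")
print(f"empirical:  mean={empirical_trees.mean():.1f}, sd={empirical_trees.std():.1f}")
```

The parametric draw can produce values that never appeared in any dataset, while the resampled draw can only repeat values that were actually observed. Which is preferable depends on how much real data is available and how much we trust the assumed distribution.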