I recently made a career shift away from data science into a more engineering-heavy role. My team is building a high-quality data warehouse to feed the organization’s BI and ML needs. I took this position because I saw it as an opportunity to use the insights I’ve gained in my data science roles to influence the design and development of a data warehouse with a forward-looking focus.
In nearly every data science role I’ve held over the past six years, I noticed a common theme: the data infrastructure wasn’t designed with data science in mind. Many tables in the data warehouse or lakehouse, usually fact and dimension tables, lacked the fields or structure required to build performant machine learning models. The most prevalent limitation I noticed was that most tables tracked only the current state of an observation rather than its history.
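To make that limitation concrete, here is a minimal sketch of the difference between a current-state-only record and a history-tracked one. The table fields and values here (`customer_id`, `tier`, the validity dates) are hypothetical, purely for illustration:

```python
from datetime import date

# Current-state-only dimension row: each update overwrites the past,
# so a model trained on last year's events sees today's attribute values.
customer_current = {"customer_id": 42, "tier": "premium"}

# History-tracked dimension: every version is kept, with validity ranges.
customer_history = [
    {"customer_id": 42, "tier": "basic",
     "valid_from": date(2023, 1, 1), "valid_to": date(2024, 6, 1)},
    {"customer_id": 42, "tier": "premium",
     "valid_from": date(2024, 6, 1), "valid_to": None},  # None = still current
]

def tier_as_of(history, customer_id, as_of):
    """Return the attribute value that was true on a given date."""
    for row in history:
        if (row["customer_id"] == customer_id
                and row["valid_from"] <= as_of
                and (row["valid_to"] is None or as_of < row["valid_to"])):
            return row["tier"]
    return None

# Training features for a July 2023 event can use the state as it was then:
print(tier_as_of(customer_history, 42, date(2023, 7, 15)))  # prints "basic"
```

With only the current-state row, that July 2023 training example would be silently labeled "premium", leaking future information into the features.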
This article explores this common problem and demonstrates how to address it with a data modeling technique called Slowly Changing Dimensions. By the end, you’ll understand how storing historical data affects your model’s performance, and you’ll have concrete strategies for implementing it in your own use cases.