Demystifying CDC: Understanding Change Data Capture in Plain Words | by Antonio Grandinetti | Mar, 2024


Your essential guide to Change Data Capture

In my work in Big Data analysis and Data Engineering, every project is different, but they all follow a well-established pattern: the goal is to build a data platform that collects data from different sources, runs a series of processing steps, and exposes the consolidated data to the people who will use it.


The pattern just described is often summarized by the concepts of Data Lake/Data Lakehouse and ETL (Extract-Transform-Load) flows. The ways of extracting data from source systems fall into two categories:

  • batch: the entire data set is extracted from the source in a single operation
  • streaming: the extraction is performed continuously, monitoring the source for any changes. Data is extracted as soon as it is modified
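To make the difference between the two categories concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the in-memory `SOURCE_TABLE`, the shape of the change records, and the function names `extract_batch` and `extract_streaming` do not come from any real library.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical source table held in memory: primary key -> row.
SOURCE_TABLE = {1: {"name": "Ada"}, 2: {"name": "Linus"}}


def extract_batch(table: Dict[int, dict]) -> List[dict]:
    """Batch: read the entire data set in a single operation."""
    return [{"id": key, **row} for key, row in table.items()]


def extract_streaming(change_feed: Iterable[dict]) -> Iterator[dict]:
    """Streaming: hand over each change as soon as the feed produces it."""
    for change in change_feed:
        yield {"id": change["id"], **change["row"]}


# Batch returns the full snapshot at once.
snapshot = extract_batch(SOURCE_TABLE)

# Streaming processes one change at a time (the feed is a hypothetical list here;
# in practice it would be an open-ended source of changes).
feed = [
    {"id": 2, "row": {"name": "Linus T."}},
    {"id": 3, "row": {"name": "Grace"}},
]
for record in extract_streaming(feed):
    print(record)
```

The batch function always pays the cost of reading everything, while the streaming function only touches what changed, which is the property CDC builds on.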

New technologies, new architectures and new approaches emerge every year, but one method that continues to be used frequently is Change Data Capture.

What is Change Data Capture (CDC)? 🤓

Change data capture is a design pattern that enables you to capture the changes that occur in a data source. It provides a continuous stream of data updates, which can be used for various purposes, such as:

  • Data Lake/Data Lakehouse: Populating a data lake with incremental changes
  • Real-time analytics: Enabling real-time analysis of data changes
  • Event-driven applications: Triggering actions based on data changes
  • Data replication: Keeping multiple copies of data in sync
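As an illustration of the data replication use case, the sketch below applies a stream of change events to a copy of the data. The event format (`op`, `key`, `row`) and the function `apply_change` are assumptions chosen for this example; real CDC tools define their own event schemas.

```python
from typing import Dict, Optional


def apply_change(replica: Dict[int, dict], event: dict) -> None:
    """Apply one change event to an in-memory replica (hypothetical event format)."""
    op = event["op"]
    key = event["key"]
    row: Optional[dict] = event.get("row")
    if op in ("insert", "update"):
        # Inserts and updates both end up storing the latest version of the row.
        replica[key] = row
    elif op == "delete":
        # Ignore deletes for keys the replica never saw.
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op!r}")


replica: Dict[int, dict] = {}
events = [
    {"op": "insert", "key": 1, "row": {"name": "Ada"}},
    {"op": "insert", "key": 2, "row": {"name": "Linus"}},
    {"op": "update", "key": 1, "row": {"name": "Ada L."}},
    {"op": "delete", "key": 2},
]
for event in events:
    apply_change(replica, event)
```

Because each event carries only what changed, the replica stays in sync without ever re-reading the full source.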

How does CDC work? 🧐

There are many ways to implement this pattern, but modern approaches combine two concepts:

  • Transaction log: databases record every operation performed on the data in a log
  • Pub/sub queues: the CDC system periodically polls the data source for changes (new entries in the transaction log) and then publishes them to a queue
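The two concepts above can be sketched together in a few lines of Python. This is a toy model under stated assumptions, not how any particular database or CDC product works: `TransactionLog`, `CdcPoller`, and the `lsn` (log sequence number) field are hypothetical names, and the standard library `queue.Queue` stands in for a real pub/sub system.

```python
import queue
from typing import List, Optional


class TransactionLog:
    """Hypothetical append-only log of operations, numbered by an increasing lsn."""

    def __init__(self) -> None:
        self._entries: List[dict] = []

    def append(self, op: str, key: int, row: Optional[dict] = None) -> None:
        lsn = len(self._entries) + 1
        self._entries.append({"lsn": lsn, "op": op, "key": key, "row": row})

    def read_after(self, lsn: int) -> List[dict]:
        """Return every entry written after the given position."""
        return [entry for entry in self._entries if entry["lsn"] > lsn]


class CdcPoller:
    """Polls the log for new entries and publishes them to a queue."""

    def __init__(self, log: TransactionLog, out: "queue.Queue[dict]") -> None:
        self.log = log
        self.out = out
        self.last_lsn = 0  # position of the last entry already published

    def poll_once(self) -> int:
        new_entries = self.log.read_after(self.last_lsn)
        for entry in new_entries:
            self.out.put(entry)
            self.last_lsn = entry["lsn"]
        return len(new_entries)


log = TransactionLog()
changes: "queue.Queue[dict]" = queue.Queue()
poller = CdcPoller(log, changes)

log.append("insert", 1, {"name": "Ada"})
log.append("update", 1, {"name": "Ada L."})
poller.poll_once()  # publishes the two entries written so far

log.append("delete", 1)
poller.poll_once()  # publishes only the new delete entry
```

Tracking `last_lsn` is what keeps each poll incremental: the poller never republishes an entry it has already sent, and consumers read from the queue without touching the source database.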

This approach involves using several components and is ideal for use cases where real-time and decoupled…
