3 Essential Questions to Address When Building an API-Involved Incremental Data Loading Script | by Daniel Khoa Le | Jun, 2024



Now let’s say you have already extracted a batch of records by making API requests with the above-mentioned params. It’s time to decide how you want to write them to the destination table.

👉 Answer: Merge/Dedup mode (recommended)

This question concerns the choice of Write disposition or Sync mode. The immediate answer is that, given you are looking to load your data incrementally, you will likely opt to write your extracted data in either append mode or merge mode (also known as deduplication mode).

However, let’s step back to examine our options more closely and determine which method is best suited for incremental loading.

Here are the popular write dispositions.

  • 🟪 overwrite/replace: drop all existing records in the destination tables and then insert the extracted records.
  • 🟪 append: simply append extracted records to the destination tables.
  • 🟪 merge / dedup: insert new(*) records and update(**) existing records.

(*) How do we know which records are new? Usually, we use a primary key to determine that. If you use dlt, the merge strategy can be more sophisticated than that, including the distinction between merge_key and primary_key (one is used for merging and one is used for deduplication before merging) and dedup_sort (which decides which of the records sharing a key are removed in the dedup process). I will leave that part for another tutorial.

(**) This is a simplified explanation. If you want to find out more about how dlt handles this merge strategy, see the dlt documentation.
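To make the three write dispositions concrete, here is a minimal pure-Python sketch. It does not use dlt or any real database; the function names, the `id` primary key, and the sample `sales` values are all assumptions made for illustration.

```python
from typing import Dict, List

Record = Dict[str, object]


def overwrite(destination: List[Record], extracted: List[Record]) -> List[Record]:
    """Drop everything at the destination and keep only the extracted records."""
    return list(extracted)


def append(destination: List[Record], extracted: List[Record]) -> List[Record]:
    """Keep existing records untouched and add the extracted ones at the end."""
    return destination + extracted


def merge(destination: List[Record], extracted: List[Record],
          primary_key: str = "id") -> List[Record]:
    """Update records whose primary key already exists, insert the rest."""
    by_key = {row[primary_key]: row for row in destination}
    for row in extracted:
        by_key[row[primary_key]] = row  # replaces an existing key or adds a new one
    return list(by_key.values())


# Hypothetical sample data, not taken from the article's images.
destination = [{"id": 1, "sales": 100}, {"id": 2, "sales": 200}]
extracted = [{"id": 1, "sales": 150}, {"id": 3, "sales": 300}]

print(overwrite(destination, extracted))
print(append(destination, extracted))
print(merge(destination, extracted))
```

Real tools implement merge inside the destination database rather than in memory, so treat this only as a model of the behaviour.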

👁️👁️ Here is an example to help us understand the results of different write dispositions.

↪️ On 2024.06.19: We make the first sync.

🅰️ Data in source application️️

Image by Author

🅱️ Data loaded to our destination database

No matter what sync strategy you choose, the table at the destination is literally a copy of the source table.

Image by Author

Saved state: updated_at = 2024-06-03, which is the latest updated_at among the 2 records we synced.
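As a rough sketch of how that state could be derived, the snippet below takes the maximum updated_at of the synced records. The helper name `latest_cursor`, the `state` dictionary, and the per-record values are assumptions; only the resulting 2024-06-03 comes from the example above.

```python
from datetime import date

# Hypothetical records from the first sync; only the latest date matches the article.
synced_records = [
    {"id": 1, "sales": 100, "updated_at": date(2024, 6, 1)},
    {"id": 2, "sales": 200, "updated_at": date(2024, 6, 3)},
]


def latest_cursor(records, cursor_field="updated_at"):
    """Return the highest cursor value seen in this batch."""
    return max(row[cursor_field] for row in records)


# The state is what the next sync will use as its lower bound.
state = {"updated_at": latest_cursor(synced_records)}
print(state["updated_at"].isoformat())
```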

↪️ On 2024.06.2: We make the second sync.

🅰️ Data in source application

Image by Author

✍️ Changes in the source table:

  • Record id=1 was updated (sales figure).
  • Record id=2 was dropped.
  • Record id=3 was inserted.

At this sync, we ONLY extract records with updated_at > 2024-06-03 (the state saved from the last sync). Therefore, we extract only records id=1 and id=3. Since record id=2 was removed from the source data, there is no way for us to recognize this change.

With the second sync, you will now see the difference among the write strategies.
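The blind spot around deleted records can be illustrated with a small sketch. Here `extract_incremental` is a hypothetical stand-in for the filtered API request, and the second-sync dates and sales values are made up. The incremental extract looks identical whether id=2 was deleted or simply left unchanged.

```python
from datetime import date


def extract_incremental(source_rows, last_value, cursor_field="updated_at"):
    """Keep only rows whose cursor is strictly newer than the saved state."""
    return [row for row in source_rows if row[cursor_field] > last_value]


saved_state = date(2024, 6, 3)

# Hypothetical rows changed after the first sync.
changed = [
    {"id": 1, "sales": 150, "updated_at": date(2024, 6, 20)},
    {"id": 3, "sales": 300, "updated_at": date(2024, 6, 20)},
]
untouched_id_2 = {"id": 2, "sales": 200, "updated_at": date(2024, 6, 3)}

source_with_delete = changed                        # id=2 was dropped
source_without_delete = changed + [untouched_id_2]  # id=2 still exists, unchanged

extract_a = extract_incremental(source_with_delete, saved_state)
extract_b = extract_incremental(source_without_delete, saved_state)

print([row["id"] for row in extract_a])
print(extract_a == extract_b)  # both cases yield the same extract
```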

🅱️ Data loaded to our destination database

Scenario 1: Overwrite

Image by Author

The destination table will be overwritten by the 2 records extracted this time.

Scenario 2: Append

Image by Author

The 2 extracted records will be appended to the destination table; the existing records are not affected. As a result, record id=1 now appears twice, once with its old values and once with its new ones.

Scenario 3: Merge or dedup

Image by Author

Of the 2 extracted records, id=1 replaces the existing record at the destination and id=3 is inserted as a new record. This process is called merging or deduplicating. Record id=2 in the destination table remains intact.
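In a SQL destination, this scenario is commonly expressed as an upsert. Below is a minimal sketch using Python's built-in sqlite3 module and SQLite's `ON CONFLICT ... DO UPDATE` clause (available in SQLite 3.24 and later). The table name `destination`, its columns, and all values are assumptions for illustration, and this is not necessarily how dlt performs the merge internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE destination (id INTEGER PRIMARY KEY, sales INTEGER, updated_at TEXT)"
)

# State of the destination after the first sync (hypothetical values).
conn.executemany(
    "INSERT INTO destination (id, sales, updated_at) VALUES (?, ?, ?)",
    [(1, 100, "2024-06-01"), (2, 200, "2024-06-03")],
)

# Records extracted in the second sync (hypothetical values).
extracted = [(1, 150, "2024-06-20"), (3, 300, "2024-06-20")]

# Upsert: insert new ids, update the row when the id already exists.
conn.executemany(
    """
    INSERT INTO destination (id, sales, updated_at) VALUES (?, ?, ?)
    ON CONFLICT(id) DO UPDATE SET
        sales = excluded.sales,
        updated_at = excluded.updated_at
    """,
    extracted,
)

rows = conn.execute(
    "SELECT id, sales, updated_at FROM destination ORDER BY id"
).fetchall()
print(rows)
```

Record id=1 carries the new values, id=3 is added, and id=2 is left as it was, matching the scenario described above.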

🟢 Takeaways: The merge (dedup) strategy can be effective in the incremental data loading pipeline, but if your table is very large, this dedup process might take a considerable amount of time.
