How to incrementally load data

· Category: Data Science

Short answer

Incremental loading updates only new or changed records since the last pipeline run, reducing processing time and resource consumption.

Steps

  1. Identify a watermark column such as a timestamp or auto-incrementing ID.
  2. Capture the maximum watermark value from the last successful run.
  3. Query source systems for records with watermark values greater than the stored maximum.
  4. Merge or append the delta into the destination table.
  5. Update the stored watermark for the next run.

Tips

  • Use change data capture tools like Debezium for database sources.
  • Handle late-arriving data with lookback windows or reconciliation jobs.
  • Partition destination tables to make delta merges efficient.
  • Track deletes separately if soft deletes are not present in source.

Common issues

  • Missing watermark columns in legacy source systems.
  • Clock skew causing records to be skipped or duplicated.
  • Schema changes breaking incremental extraction logic.
  • Race conditions when watermarks are updated before the load commits.

Example

import pandas as pd
import numpy as np

df = pd.DataFrame({'sales': [100, 150, 200, np.nan]})
df['sales'] = df['sales'].fillna(df['sales'].median())
print(df.describe())

This snippet creates a DataFrame, handles a missing value with the median, and prints summary statistics common in exploratory analysis.