How to incrementally load data

Question

QA Hub Editorial · Accepted Answer

Short answer

Incremental loading updates only new or changed records since the last pipeline run, reducing processing time and resource consumption.

Steps

Identify a watermark column such as a timestamp or auto-incrementing ID.
Capture the maximum watermark value from the last successful run.
Query source systems for records with watermark values greater than the stored maximum.
Merge or append the delta into the destination table.
Update the stored watermark for the next run.

Tips

Use change data capture tools like Debezium for database sources.
Handle late-arriving data with lookback windows or reconciliation jobs.
Partition destination tables to make delta merges efficient.
Track deletes separately if soft deletes are not present in source.

Common issues

Missing watermark columns in legacy source systems.
Clock skew causing records to be skipped or duplicated.
Schema changes breaking incremental extraction logic.
Race conditions when watermarks are updated before the load commits.

Example

import pandas as pd
import numpy as np

df = pd.DataFrame({'sales': [100, 150, 200, np.nan]})
df['sales'] = df['sales'].fillna(df['sales'].median())
print(df.describe())

This snippet creates a DataFrame, handles a missing value with the median, and prints summary statistics common in exploratory analysis.

Short answer

Steps

Tips

Common issues

Example

Related Questions

How to optimize slow data pipelines

How to schedule data pipelines with Apache Airflow

How to build an ETL pipeline in Python

What is the difference between precision and recall

What is feature engineering and why is it important

How to preprocess data for machine learning models