How to incrementally load data
· Category: Data Science
Short answer
Incremental loading processes only the records that were added or changed since the last successful pipeline run, which cuts processing time and resource consumption compared with a full reload.
Steps
- Identify a watermark column such as a timestamp or auto-incrementing ID.
- Capture the maximum watermark value from the last successful run.
- Query source systems for records with watermark values greater than the stored maximum.
- Merge or append the delta into the destination table.
- Update the stored watermark for the next run.
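The steps above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not a production pipeline: the tables source_orders, dest_orders, and watermarks, the version column, and the incremental_load function are all hypothetical names, and source and destination share one in-memory database for brevity.

```python
import sqlite3

def incremental_load(conn, pipeline="orders"):
    """Copy rows changed since the stored watermark; return how many were loaded."""
    cur = conn.cursor()
    # Read the watermark saved by the last successful run (0 if there is none).
    row = cur.execute(
        "SELECT last_value FROM watermarks WHERE pipeline = ?", (pipeline,)
    ).fetchone()
    last_value = row[0] if row else 0
    # Fetch only rows whose watermark value exceeds the stored maximum.
    delta = cur.execute(
        "SELECT id, amount, version FROM source_orders WHERE version > ?",
        (last_value,),
    ).fetchall()
    if not delta:
        return 0
    # Merge the delta: new ids are inserted, existing ids are overwritten.
    cur.executemany(
        "INSERT OR REPLACE INTO dest_orders (id, amount, version) VALUES (?, ?, ?)",
        delta,
    )
    # Advance the watermark to the highest value just loaded.
    new_value = max(r[2] for r in delta)
    cur.execute(
        "INSERT OR REPLACE INTO watermarks (pipeline, last_value) VALUES (?, ?)",
        (pipeline, new_value),
    )
    conn.commit()
    return len(delta)

# Hypothetical demo data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_orders (id INTEGER PRIMARY KEY, amount REAL, version INTEGER);
CREATE TABLE dest_orders (id INTEGER PRIMARY KEY, amount REAL, version INTEGER);
CREATE TABLE watermarks (pipeline TEXT PRIMARY KEY, last_value INTEGER);
INSERT INTO source_orders VALUES (1, 10.0, 1), (2, 20.0, 2);
""")
first = incremental_load(conn)   # initial run picks up both rows
conn.execute("UPDATE source_orders SET amount = 25.0, version = 3 WHERE id = 2")
conn.execute("INSERT INTO source_orders VALUES (3, 30.0, 4)")
second = incremental_load(conn)  # picks up the changed row and the new row
third = incremental_load(conn)   # nothing changed, so nothing is loaded
print(first, second, third)
```

The delta and the new watermark are committed together at the end, so a failed run leaves the old watermark in place and the next run retries the same range.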
Tips
- Use change data capture tools like Debezium for database sources.
- Handle late-arriving data with lookback windows or reconciliation jobs.
- Partition destination tables to make delta merges efficient.
- Track deletes separately if the source does not use soft deletes, because hard-deleted rows never appear in a watermark query.
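A lookback window, as mentioned in the tips, can be sketched as follows. The idea is to re-read a short interval before the stored watermark and rely on a key-based merge so that re-read rows do not create duplicates. The functions extract_with_lookback and upsert, the one-hour window, and the sample rows are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def extract_with_lookback(rows, watermark, lookback):
    """Return rows updated after (watermark - lookback)."""
    cutoff = watermark - lookback
    return [r for r in rows if r["updated_at"] > cutoff]

def upsert(dest, delta):
    """Merge by primary key so re-read rows overwrite instead of duplicating."""
    for r in delta:
        current = dest.get(r["id"])
        if current is None or r["updated_at"] >= current["updated_at"]:
            dest[r["id"]] = r
    return dest

# Hypothetical state: the previous run stored a watermark of 12:00.
watermark = datetime(2024, 1, 2, 12, 0)
source = [
    {"id": 1, "updated_at": datetime(2024, 1, 2, 9, 0)},    # loaded earlier
    {"id": 2, "updated_at": datetime(2024, 1, 2, 11, 30)},  # arrived late, stamped before the watermark
    {"id": 3, "updated_at": datetime(2024, 1, 2, 13, 0)},   # new since the last run
]
dest = {1: source[0]}

delta = extract_with_lookback(source, watermark, timedelta(hours=1))
upsert(dest, delta)
print(sorted(dest))
```

A strict "greater than watermark" filter would miss the row with id 2. The window size is a trade-off: a wider window catches later arrivals but re-reads more rows on every run.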
Common issues
- Missing watermark columns in legacy source systems.
- Clock skew causing records to be skipped or duplicated.
- Schema changes breaking incremental extraction logic.
- Race conditions when watermarks are updated before the load commits.
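One way to avoid the watermark race condition is to write the delta and the new watermark in a single transaction, so that neither is visible without the other. The sketch below uses Python's built-in sqlite3 module. The tables dest and watermarks, the pipeline name demo, and the load_batch function are hypothetical, and the approach assumes the destination and the watermark store share one transactional database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dest (id INTEGER PRIMARY KEY, val TEXT NOT NULL);
CREATE TABLE watermarks (pipeline TEXT PRIMARY KEY, last_value INTEGER);
INSERT INTO watermarks VALUES ('demo', 0);
""")

def load_batch(conn, rows, new_watermark):
    """Write the delta and the watermark atomically; return True on success."""
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back every statement in it if any statement raises.
        with conn:
            conn.executemany("INSERT INTO dest (id, val) VALUES (?, ?)", rows)
            conn.execute(
                "UPDATE watermarks SET last_value = ? WHERE pipeline = 'demo'",
                (new_watermark,),
            )
        return True
    except sqlite3.Error:
        return False

ok = load_batch(conn, [(1, "a"), (2, "b")], 2)
# The second row violates NOT NULL, so the whole batch is rolled back.
failed = load_batch(conn, [(3, "c"), (4, None)], 4)

stored = conn.execute("SELECT last_value FROM watermarks").fetchone()[0]
count = conn.execute("SELECT COUNT(*) FROM dest").fetchone()[0]
print(ok, failed, stored, count)
```

After the failed batch the watermark still points at the last committed load, so the next run re-reads the same range instead of skipping it. When the destination and the watermark store are separate systems, a common fallback is to update the watermark only after the load is confirmed and to make the load idempotent.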
Example
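import pandas as pd
# Hypothetical source extract; updated_at serves as the watermark column.
source = pd.DataFrame({'id': [1, 2, 3],
                       'updated_at': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03'])})
last_watermark = pd.Timestamp('2024-01-01')  # value stored after the previous run
delta = source[source['updated_at'] > last_watermark]  # rows added or changed since then
new_watermark = delta['updated_at'].max() if not delta.empty else last_watermark
print(delta)
print(new_watermark)
This snippet filters a DataFrame to the rows newer than the stored watermark and computes the watermark to store for the next run. The sample data is illustrative only.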