How to build an ETL pipeline in Python

Category: Data Science

Short answer

An ETL (extract, transform, load) pipeline automates moving data out of source systems, cleaning and transforming it, and loading it into a destination warehouse or database.

Steps

  1. Extract data from sources such as APIs, databases, or flat files.
  2. Validate extracted data for schema conformance and completeness.
  3. Transform data by cleaning, normalizing, aggregating, and joining.
  4. Load transformed data into the target system, with indexes on the columns used for lookups and joins.
  5. Schedule and monitor the pipeline to ensure timely and accurate data delivery.
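
The first four steps can be sketched as one small function per stage. This is a minimal illustration, not a production design: the inline CSV, the orders table, and the function names are all hypothetical, and it uses only the standard library (csv, sqlite3) so it runs without extra dependencies.

```python
import csv
import io
import sqlite3

# Hypothetical inline CSV standing in for a real source (API, database, file).
RAW_CSV = """order_id,customer,amount
1,alice,100.0
2,bob,
3,carol,250.5
"""

REQUIRED_COLUMNS = {"order_id", "customer", "amount"}


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def validate(rows):
    """Validate: check that every row carries the required columns."""
    for row in rows:
        missing = REQUIRED_COLUMNS - set(row)
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
    return rows


def transform(rows):
    """Transform: drop rows without an amount, cast types, normalize names."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skipped here; a real pipeline might quarantine such rows
        cleaned.append(
            (int(row["order_id"]), row["customer"].title(), float(row["amount"]))
        )
    return cleaned


def load(conn, records):
    """Load: write records to the target table; the primary key is indexed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", records)
    conn.commit()
    return len(records)


def run_pipeline(conn, text=RAW_CSV):
    """Run extract, validate, transform, and load in order."""
    rows = validate(extract(text))
    return load(conn, transform(rows))
```

Calling run_pipeline(sqlite3.connect(":memory:")) loads the two complete rows and skips the one with no amount. Step 5 is left out: scheduling is usually handed to an external tool such as cron or a workflow orchestrator rather than written into the pipeline itself.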

Tips

  • Use idempotent loading patterns so re-running the pipeline does not duplicate data.
  • Log every stage with row counts and error details for traceability.
  • Store raw extracts in a data lake before transformation to enable reprocessing.
  • Version control pipeline code and configuration alongside application code.
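
One way to get idempotent loads and per-stage logging is to replace a whole batch inside a single transaction. The sketch below assumes a hypothetical sales table keyed by a batch_date column; the table, the logger name, and load_partition are illustrative names, not part of any library.

```python
import logging
import sqlite3

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl.load")  # hypothetical logger name


def load_partition(conn, batch_date, records):
    """Replace all rows for one batch date inside a single transaction.

    Re-running with the same batch_date deletes the earlier copy first,
    so the table never holds duplicates for that date.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sales "
            "(batch_date TEXT, sku TEXT, amount REAL)"
        )
        deleted = conn.execute(
            "DELETE FROM sales WHERE batch_date = ?", (batch_date,)
        ).rowcount
        conn.executemany(
            "INSERT INTO sales VALUES (?, ?, ?)",
            [(batch_date, sku, amount) for sku, amount in records],
        )
    # Row counts in the log make it easy to spot a short or repeated load.
    log.info(
        "load batch=%s deleted=%d inserted=%d", batch_date, deleted, len(records)
    )
    return len(records)
```

Running load_partition twice with the same date and records leaves the table with one copy of the batch, and the second log line reports how many rows were replaced.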

Common issues

  • Schema drift in source systems breaking downstream transformations.
  • Memory exhaustion when loading large datasets entirely into pandas.
  • Race conditions when multiple pipeline instances run simultaneously.
  • Data type mismatches between source and destination causing load failures.
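
Two of these issues, memory exhaustion and type mismatches, can often be reduced by reading in chunks and declaring column types up front. The following is a sketch under assumed names: the inline CSV, the DTYPES mapping, and total_sales_by_region are hypothetical, while read_csv with its dtype and chunksize parameters is the standard pandas API.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a file too large to load at once.
CSV_TEXT = "region,sales\n" + "\n".join(f"r{i % 3},{i}" for i in range(10))

# Assumed schema: declaring dtypes avoids per-chunk type inference, and a
# value that cannot be converted raises an error at read time.
DTYPES = {"region": str, "sales": "float64"}


def total_sales_by_region(source, chunksize=4):
    """Aggregate sales per region while holding only one chunk in memory."""
    totals = {}
    # With chunksize set, read_csv yields DataFrames of at most that many rows.
    for chunk in pd.read_csv(source, dtype=DTYPES, chunksize=chunksize):
        for region, amount in chunk.groupby("region")["sales"].sum().items():
            totals[region] = totals.get(region, 0.0) + float(amount)
    return totals
```

Passing io.StringIO(CSV_TEXT) as the source exercises the function on the ten sample rows in three chunks. The same loop works on a file path, and the chunk size is a tuning knob that trades memory use against per-chunk overhead.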

Example

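import pandas as pd
import numpy as np

# Sample data with one missing value
df = pd.DataFrame({'sales': [100, 150, 200, np.nan]})

# Fill the missing value with the median of the observed values
df['sales'] = df['sales'].fillna(df['sales'].median())

# Print summary statistics (count, mean, std, min, quartiles, max)
print(df.describe())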

This snippet shows a small cleaning step of the kind used in the transform stage: it creates a DataFrame, fills the missing value with the column median (150), and prints summary statistics as a quick sanity check.