How to build an ETL pipeline in Python
Category: Data Science
Short answer
An ETL (extract, transform, load) pipeline automates moving data from source systems, through cleaning and transformation, into a destination warehouse or database.
Steps
- Extract data from sources such as APIs, databases, or flat files.
- Validate extracted data for schema conformance and completeness.
- Transform data by cleaning, normalizing, aggregating, and joining.
- Load transformed data into the target system with appropriate indexing.
- Schedule and monitor the pipeline to ensure timely and accurate data delivery.
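The extract, validate, transform, and load steps above can be sketched with the standard library alone. This is a minimal illustration, not a production design: the sample CSV, the `orders` table, the column names, and every function name are assumptions made for the example, and an in-memory SQLite database stands in for the target system.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract; a real source would be an API, database, or file.
RAW_CSV = """order_id,customer,amount
1,alice,100.0
2,bob,
3,carol,200.5
"""

REQUIRED_COLUMNS = {"order_id", "customer", "amount"}


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def validate(rows):
    """Validate: every row must expose the required columns."""
    for row in rows:
        missing = REQUIRED_COLUMNS - set(row)
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
    return rows


def transform(rows):
    """Transform: drop rows without an amount, cast types, normalize names."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # incomplete row, skipped in this sketch
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().title(), float(row["amount"]))
        )
    return cleaned


def load(conn, records):
    """Load: write records into an indexed target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_orders_customer ON orders (customer)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(conn, text=RAW_CSV):
    """Run all stages in order and return the number of rows loaded."""
    records = transform(validate(extract(text)))
    load(conn, records)
    return len(records)


conn = sqlite3.connect(":memory:")
loaded = run_pipeline(conn)
print(f"loaded {loaded} rows")
```

Scheduling and monitoring are left out of the sketch; in practice `run_pipeline` would typically be invoked by cron or a workflow orchestrator, with its return value recorded as a health metric.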
Tips
- Use idempotent loading patterns so re-running the pipeline does not duplicate data.
- Log every stage with row counts and error details for traceability.
- Store raw extracts in a data lake before transformation to enable reprocessing.
- Keep pipeline code and configuration under version control alongside application code.
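One way to combine the idempotent-loading and logging tips is to key the target table on a stable identifier and upsert into it, logging row counts on each run. The sketch below assumes a hypothetical `orders` table keyed on `order_id` and uses SQLite's `INSERT OR REPLACE`; other databases offer their own upsert or merge syntax.

```python
import logging
import sqlite3

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("etl")


def load_idempotent(conn, records):
    """Upsert records keyed on order_id so a re-run overwrites rather than duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR REPLACE INTO orders (order_id, customer, amount) VALUES (?, ?, ?)",
            records,
        )
    total = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    # Log row counts for traceability.
    log.info("load stage: received=%d rows, table_total=%d rows", len(records), total)
    return total


# Hypothetical transformed rows.
records = [(1, "Alice", 100.0), (3, "Carol", 200.5)]

conn = sqlite3.connect(":memory:")
first_total = load_idempotent(conn, records)
second_total = load_idempotent(conn, records)  # simulated re-run of the pipeline
```

Both calls leave the table with the same two rows, which is the property that makes retries after a partial failure safe.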
Common issues
- Schema drift in source systems breaking downstream transformations.
- Memory exhaustion when loading large datasets entirely into pandas.
- Race conditions when multiple pipeline instances run simultaneously.
- Data type mismatches between source and destination causing load failures.
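Several of these issues can be reduced by reading the source in chunks with explicit dtypes and checking the columns of each chunk. The following is a sketch under assumptions: the sample data, the expected schema, and the `process_in_chunks` helper are hypothetical, and the chunk size of 2 is chosen only to make the chunking visible.

```python
import io

import pandas as pd

# Hypothetical source data; a real pipeline would pass a file path instead.
RAW_CSV = io.StringIO(
    "order_id,customer,amount\n"
    "1,alice,100.0\n"
    "2,bob,150.0\n"
    "3,carol,200.5\n"
    "4,dave,\n"
    "5,erin,50.0\n"
)

# Assumed schema; declaring dtypes up front surfaces type mismatches at read time.
EXPECTED_DTYPES = {"order_id": "int64", "customer": str, "amount": "float64"}


def process_in_chunks(source, chunksize=2):
    """Aggregate amounts chunk by chunk instead of loading the whole file."""
    total_rows = 0
    total_amount = 0.0
    for chunk in pd.read_csv(source, chunksize=chunksize, dtype=EXPECTED_DTYPES):
        # Fail fast on schema drift: columns must match the expected set.
        if set(chunk.columns) != set(EXPECTED_DTYPES):
            raise ValueError(f"unexpected columns: {list(chunk.columns)}")
        total_rows += len(chunk)
        total_amount += chunk["amount"].sum()  # sum() skips NaN by default
    return total_rows, total_amount


rows, amount = process_in_chunks(RAW_CSV)
print(rows, amount)
```

Only one chunk is held in memory at a time, so peak memory depends on `chunksize` rather than on the size of the file. Race conditions between concurrent runs are not addressed here and usually need a lock or an orchestrator-level concurrency limit.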
Example
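import numpy as np
import pandas as pd

# Sample data with one missing value.
df = pd.DataFrame({'sales': [100, 150, 200, np.nan]})
# Fill the missing value with the median of the observed values (150).
df['sales'] = df['sales'].fillna(df['sales'].median())
# Print summary statistics to sanity-check the cleaned column.
print(df.describe())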
This snippet illustrates a single transform step: it creates a DataFrame, fills the missing value with the column median (150), and prints summary statistics, a quick check that is also common in exploratory analysis.