How to use scikit-learn for ML pipelines

Question

QA Hub Editorial · Accepted Answer

Short answer

Scikit-learn pipelines chain preprocessing steps and estimators into reusable objects that prevent data leakage and simplify model deployment.

Steps

Assemble preprocessing steps using transformers from sklearn.preprocessing.
Combine transformers with a classifier or regressor inside a Pipeline object.
Use GridSearchCV or RandomizedSearchCV to tune hyperparameters across the entire pipeline.
Fit the pipeline (or the search object that wraps it) on the training data only, so that preprocessing parameters are learned from the training set alone.
Serialize the fitted pipeline for consistent preprocessing during inference.
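
The steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive recipe: the synthetic dataset, the choice of LogisticRegression, the parameter grid, and the temporary file path are all assumptions made for the example.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Steps 1-2: a transformer followed by an estimator inside one Pipeline.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step 3: tune across the whole pipeline; parameters are addressed
# as <step name>__<parameter name>.
search = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=3)

# Step 4: fit on training data only; the scaler is refitted inside each CV fold.
search.fit(X_train, y_train)
best_pipe = search.best_estimator_

# Step 5: serialize the fitted pipeline and load it back for inference.
model_path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")
joblib.dump(best_pipe, model_path)
restored = joblib.load(model_path)
print(restored.score(X_test, y_test))
```

Because the scaler travels inside the serialized object, the restored pipeline applies the same preprocessing at inference that it learned during training.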

Tips

Use ColumnTransformer to apply different preprocessing to numerical and categorical features.
Create custom transformers by subclassing BaseEstimator and TransformerMixin.
Set the memory parameter of Pipeline to cache fitted transformers and avoid redundant computation during cross-validation.
Use FeatureUnion to combine multiple feature extraction pipelines in parallel.
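
The tips above might be combined along these lines. The sketch is illustrative only: the toy DataFrame, its column names, and the Log1pTransformer class are hypothetical, and caching is shown on a separate pipeline built from scikit-learn's own estimators to keep that part simple.

```python
import tempfile

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class Log1pTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical custom transformer: applies log(1 + x) to its input."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))


# Toy data with made-up columns (assumption for illustration).
df = pd.DataFrame({
    "income": [30.0, 52.0, 41.0, 75.0, 28.0, 66.0],
    "age": [25, 40, 33, 51, 22, 47],
    "city": ["a", "b", "a", "c", "b", "c"],
})
y = [0, 1, 0, 1, 0, 1]

# Custom transformer used like any built-in step.
numeric = Pipeline([("log", Log1pTransformer()), ("scale", StandardScaler())])

# FeatureUnion concatenates the outputs of several extractors side by side.
numeric_features = FeatureUnion([
    ("scaled", numeric),
    ("pca", PCA(n_components=1)),
])

# ColumnTransformer routes numerical and categorical columns separately.
preprocess = ColumnTransformer([
    ("num", numeric_features, ["income", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
pipe.fit(df, y)
features = pipe.named_steps["prep"].transform(df)
print(features.shape)

# Caching: memory= stores fitted transformers on disk so that repeated fits
# with identical inputs and parameters can reuse them.
cached = Pipeline(
    [("scale", StandardScaler()), ("clf", LogisticRegression())],
    memory=tempfile.mkdtemp(),
)
cached.fit(df[["income", "age"]], y)
```

When memory is set, the pipeline clones its transformers before fitting, so inspect fitted steps through named_steps rather than through the objects originally passed in.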

Common issues

Fitting preprocessing on the full dataset before splitting causes data leakage.
Pipelines become hard to debug when too many custom steps are nested.
Feature names or columns can end up inconsistent between training and inference after one-hot encoding.
Missing values are not handled before the data reaches transformers that do not support them.

Example

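from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration; substitute your own X and y.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=0)),
])
pipe.fit(X_train, y_train)  # the scaler is fitted on X_train only
print(pipe.score(X_test, y_test))  # mean accuracy on the held-out split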

This pipeline chains scaling and classification, ensuring that preprocessing parameters are learned only from training data.
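
The same property carries over to cross-validation. As a rough sketch (the synthetic data and the 5-fold setting are assumptions), passing the whole pipeline to cross_val_score refits the scaler on the training portion of each fold, so the held-out portion never influences the scaling parameters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Each fold clones the pipeline and fits it on that fold's training rows.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```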
