How to use scikit-learn for ML pipelines

· Category: AI & Machine Learning

Short answer

Scikit-learn pipelines chain preprocessing steps and estimators into reusable objects that prevent data leakage and simplify model deployment.

Steps

  1. Assemble preprocessing steps using transformers from sklearn.preprocessing.
  2. Combine transformers with a classifier or regressor inside a Pipeline object.
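  3. Use GridSearchCV or RandomizedSearchCV to tune hyperparameters across the entire pipeline, addressing each step's parameters as <step name>__<parameter> (for example, clf__n_estimators).
  4. Fit the pipeline (or the search object wrapping it) on training data only, so that preprocessing parameters are learned from the training set alone.
  5. Serialize the fitted pipeline (for example, with joblib or pickle) so that inference applies exactly the same preprocessing as training.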
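
The steps above can be sketched end to end as follows. This is a minimal illustration rather than a definitive recipe: the iris dataset, the LogisticRegression estimator, the grid of C values, and the in-memory buffer used for serialization are all assumptions chosen to keep the example self-contained.

```python
import io

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data (assumption); any feature matrix X and labels y would do.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Steps 1-2: preprocessing and estimator combined in a single Pipeline.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step 3: a step's parameters are addressed as "<step name>__<parameter>".
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5)

# Step 4: fit on training data only; the whole pipeline, scaler included,
# is cloned and refit inside every cross-validation fold.
search.fit(X_train, y_train)
best_pipe = search.best_estimator_  # refit on all of X_train with the best C

# Step 5: serialize the fitted pipeline. An in-memory buffer keeps the sketch
# self-contained; passing a file path to joblib.dump works the same way.
buffer = io.BytesIO()
joblib.dump(best_pipe, buffer)
buffer.seek(0)
restored = joblib.load(buffer)

print(search.best_params_)
print(restored.score(X_test, y_test))
```

Because the scaler and the classifier are stored together, the restored object can be called on raw feature values at inference time without repeating any preprocessing by hand.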

Tips

  • Use ColumnTransformer to apply different preprocessing to numerical and categorical features.
  • Create custom transformers by subclassing BaseEstimator and TransformerMixin.
  • Set the memory parameter of Pipeline to cache fitted transformers and avoid redundant computation during cross-validation and grid search.
  • Use FeatureUnion to apply several transformers to the same input in parallel and concatenate their outputs.
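
As a rough sketch of the first two tips, the following combines ColumnTransformer with a small custom transformer. The Log1pTransformer class, the column names, and the toy data are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


# The mixin is listed before BaseEstimator, the order recommended by the
# scikit-learn developer guide.
class Log1pTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical custom transformer: applies log(1 + x) element-wise."""

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data.
        return self

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))


# Made-up toy data with one numerical and one categorical column.
df = pd.DataFrame({
    "income": [20_000, 35_000, 50_000, 120_000, 64_000, 41_000],
    "city": ["Oslo", "Lima", "Oslo", "Kyoto", "Lima", "Kyoto"],
})
y = [0, 0, 1, 1, 1, 0]

# Numerical branch: custom log transform followed by standard scaling.
numeric = Pipeline([
    ("log", Log1pTransformer()),
    ("scale", StandardScaler()),
])

# Each branch receives only the columns listed for it.
preprocess = ColumnTransformer([
    ("num", numeric, ["income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression()),
])
model.fit(df, y)

# One scaled numerical column plus one column per city seen in training.
features = model.named_steps["preprocess"].transform(df)
print(features.shape)
```

Inheriting from BaseEstimator supplies get_params and set_params, and TransformerMixin supplies fit_transform, so the custom step can be cloned and tuned like any built-in transformer.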

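Caching and FeatureUnion might be combined along these lines. The temporary cache directory, the choice of PCA and SelectKBest as parallel feature extractors, the iris dataset, and the grid of C values are assumptions made for the example.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# FeatureUnion fits every transformer on the same input and concatenates
# their outputs column-wise: 2 PCA components + 1 selected feature.
features = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("kbest", SelectKBest(k=1)),
])

cache_dir = mkdtemp()  # temporary directory used as the transformer cache
try:
    pipe = Pipeline(
        [
            ("scale", StandardScaler()),
            ("features", features),
            ("clf", LogisticRegression(max_iter=1000)),
        ],
        memory=cache_dir,
    )
    # Only the classifier's C changes between candidates, so fitted
    # transformers for a given fold can be loaded from the cache.
    search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=3)
    search.fit(X, y)

    # Slice off the final estimator to look at the transformed features.
    n_features = search.best_estimator_[:-1].transform(X).shape[1]
finally:
    rmtree(cache_dir)  # remove the cache once it is no longer needed

print(n_features)
print(search.best_params_)
```

Caching pays off mainly when the transformers are expensive to fit and the searched parameters belong to later steps; for cheap transformers the disk overhead can outweigh the savings.
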
Common issues

  • Fitting preprocessing on the full dataset before splitting causes data leakage.
  • Pipelines become hard to debug when too many custom steps are nested.
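  • Feature columns can differ between training and inference after one-hot encoding, for example when inference data contains categories that were not seen during training.
  • Transformers and estimators that do not support missing values raise errors unless the missing values are imputed in an earlier step.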
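
One way to guard against the last two issues is sketched below, assuming a small made-up dataset with hypothetical columns age and plan. Imputation runs inside the pipeline, and OneHotEncoder is configured with handle_unknown="ignore" so that an unseen category is encoded as all zeros instead of raising an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training data with one missing numerical value.
train = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 33.0, 58.0, 41.0],
    "plan": ["free", "pro", "free", "pro", "free", "pro"],
})
y_train = [0, 1, 0, 1, 0, 1]

# Inference data with a missing value and a category never seen in training.
new = pd.DataFrame({"age": [np.nan, 30.0], "plan": ["enterprise", "free"]})

# Impute before scaling so that no NaN reaches the classifier.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
model.fit(train, y_train)

# The encoded width is fixed at fit time, so both frames yield the same
# number of columns even though "enterprise" was never seen.
train_width = model.named_steps["preprocess"].transform(train).shape[1]
new_width = model.named_steps["preprocess"].transform(new).shape[1]
print(train_width, new_width)
print(model.predict(new))
```

The imputer's median is computed from the training rows only, so the same pipeline also avoids the leakage problem described in the first issue.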

Example

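from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Example data (iris); substitute your own X and y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each step is a (name, estimator) pair; the last step is the model.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=0))
])
pipe.fit(X_train, y_train)          # scaler statistics come from X_train only
print(pipe.score(X_test, y_test))   # mean accuracy on the held-out split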

This pipeline chains scaling and classification, ensuring that preprocessing parameters are learned only from training data.
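
To check this claim for a pipeline of this kind, one can inspect the fitted scaler. The sketch below assumes the iris dataset and a fixed random split, neither of which is essential; it compares the scaler's learned means with the column means of the training split and of the full dataset.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data and split (assumptions for this sketch).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipe.fit(X_train, y_train)

# The fitted scaler is reachable by its step name.
scaler = pipe.named_steps["scale"]

# mean_ holds the per-feature means learned during fit.
matches_train = np.allclose(scaler.mean_, X_train.mean(axis=0))
matches_full = np.allclose(scaler.mean_, X.mean(axis=0))

print("matches training split:", matches_train)
print("matches full dataset:", matches_full)
```

Calling pipe.score or pipe.predict on X_test only applies the stored statistics; the scaler is never refit on test data.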