How to use scikit-learn for ML pipelines
Category: AI & Machine Learning
Short answer
Scikit-learn pipelines chain preprocessing steps and estimators into reusable objects that prevent data leakage and simplify model deployment.
Steps
- Assemble preprocessing steps using transformers from sklearn.preprocessing.
- Combine transformers with a classifier or regressor inside a Pipeline object.
- Use GridSearchCV or RandomizedSearchCV to tune hyperparameters across the entire pipeline.
- Fit the pipeline on training data so that preprocessing learns parameters only from training.
- Serialize the fitted pipeline for consistent preprocessing during inference.
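The steps above can be sketched end to end as follows. This is a minimal illustration, not a prescribed setup: the breast cancer dataset, the LogisticRegression estimator, the clf__C grid, and the in-memory buffer used in place of a file on disk are all assumptions chosen to keep the example self-contained.

```python
import io

import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data (assumption); any (X, y) pair works the same way.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing and estimator combined in one Pipeline.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tune across the whole pipeline: "<step name>__<parameter>" addresses a step.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5)

# Fit on training data only; the scaler is refit inside every CV fold.
search.fit(X_train, y_train)
best_pipe = search.best_estimator_

# Serialize the fitted pipeline (a BytesIO buffer stands in for a file path).
buffer = io.BytesIO()
joblib.dump(best_pipe, buffer)
buffer.seek(0)
restored = joblib.load(buffer)

test_accuracy = restored.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Because the search wraps the whole pipeline, the scaler never sees the validation fold of any split, and the restored object applies the same scaling at inference as during training.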
Tips
- Use ColumnTransformer to apply different preprocessing to numerical and categorical features.
- Create custom transformers by subclassing BaseEstimator and TransformerMixin.
- Set the memory parameter of Pipeline to cache fitted transformers and avoid redundant computation during cross-validation.
- Use FeatureUnion to combine multiple feature extraction pipelines in parallel.
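The tips above can be combined in one small sketch. Everything specific here is hypothetical and chosen only for illustration: the toy DataFrame, its column names, the Log1pTransformer class, and the choice of PCA as a second feature extractor. Caching is attached to an inner pipeline made of built-in steps, using a temporary directory as the cache location.

```python
import tempfile

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class Log1pTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical custom transformer: applies log(1 + x) element-wise."""

    def fit(self, X, y=None):
        return self  # stateless, so there is nothing to learn

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))


# Made-up toy data with two numeric columns and one categorical column.
df = pd.DataFrame({
    "income": [30.0, 45.0, 80.0, 120.0, 60.0, 95.0],
    "age": [22, 35, 47, 51, 29, 40],
    "city": ["a", "b", "a", "c", "b", "c"],
})
y = np.array([0, 0, 1, 1, 0, 1])

# memory=<directory> caches the fitted non-final steps of this pipeline.
cache_dir = tempfile.mkdtemp()  # remove this directory when no longer needed
scaled_pca = Pipeline(
    [("scale", StandardScaler()), ("pca", PCA(n_components=1))],
    memory=cache_dir,
)

# FeatureUnion runs both extractors on the same input and concatenates columns.
numeric_features = FeatureUnion([
    ("scaled", StandardScaler()),
    ("pca", scaled_pca),
])
numeric_branch = Pipeline([
    ("log", Log1pTransformer()),
    ("features", numeric_features),
])

# ColumnTransformer applies different preprocessing per column group.
preprocess = ColumnTransformer([
    ("num", numeric_branch, ["income", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
pipe.fit(df, y)

# 2 scaled columns + 1 PCA component + 3 one-hot columns
n_features = pipe.named_steps["prep"].transform(df).shape[1]
print(n_features)
```

Caching pays off mainly when the cached steps are expensive and are refit many times with identical inputs, as in a grid search; for cheap steps like the ones here it is shown only to illustrate the parameter.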
Common issues
- Fitting preprocessing on the full dataset before splitting causes data leakage.
- Pipelines become hard to debug when too many custom steps are nested.
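- Feature names or column counts can differ between training and inference after one-hot encoding, typically when inference data contains categories not seen during training.
- Transformers and estimators that do not support missing values raise errors unless missing values are imputed first.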
Example
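from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Example data (assumption: the iris dataset stands in for your own X and y)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=0)),
])
pipe.fit(X_train, y_train)         # the scaler is fit on X_train only
print(pipe.score(X_test, y_test))  # mean accuracy on the held-out split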
This pipeline chains scaling and classification, ensuring that preprocessing parameters are learned only from training data.