How to use cross-validation properly
Category: AI & Machine Learning
Short answer
Cross-validation estimates how well a model generalizes by rotating training and validation folds, reducing dependence on a single train-test split.
Steps
- Choose k, typically 5 or 10, based on dataset size and computational budget: a larger k leaves more data for training in each fold but requires more model fits.
- Use stratified k-fold for classification to maintain class balance in each fold.
- For time-series, apply forward chaining or blocked cross-validation to respect temporal order.
- Fit preprocessing (scaling, imputation, feature selection) on the training portion of each fold only, then apply it to the validation portion, to prevent leakage.
- Aggregate metrics across folds and examine variance to assess stability.
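The steps above can be sketched with scikit-learn, which is an assumption here since the article does not name a library. The data is synthetic, and the model, scaler, and k = 5 are illustrative choices rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary classification data (illustrative only).
X, y = make_classification(
    n_samples=500, n_features=20, weights=[0.8, 0.2], random_state=0
)

# Wrapping the scaler in a pipeline means it is re-fit on each training fold,
# so validation rows never influence the scaling statistics.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")

# Report the spread across folds as well as the mean.
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# For ordered data, forward chaining only ever trains on rows that come
# before the validation rows. The rows of X stand in for time-ordered samples.
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    assert train_idx.max() < val_idx.min()
```

A large standard deviation relative to the gap between candidate models suggests the comparison is not yet conclusive.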
Tips
- Use repeated cross-validation (several runs with different random splits) when the dataset is small to reduce the variance of the performance estimate.
- Leave-one-out cross-validation is nearly unbiased but can have high variance, and it is computationally expensive on large datasets because it fits one model per sample.
- Keep all samples from the same group (for example, the same patient or user) in the same fold when the data is grouped or hierarchical.
- Compare models on identical folds so that score differences reflect the models rather than the splits.
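Two of these tips, keeping groups together and comparing models on the same folds, can be sketched together. This assumes scikit-learn; the grouped data, the group sizes, and the two models are hypothetical stand-ins.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n_groups, per_group = 30, 10  # e.g. 30 subjects with 10 rows each (hypothetical)
groups = np.repeat(np.arange(n_groups), per_group)
X = rng.normal(size=(n_groups * per_group, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=len(X)) > 0).astype(int)

# Materialize the splits once so every model is scored on identical folds.
splits = list(GroupKFold(n_splits=5).split(X, y, groups=groups))

# No group may appear on both the training and the validation side.
for train_idx, val_idx in splits:
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
results = {
    name: cross_val_score(model, X, y, cv=splits)
    for name, model in models.items()
}
for name, scores in results.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

For small ungrouped datasets, `RepeatedStratifiedKFold` from the same module can be passed as `cv` to average over several random partitions.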
Common issues
- Data leakage from fitting preprocessing on the entire dataset before cross-validation.
- Using standard shuffled k-fold on time-series data, which lets the model train on future observations and validate on past ones and so inflates scores.
- Ignoring fold-level variance and reporting only the mean score.
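- Tuning hyperparameters and reporting performance on the same cross-validation folds, which gives optimistic estimates; use nested cross-validation or a separate held-out test set for the final estimate.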
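The hyperparameter-tuning issue is commonly addressed with nested cross-validation. The sketch below assumes scikit-learn; the SVM, the grid of `C` values, and the fold counts are placeholders chosen for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (illustrative only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# make_pipeline names the SVC step "svc", hence the "svc__C" parameter key.
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1.0, 10.0]}

# Inner loop: selects C using only the rows of the current outer training fold.
search = GridSearchCV(pipeline, param_grid, cv=inner_cv)

# Outer loop: scores the whole "tune, then fit" procedure on held-out folds.
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print(f"nested accuracy: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```

The outer scores estimate how the tuning procedure generalizes; they do not describe one fixed hyperparameter setting, since each outer fold may select a different `C`.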
Example
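import torch
import torch.nn as nn

# Small fully connected classifier: 784 input features, 10 output classes.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.2),  # zeroes 20% of activations at random during training
    nn.Linear(256, 10),  # raw logits; CrossEntropyLoss applies log-softmax itself
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)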
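This snippet defines a simple neural network with dropout for regularization, a cross-entropy loss, and the Adam optimizer in PyTorch. To cross-validate it, create a fresh model and optimizer for every fold so that no weights or optimizer state carry over from one fold to the next.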
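A minimal sketch of how this network might be trained and scored per fold follows. It uses plain (unstratified) k-fold built from `torch.randperm`, random stand-in data, and a hypothetical `make_model` helper; the epoch count and full-batch training are simplifications for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Random stand-in data with the shapes the network expects (not a real dataset).
X = torch.randn(200, 784)
y = torch.randint(0, 10, (200,))


def make_model() -> nn.Module:
    """Build a fresh copy of the network (hypothetical helper)."""
    return nn.Sequential(
        nn.Linear(784, 256),
        nn.ReLU(),
        nn.Dropout(0.2),
        nn.Linear(256, 10),
    )


criterion = nn.CrossEntropyLoss()
k = 5
folds = torch.randperm(len(X)).chunk(k)  # k disjoint blocks of shuffled indices
fold_accuracies = []

for i in range(k):
    val_idx = folds[i]
    train_idx = torch.cat([f for j, f in enumerate(folds) if j != i])

    # New weights and optimizer state for every fold.
    model = make_model()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    model.train()  # enables dropout
    for _ in range(5):  # a few full-batch epochs, purely for illustration
        optimizer.zero_grad()
        loss = criterion(model(X[train_idx]), y[train_idx])
        loss.backward()
        optimizer.step()

    model.eval()  # disables dropout for evaluation
    with torch.no_grad():
        preds = model(X[val_idx]).argmax(dim=1)
    fold_accuracies.append((preds == y[val_idx]).float().mean().item())

mean_acc = sum(fold_accuracies) / len(fold_accuracies)
print(f"per-fold accuracy: {[round(a, 3) for a in fold_accuracies]}")
print(f"mean accuracy: {mean_acc:.3f}")
```

Because the labels here are random, accuracy should hover near chance; with real data, any normalization statistics would also need to be computed from `X[train_idx]` only.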