How to use cross-validation properly

Category: AI & Machine Learning

Short answer

Cross-validation estimates how well a model generalizes by rotating training and validation folds, reducing dependence on a single train-test split.
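
To make the fold rotation concrete, here is a minimal sketch in plain Python. The helper name k_fold_indices and its seed parameter are illustrative assumptions, not part of any library; in practice a library splitter such as scikit-learn's KFold does the same job.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        # Everything outside the current validation fold is training data.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, val_idx


splits = list(k_fold_indices(n_samples=10, k=5))
for fold, (train_idx, val_idx) in enumerate(splits):
    print(f"fold {fold}: train={len(train_idx)} val={sorted(val_idx)}")
```

Every sample lands in the validation set exactly once and in the training set k - 1 times, which is why the averaged score depends less on any single split.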

Steps

  1. Choose k between 5 and 10 based on dataset size and computational budget.
  2. Use stratified k-fold for classification to maintain class balance in each fold.
  3. For time-series, apply forward chaining or blocked cross-validation to respect temporal order.
  4. Fit preprocessing steps (scaling, imputation, feature selection) on the training portion of each fold only, then apply them to the validation portion, to prevent leakage.
  5. Aggregate metrics across folds and examine variance to assess stability.
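
Steps 2, 4 and 5 can be sketched together in plain Python. Everything here is an illustrative assumption: the synthetic 1-D data, the toy nearest-mean classifier, and the helper names stratified_folds, fit_scaler and nearest_mean_classifier. With scikit-learn, the usual equivalents are StratifiedKFold and a Pipeline passed to cross_val_score.

```python
import random
import statistics
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds, spreading each class evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    slot = 0
    for idx in by_class.values():
        rng.shuffle(idx)
        for i in idx:
            folds[slot % k].append(i)  # deal samples round-robin, class by class
            slot += 1
    return folds


def fit_scaler(values):
    """Learn mean and standard deviation from the given values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid division by zero
    return mean, std


def nearest_mean_classifier(train_x, train_y):
    """Toy 1-D model: predict the class whose training mean is closest."""
    per_class = defaultdict(list)
    for x, label in zip(train_x, train_y):
        per_class[label].append(x)
    centers = {label: statistics.fmean(xs) for label, xs in per_class.items()}
    return lambda x: min(centers, key=lambda label: abs(x - centers[label]))


# Synthetic two-class data, for illustration only.
rng = random.Random(42)
X = [rng.gauss(0.0, 1.0) for _ in range(30)] + [rng.gauss(3.0, 1.0) for _ in range(30)]
y = [0] * 30 + [1] * 30

folds = stratified_folds(y, k=5)
scores = []
for val_idx in folds:
    held_out = set(val_idx)
    train_idx = [i for i in range(len(X)) if i not in held_out]
    # Step 4: the scaler sees the training fold only.
    mean, std = fit_scaler([X[i] for i in train_idx])
    train_x = [(X[i] - mean) / std for i in train_idx]
    predict = nearest_mean_classifier(train_x, [y[i] for i in train_idx])
    # The validation fold is transformed with the training-fold statistics.
    correct = sum(predict((X[i] - mean) / std) == y[i] for i in val_idx)
    scores.append(correct / len(val_idx))

# Step 5: report the spread across folds, not just the mean.
print(f"accuracy: mean={statistics.fmean(scores):.3f} "
      f"stdev={statistics.stdev(scores):.3f}")
```

The key design point is that fit_scaler runs inside the loop: computing the mean and standard deviation once on all of X would let validation samples influence the preprocessing.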

Tips

  • Use repeated cross-validation (several runs with different shuffles) when the dataset is small to reduce the variance of the performance estimate.
  • Leave-one-out cross-validation is nearly unbiased, but its estimate can have high variance and it is computationally expensive on large datasets.
  • Keep all samples from the same group (for example, the same patient or user) in a single fold when the data has grouped or hierarchical structure.
  • Compare cross-validation scores across multiple models on the same folds.
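
The group tip can be sketched as a splitter that assigns whole groups to folds. The helper group_k_fold and the patient labels are hypothetical and exist only for illustration; scikit-learn ships a GroupKFold splitter for the same purpose.

```python
from collections import defaultdict


def group_k_fold(groups, k):
    """Yield (train_idx, val_idx) so that no group spans both sides."""
    members = defaultdict(list)
    for i, group in enumerate(groups):
        members[group].append(i)
    if k > len(members):
        raise ValueError("k cannot exceed the number of distinct groups")
    # Greedy balancing: place the largest groups first, each into the
    # currently smallest fold.
    folds = [[] for _ in range(k)]
    for group in sorted(members, key=lambda g: len(members[g]), reverse=True):
        min(folds, key=len).extend(members[group])
    for i, val_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, val_idx


# Hypothetical data: several measurements per patient.
patients = ["a", "a", "a", "b", "b", "c", "c", "c", "d", "e", "e", "f"]
splits = list(group_k_fold(patients, k=3))
for train_idx, val_idx in splits:
    print("validation patients:", sorted({patients[i] for i in val_idx}))
```

Materializing the splits into a list once, as done here, also makes it easy to evaluate several models on exactly the same folds, so score differences reflect the models and not the partitioning.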

Common issues

  • Data leakage from fitting preprocessing on the entire dataset before cross-validation.
  • Using standard shuffled k-fold on time-series data, which lets the model train on future observations and validate on past ones and so inflates scores.
  • Ignoring fold-level variance and reporting only the mean score.
  • Tuning hyperparameters and reporting performance on the same folds, which leads to optimistic estimates; use nested cross-validation or a separate held-out test set.
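
For the time-series issue, a forward-chaining split keeps every training index earlier than every validation index. This is a minimal sketch assuming samples are already sorted by time; the function name forward_chaining_splits and the gap parameter are illustrative choices. scikit-learn offers TimeSeriesSplit for the same pattern.

```python
def forward_chaining_splits(n_samples, n_splits, gap=0):
    """Yield (train_idx, val_idx) where training always precedes validation.

    The training window grows with each split. `gap` optionally drops the
    samples immediately before the validation block.
    """
    block = n_samples // (n_splits + 1)
    if block == 0 or gap >= block:
        raise ValueError("not enough samples for the requested splits and gap")
    for i in range(1, n_splits + 1):
        val_start = i * block
        # The last validation block absorbs any remainder.
        val_end = n_samples if i == n_splits else val_start + block
        train_end = val_start - gap
        yield list(range(train_end)), list(range(val_start, val_end))


ts_splits = list(forward_chaining_splits(n_samples=12, n_splits=3))
for train_idx, val_idx in ts_splits:
    print(f"train=0..{train_idx[-1]}  val={val_idx[0]}..{val_idx[-1]}")
```

Unlike shuffled k-fold, early samples are never used for validation and the first split trains on the least data, so fold scores are not directly interchangeable; looking at them individually is usually more informative than a single average.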

Example

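import torch
import torch.nn as nn


def build_model():
    # Return a fresh, untrained network. Call this once per fold so that
    # weights learned on one fold never carry over to the next.
    return nn.Sequential(
        nn.Linear(784, 256),  # 784 input features -> 256 hidden units
        nn.ReLU(),
        nn.Dropout(0.2),      # dropout for regularization
        nn.Linear(256, 10),   # 10 output classes
    )


model = build_model()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

This snippet defines the pieces that are trained inside each fold: a simple PyTorch network with dropout for regularization, a cross-entropy loss, and the Adam optimizer. It does not perform cross-validation by itself. In a cross-validation loop, call build_model() and create a new optimizer at the start of every fold so that no learned weights or optimizer state carry over between folds.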