How to use cross-validation properly

QA Hub Editorial · Accepted Answer

Short answer

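Cross-validation estimates how well a model generalizes by splitting the data into k folds, training on k-1 of them, validating on the remaining one, and rotating until every fold has served as the validation set once. Averaging over the folds makes the estimate less dependent on any single train-test split.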
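
The rotation can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name `k_fold_indices` is made up for this answer, and it assumes the samples are already in random order (shuffle first if they are not).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        # Everything outside the validation block is used for training.
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


kfold_splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in kfold_splits:
    print("train:", train_idx, "validate:", val_idx)
```

Each sample lands in exactly one validation fold, so every observation is used for both training and validation across the k rounds.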

Steps

Choose k between 5 and 10 based on dataset size and computational budget.
Use stratified k-fold for classification to maintain class balance in each fold.
For time-series, apply forward chaining or blocked cross-validation to respect temporal order.
Fit preprocessing pipelines independently within each fold to prevent leakage.
Aggregate metrics across folds and examine variance to assess stability.
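
The steps for a classification problem can be put together roughly as follows. This is a sketch under stated assumptions: the data is synthetic, the nearest-class-mean classifier is a toy stand-in for a real model, and all function names are invented for illustration. The parts that matter are the stratified fold assignment, fitting the scaler on the training fold only, and reporting the spread alongside the mean.

```python
import random
import statistics
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds so each fold has a similar class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def fit_centroids(xs, ys):
    """Toy model: remember the mean feature value of each class."""
    by_class = defaultdict(list)
    for x, y in zip(xs, ys):
        by_class[y].append(x)
    return {label: statistics.fmean(vals) for label, vals in by_class.items()}


def predict(centroids, x):
    """Predict the class whose training mean is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def cross_validate(xs, ys, k=5, seed=0):
    folds = stratified_folds(ys, k, seed)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        # Fit the scaler on the training fold only, then reuse it for validation.
        train_raw = [xs[idx] for idx in train_idx]
        mean = statistics.fmean(train_raw)
        std = statistics.pstdev(train_raw) or 1.0
        train_x = [(x - mean) / std for x in train_raw]
        centroids = fit_centroids(train_x, [ys[idx] for idx in train_idx])
        correct = 0
        for idx in val_idx:
            correct += predict(centroids, (xs[idx] - mean) / std) == ys[idx]
        scores.append(correct / len(val_idx))
    return scores


# Synthetic one-feature data with two balanced classes (illustrative only).
data_rng = random.Random(42)
ys = [0] * 50 + [1] * 50
xs = [data_rng.gauss(0.0, 1.0) if y == 0 else data_rng.gauss(3.0, 1.0) for y in ys]

scores = cross_validate(xs, ys, k=5)
print("accuracy per fold:", scores)
print(f"mean={statistics.mean(scores):.3f} std={statistics.stdev(scores):.3f}")
```

In practice a library such as scikit-learn provides equivalents (for example `StratifiedKFold` and `Pipeline`), which are preferable to hand-rolled splitting.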

Tips

Use repeated cross-validation (several rounds with different random splits) when the dataset is small, to reduce the variance of the performance estimate.
Leave-one-out cross-validation is nearly unbiased, but its estimate can have high variance, and it requires one model fit per sample, which makes it expensive for large datasets.
Keep all samples from the same group (for example, the same patient or user) in a single fold when the data has grouped or hierarchical structure.
Compare cross-validation scores across multiple models on the same folds.
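
Keeping groups together might look like the sketch below. The function `group_k_fold` and the group labels are hypothetical, and the greedy balancing is just one simple choice; scikit-learn's `GroupKFold` covers the same need.

```python
from collections import defaultdict


def group_k_fold(groups, k):
    """Yield (train_idx, val_idx) pairs in which no group appears on both sides."""
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)
    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest samples.
    folds = [[] for _ in range(k)]
    for group in sorted(members, key=lambda g: len(members[g]), reverse=True):
        min(folds, key=len).extend(members[group])
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train_idx), sorted(val_idx)


# Ten samples belonging to six made-up groups.
groups = ["a", "a", "b", "b", "b", "c", "d", "d", "e", "f"]
group_splits = list(group_k_fold(groups, k=3))
for train_idx, val_idx in group_splits:
    print("validate on groups:", sorted({groups[i] for i in val_idx}))
```

Reusing the same list of splits for every candidate model also ensures that score differences come from the models rather than from different partitions.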

Common issues

Data leakage from fitting preprocessing on the entire dataset before cross-validation.
Using standard shuffled k-fold on time-series data, which lets the model train on observations that come after the validation period and inflates the scores.
Ignoring fold-level variance and reporting only the mean score.
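Tuning hyperparameters on the same folds that are used to report the final score, which gives optimistic estimates. Use nested cross-validation or a separate held-out test set for the final number.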
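
For the time-series issue, a forward-chaining split avoids training on the future. The sketch below is one simple variant with an expanding training window. The function name is invented here, and it assumes the samples are already sorted by time.

```python
def forward_chaining_splits(n_samples, n_splits):
    """Yield (train_idx, val_idx) where validation always comes after training."""
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for i in range(1, n_splits + 1):
        train_end = i * fold_size
        # The last validation block absorbs any leftover samples.
        val_end = n_samples if i == n_splits else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, val_end))


ts_splits = list(forward_chaining_splits(n_samples=10, n_splits=4))
for train_idx, val_idx in ts_splits:
    print("train:", train_idx, "validate:", val_idx)
# First split trains on [0, 1] and validates on [2, 3];
# the last trains on [0..7] and validates on [8, 9].
```

scikit-learn's `TimeSeriesSplit` implements the same idea and can be used instead of a hand-written generator.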

Example

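import torch
import torch.nn as nn

# Two-layer classifier: 784 input features -> 256 hidden units -> 10 class logits
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.2),  # zero 20% of hidden activations during training
    nn.Linear(256, 10)
)
# CrossEntropyLoss expects raw logits, so the model has no final softmax
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)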

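This snippet defines a simple PyTorch classifier with dropout for regularization, a cross-entropy loss, and the Adam optimizer. It does not perform cross-validation itself: to cross-validate it, re-create the model and optimizer at the start of every fold so that no learned weights carry over between folds.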
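
To cross-validate a model like this one, the loop needs to build a fresh model for each fold. The sketch below shows that pattern without depending on PyTorch. `run_cv`, `build_model`, and `train_and_score` are hypothetical names, and the toy callables only stand in for real construction and training code. With the snippet above, `build_model` would construct a new `nn.Sequential` and its optimizer.

```python
def run_cv(n_samples, k, build_model, train_and_score):
    """Train a freshly built model on each fold and return the per-fold scores."""
    indices = list(range(n_samples))
    scores = []
    for fold in range(k):
        # Interleaved folds: every k-th sample goes to this fold
        # (assumes the samples are in random order).
        val_idx = indices[fold::k]
        val_set = set(val_idx)
        train_idx = [i for i in indices if i not in val_set]
        model = build_model()  # new weights and optimizer state for this fold
        scores.append(train_and_score(model, train_idx, val_idx))
    return scores


# Toy stand-ins so the sketch runs end to end.
built_models = []


def build_model():
    model = {"trained_on": None}
    built_models.append(model)
    return model


def train_and_score(model, train_idx, val_idx):
    model["trained_on"] = len(train_idx)
    # Placeholder "score": the fraction of the data held out for validation.
    return len(val_idx) / (len(train_idx) + len(val_idx))


fold_scores = run_cv(n_samples=100, k=5, build_model=build_model,
                     train_and_score=train_and_score)
print(fold_scores)
```

Passing a factory rather than a model instance makes it hard to reuse trained weights across folds by accident.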
