How to split data for training and testing

· Category: AI & Machine Learning

Short answer

Hold out data that the model never sees during training or tuning. Performance on that held-out data then approximates performance on future, real-world observations.

Steps

  1. Reserve 70-80 percent of data for training, 10-15 percent for validation, and 10-15 percent for testing.
  2. Use stratified splitting for classification to preserve class proportions across all sets.
  3. For time-series data, split chronologically to avoid data leakage from future observations.
  4. Set a random seed for reproducibility when using randomized splitting algorithms.
  5. Lock the test set and use it only once, after final model selection.
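
The steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive recipe: the helper names `stratified_three_way_split` and `chronological_split`, the 70/15/15 fractions, the seed value, and the synthetic data are all assumptions made for the example. It relies on scikit-learn's `train_test_split` with its `stratify` and `random_state` parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42  # fixed seed so the randomized split is reproducible


def stratified_three_way_split(X, y, val_frac=0.15, test_frac=0.15, seed=SEED):
    """Split into train/validation/test while preserving class proportions."""
    # Carve off the test set first so it can be locked away.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_frac, stratify=y, random_state=seed
    )
    # Split the remainder; rescale so val_frac refers to the full dataset.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=val_frac / (1.0 - test_frac),
        stratify=y_rest, random_state=seed,
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


def chronological_split(X, y, val_frac=0.15, test_frac=0.15):
    """Split rows that are already sorted oldest-first, without shuffling."""
    n = len(X)
    test_start = n - int(round(n * test_frac))
    val_start = test_start - int(round(n * val_frac))
    return (
        (X[:val_start], y[:val_start]),
        (X[val_start:test_start], y[val_start:test_start]),
        (X[test_start:], y[test_start:]),
    )


# Synthetic, imbalanced data (assumption: 80 percent class 0, 20 percent class 1).
rng = np.random.default_rng(SEED)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 800 + [1] * 200)

train, val, test = stratified_three_way_split(X, y)
print("random split sizes:", len(train[0]), len(val[0]), len(test[0]))

# For time-ordered data, the row index stands in for a timestamp here.
timestamps = np.arange(1000)
t_train, t_val, t_test = chronological_split(timestamps, y)
print("chronological split sizes:", len(t_train[0]), len(t_val[0]), len(t_test[0]))
```

In the chronological variant every validation row is newer than every training row, and every test row is newer than every validation row, so no future observation can leak backwards.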

Tips

  • Shuffle data before splitting when temporal order is irrelevant.
  • Consider k-fold cross-validation when the dataset is small, and nested cross-validation if hyperparameters are tuned as well.
  • Document the splitting strategy to ensure reproducibility.
  • Use group-based splits when multiple rows belong to the same entity.
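
A group-based split can be sketched with scikit-learn's `GroupShuffleSplit`. The data below is hypothetical: 40 entities (for example, patients or users) with 5 rows each. The point is only to show that every entity lands entirely in one set.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows belonging to 40 entities, 5 rows per entity.
groups = np.repeat(np.arange(40), 5)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)

# test_size applies to the number of groups, not the number of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

train_groups = set(groups[train_idx])
test_groups = set(groups[test_idx])

# No entity should appear on both sides of the split.
print("shared entities:", train_groups & test_groups)
print("entities in train/test:", len(train_groups), len(test_groups))
```

Splitting rows at random instead would let rows from the same entity appear in both sets, which tends to inflate test scores.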

Common issues

  • Data leakage occurs when preprocessing is fit on the entire dataset before splitting.
  • Small test sets lead to high variance in performance estimates.
  • Temporal leakage happens when future information appears in the training set.
  • Repeatedly tuning on the test set effectively turns it into a validation set, so its score no longer gives an unbiased estimate of performance.
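
One way to avoid the preprocessing leakage described above is to split first and wrap the preprocessing in a scikit-learn `Pipeline`, so that it is fit on the training set only. The sketch below assumes synthetic data, a `StandardScaler`, and a `LogisticRegression` model purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (assumption): the label depends on the first feature plus noise.
X = rng.normal(loc=5.0, scale=3.0, size=(500, 4))
y = (X[:, 0] + rng.normal(size=500) > 5.0).astype(int)

# Split BEFORE any preprocessing is fit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The pipeline fits the scaler on X_train only, then reuses those
# training-set statistics when transforming X_test.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

scaler = model.named_steps["standardscaler"]
uses_train_stats = np.allclose(scaler.mean_, X_train.mean(axis=0))
test_accuracy = model.score(X_test, y_test)

print("scaler fit on training data only:", uses_train_stats)
print("test accuracy:", test_accuracy)
```

The same pipeline object can be passed to cross-validation utilities, in which case the scaler is refit inside each training fold.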

Example

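from sklearn.metrics import classification_report

# Assumes `model` was fit on the training set only, and that X_test / y_test
# are the held-out test split that played no part in training or tuning.
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))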

This example prints per-class precision, recall, F1-score, and support for the held-out test set, along with overall accuracy and averaged scores.