How to split data for training and testing

Question

QA Hub Editorial · Accepted Answer

Short answer

Hold out data that the model never sees during training or tuning, so that evaluation on it approximates performance on future, real-world observations.

Reserve 70-80 percent of data for training, 10-15 percent for validation, and 10-15 percent for testing.
Use stratified splitting for classification to preserve class proportions across all sets.
For time-series data, split chronologically to avoid data leakage from future observations.
Set a random seed so that a randomized split is reproducible.
Lock the test set and use it only once, after final model selection.
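
The splitting advice above can be sketched with scikit-learn's train_test_split. This is a minimal sketch under stated assumptions, not the only way to do it: the helper names stratified_three_way_split and chronological_split, the 70/15/15 proportions, the seed 42, and the synthetic data are all made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def stratified_three_way_split(X, y, val_size=0.15, test_size=0.15, seed=42):
    """Hypothetical helper: train/validation/test split that preserves class ratios."""
    # Carve off the test set first; stratify=y keeps class proportions intact.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # val_size is a fraction of the full dataset, so rescale it relative
    # to what remains after the test set has been removed.
    val_fraction = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=val_fraction, stratify=y_rest, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


def chronological_split(timestamps, val_size=0.15, test_size=0.15):
    """Hypothetical helper: index arrays ordered so that train < val < test in time."""
    order = np.argsort(timestamps, kind="stable")
    n = len(order)
    n_test = int(round(n * test_size))
    n_val = int(round(n * val_size))
    train_idx = order[: n - n_val - n_test]
    val_idx = order[n - n_val - n_test : n - n_test]
    test_idx = order[n - n_test :]
    return train_idx, val_idx, test_idx


# Synthetic, imbalanced data (80% class 0, 20% class 1) for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 800 + [1] * 200)

X_train, X_val, X_test, y_train, y_val, y_test = stratified_three_way_split(X, y)
print("sizes:", len(X_train), len(X_val), len(X_test))
print("class-1 share:", y_train.mean(), y_val.mean(), y_test.mean())

# For time-ordered data, split by timestamp instead of shuffling.
timestamps = rng.permutation(1000)
train_idx, val_idx, test_idx = chronological_split(timestamps)
print("latest training timestamp:", timestamps[train_idx].max())
print("earliest test timestamp:", timestamps[test_idx].min())
```

The stratified helper calls train_test_split twice because the function produces only two partitions per call. The chronological helper returns index arrays rather than data so the same indices can be applied to features, labels, and any metadata.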

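Data leakage occurs when preprocessing (for example, scaling or imputation) is fit on the entire dataset before splitting, because statistics from the test rows then influence training.
Small test sets lead to high variance in performance estimates.
Temporal leakage happens when information from the future appears in the training set.
Repeatedly tuning on the test set effectively turns it into a validation set and leaves no unbiased estimate of final performance.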
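
Two of these pitfalls can be illustrated in a few lines. The sketch below assumes synthetic data from make_classification, a logistic regression model, and a hypothetical helper named accuracy_standard_error; none of these come from the original answer. It splits first, fits the scaler inside a pipeline so that only training rows determine the scaling statistics, and then uses the binomial standard error to show how test-set size limits the precision of an accuracy estimate.

```python
import math

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Split first, so nothing computed from the test rows can reach training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The pipeline fits the scaler on X_train only; when scoring, the test rows
# are transformed with the mean and variance learned from the training rows.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)


def accuracy_standard_error(acc, n_test):
    """Hypothetical helper: binomial standard error of an accuracy estimate."""
    return math.sqrt(acc * (1.0 - acc) / n_test)


print(f"test accuracy: {accuracy:.3f}")
print(f"approx. standard error: {accuracy_standard_error(accuracy, len(y_test)):.3f}")
```

Under the binomial approximation, an accuracy of 0.90 measured on 100 test samples has a standard error of about 0.03, so differences of one or two points between models are within noise at that test-set size.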

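from sklearn.metrics import classification_report

# Assumes `model` is an already fitted classifier and that
# X_test, y_test are the held-out test split.
y_pred = model.predict(X_test)
# Print per-class precision, recall, F1 score, and support.
print(classification_report(y_test, y_pred))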

This example prints a classification report for the held-out test set, showing per-class precision, recall, F1 score, and support rather than a single accuracy figure.