How to split data for training and testing
Category: AI & Machine Learning
Short answer
Split the data so that the model is evaluated only on examples it never saw during training or tuning. The held-out set stands in for future observations, so the resulting score is a more honest estimate of real-world performance.
Steps
- Reserve 70-80 percent of data for training, 10-15 percent for validation, and 10-15 percent for testing.
- Use stratified splitting for classification to preserve class proportions across all sets.
- For time-series data, split chronologically to avoid data leakage from future observations.
- Set a random seed for reproducibility when using randomized splitting algorithms.
- Lock the test set and only use it once after final model selection.
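The steps above can be sketched as follows. This is a minimal illustration, not a definitive recipe: it assumes scikit-learn and NumPy are installed, uses synthetic data, and picks a 70/15/15 ratio from the ranges listed above. The names SEED and chronological_split are hypothetical helpers introduced only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42  # fixed seed so the randomized split is reproducible

# Synthetic, illustrative data: 1000 rows, 5 features, imbalanced binary labels.
rng = np.random.default_rng(SEED)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 800 + [1] * 200)

# Use absolute row counts (15% each) to avoid float rounding surprises.
n_test = round(0.15 * len(X))
n_val = round(0.15 * len(X))

# Carve off the test set first, stratified on the labels.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=n_test, stratify=y, random_state=SEED
)

# Split what remains into training and validation, again stratified.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_val, stratify=y_rest, random_state=SEED
)


def chronological_split(n_rows, train_frac=0.7, val_frac=0.15):
    """Return index arrays for rows already sorted by time (no shuffling)."""
    train_end = round(n_rows * train_frac)
    val_end = round(n_rows * (train_frac + val_frac))
    idx = np.arange(n_rows)
    return idx[:train_end], idx[train_end:val_end], idx[val_end:]


# For time-series data, earlier rows train the model and later rows test it.
train_idx, val_idx, test_idx = chronological_split(len(X))
```

Splitting off the test set first makes it easy to set those rows aside and leave them untouched until final model selection is done.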
Tips
- Shuffle data before splitting when temporal order is irrelevant.
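- Consider k-fold cross-validation when the dataset is too small for a reliable held-out split, and nested cross-validation if hyperparameters are tuned as well.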
- Document the splitting strategy to ensure reproducibility.
- Use group-based splits when multiple rows belong to the same entity.
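A group-based split might look like the sketch below. It assumes scikit-learn's GroupShuffleSplit and a small made-up dataset in which each row carries an entity ID; the groups array and the "patients" reading are illustrative assumptions, not part of the original guidance.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Illustrative data: 12 rows from 4 hypothetical entities (e.g. patients).
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
X = np.arange(len(groups)).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 25% of the groups (not 25% of the rows) for testing.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# Every entity ends up entirely in one set or the other.
train_groups = set(groups[train_idx])
test_groups = set(groups[test_idx])
```

Because whole entities are assigned to one side, the model cannot score well on the test set simply by recognizing an entity it already saw during training.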
Common issues
- Data leakage occurs when preprocessing is fit on the entire dataset before splitting.
- Small test sets lead to high variance in performance estimates.
- Temporal leakage happens when future information appears in the training set.
- Repeatedly tuning on the test set effectively turns it into a validation set, leaving no unbiased estimate of performance on unseen data.
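One common way to avoid the preprocessing leakage described above is to split first and fit the preprocessing inside a pipeline, so it only sees training rows. The sketch below assumes scikit-learn, synthetic data, and an arbitrary choice of StandardScaler plus LogisticRegression; any other transformer and estimator could be substituted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, illustrative data with a non-zero mean so scaling matters.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = (X[:, 0] > 5.0).astype(int)

# Split BEFORE any preprocessing is fit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The scaler is fit inside the pipeline, so it only ever sees X_train.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# The learned scaling statistics come from the training rows alone.
scaler = model.named_steps["standardscaler"]
test_accuracy = model.score(X_test, y_test)
```

At prediction time the pipeline reuses the training-set statistics to transform the test rows, which mirrors how the model would treat genuinely new data.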
Example
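from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
# Assumes X, y, and an unfitted scikit-learn classifier `model` are already defined.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
y_pred = model.fit(X_train, y_train).predict(X_test)  # fit on training rows only
print(classification_report(y_test, y_pred))
This example holds out 20 percent of the data with a stratified, seeded split, fits the model on the remaining rows, and prints per-class precision, recall, and F1 for the held-out set.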