How to create a custom dataset in PyTorch
Category: AI & Machine Learning
Short answer
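To create a custom dataset in PyTorch, subclass torch.utils.data.Dataset, implement __len__ and __getitem__, and wrap the dataset in a DataLoader. This lets you integrate proprietary data formats and apply on-the-fly preprocessing, while the DataLoader's collate_fn controls how samples are combined into batches.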
Steps
- Subclass torch.utils.data.Dataset and implement __len__ and __getitem__.
- Load file paths or metadata once in __init__ so that __getitem__ does not rescan the disk on every access.
- Apply transformations inside __getitem__ so that data augmentation runs on the fly, per sample.
- Instantiate a DataLoader with batch_size, shuffle, num_workers, and collate_fn.
- Iterate over the DataLoader in your training loop.
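The steps above can be sketched as follows. This is a minimal illustration rather than a reference implementation: ToyRegressionDataset, add_noise, and the synthetic in-memory tensors are hypothetical stand-ins for real data, and num_workers=0 keeps loading in the main process for simplicity.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyRegressionDataset(Dataset):
    """Hypothetical dataset holding synthetic (feature, target) pairs in memory."""

    def __init__(self, num_samples=100, num_features=4, transform=None):
        # A real dataset would load file paths or metadata here, once.
        generator = torch.Generator().manual_seed(0)
        self.features = torch.randn(num_samples, num_features, generator=generator)
        self.targets = self.features.sum(dim=1, keepdim=True)
        self.transform = transform

    def __len__(self):
        # Number of samples the DataLoader can index.
        return self.features.shape[0]

    def __getitem__(self, index):
        # Fetch one sample and apply the optional per-sample transform.
        x = self.features[index]
        if self.transform is not None:
            x = self.transform(x)
        return x, self.targets[index]


def add_noise(x):
    # Hypothetical augmentation: small Gaussian noise applied on the fly.
    return x + 0.01 * torch.randn_like(x)


dataset = ToyRegressionDataset(transform=add_noise)
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0)

# Iterate as a training loop would; here we only record the batch shapes.
batch_shapes = [(x.shape, y.shape) for x, y in loader]
```

With 100 samples and batch_size=16, the loader yields seven batches, the last one holding the remaining four samples; passing drop_last=True to the DataLoader would discard that partial batch.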
Tips
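- If worker processes hit CUDA errors on Linux, where fork has traditionally been the default start method, switch to spawn or forkserver. macOS (Python 3.8+) and Windows already default to spawn.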
- Implement a custom collate_fn for variable-length sequences like text or time series.
- Set pin_memory=True on the DataLoader and pass non_blocking=True to .to(device) so that host-to-GPU copies can overlap with computation.
- Cache small datasets in RAM to eliminate disk I/O bottlenecks.
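As one possible way to handle the variable-length case mentioned in the tips, the sketch below pads each batch to its longest sequence with torch.nn.utils.rnn.pad_sequence and returns the original lengths alongside. VariableLengthDataset and pad_collate are hypothetical names, and zero is assumed to be an acceptable padding value.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader


class VariableLengthDataset(Dataset):
    """Hypothetical dataset of 1-D integer sequences with different lengths."""

    def __init__(self, lengths):
        # Sequence of length n is [1, 2, ..., n]; real data would come from disk.
        self.sequences = [torch.arange(1, n + 1) for n in lengths]

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, index):
        return self.sequences[index]


def pad_collate(batch):
    # batch is a list of 1-D tensors; pad them to the longest one in the batch.
    lengths = torch.tensor([len(seq) for seq in batch])
    padded = pad_sequence(batch, batch_first=True, padding_value=0)
    return padded, lengths


seq_loader = DataLoader(
    VariableLengthDataset([3, 5, 2, 4]),
    batch_size=2,
    shuffle=False,
    collate_fn=pad_collate,
)
padded_batches = list(seq_loader)
```

Returning the lengths lets downstream code mask the padded positions or pack the sequences before feeding them to a recurrent layer.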
Common issues
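- Deadlocks or CUDA initialization errors when worker processes are forked after CUDA has already been initialized in the parent process.
- Slow training because num_workers is set too low (the GPU waits for data) or too high (CPU oversubscription and extra memory overhead).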
- Inconsistent tensor shapes when batching variable-size inputs without a collate function.
- Growing memory use from accumulating tensors inside the dataset, for example by appending to a list on every __getitem__ call.
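One common way to sidestep the CUDA-related worker problems above is to keep the dataset CPU-only and move each batch to the device inside the training loop. The sketch below assumes that pattern; make_loader is a hypothetical helper, TensorDataset stands in for a custom dataset, and the loop falls back to the CPU when no GPU is available.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, batch_size=8, num_workers=0):
    # Hypothetical helper: pin memory only when a GPU is present, and request
    # the spawn start method only when worker processes are actually used
    # (DataLoader rejects multiprocessing_context when num_workers == 0).
    use_cuda = torch.cuda.is_available()
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        pin_memory=use_cuda,
        multiprocessing_context="spawn" if num_workers > 0 else None,
    )


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The dataset holds CPU tensors only; nothing here touches CUDA.
cpu_dataset = TensorDataset(torch.zeros(20, 3), torch.ones(20))

seen = 0
for inputs, labels in make_loader(cpu_dataset):
    # Transfer per batch; non_blocking only has an effect with pinned memory.
    inputs = inputs.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    seen += inputs.shape[0]
```

When num_workers is greater than zero with the spawn start method, the script's entry point should sit under an `if __name__ == "__main__":` guard and the dataset must be picklable.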
Example
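import torch
from sklearn.metrics import classification_report
model.eval()  # model and test_loader (a DataLoader over the custom dataset) are assumed to exist, on the CPU
with torch.no_grad():
    y_true, y_pred = zip(*[(y, model(x).argmax(dim=1)) for x, y in test_loader])
print(classification_report(torch.cat(y_true).numpy(), torch.cat(y_pred).numpy()))
This example evaluates a trained classifier on batches drawn from a DataLoader over the custom dataset and prints a classification report with per-class precision, recall, and F1 score.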