How to create a custom dataset in PyTorch

Category: AI & Machine Learning

Short answer

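To create a custom dataset in PyTorch, subclass torch.utils.data.Dataset, implement __len__ and __getitem__, and pass an instance to a DataLoader. This lets you integrate proprietary data formats, apply on-the-fly preprocessing, and control batch collation logic.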

Steps

  1. Subclass torch.utils.data.Dataset and implement __len__ and __getitem__.
  2. Load file paths or metadata in __init__ to avoid redundant disk access.
  3. Apply transformations inside __getitem__ so that augmentation runs each time a sample is fetched.
  4. Instantiate a DataLoader with batch_size, shuffle, num_workers, and collate_fn.
  5. Iterate over the DataLoader in your training loop.
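
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the class name InMemoryPairsDataset, the synthetic tensors, and the transform argument are hypothetical stand-ins for your own data source. A file-backed dataset would typically store paths in __init__ and read from disk in __getitem__.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class InMemoryPairsDataset(Dataset):
    """Hypothetical dataset wrapping (feature, label) pairs held in memory."""

    def __init__(self, features, labels, transform=None):
        # Step 2: keep references or metadata here; defer per-sample work.
        assert len(features) == len(labels)
        self.features = features
        self.labels = labels
        self.transform = transform

    def __len__(self):
        # Step 1: number of samples in the dataset.
        return len(self.features)

    def __getitem__(self, idx):
        # Step 3: fetch one sample and apply the optional transform.
        x = self.features[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x, self.labels[idx]


# Synthetic data, for illustration only: 10 samples with 4 features each.
features = torch.arange(40, dtype=torch.float32).reshape(10, 4)
labels = torch.arange(10) % 2

dataset = InMemoryPairsDataset(features, labels, transform=lambda x: x / 40.0)

# Step 4: num_workers=0 loads in the main process, which keeps the sketch portable.
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0)

# Step 5: iterate as a training loop would.
batch_shapes = [(xb.shape, yb.shape) for xb, yb in loader]
```

With 10 samples and batch_size=4, the loader yields three batches, the last one smaller than the others; pass drop_last=True to the DataLoader if partial batches are unwanted.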

Tips

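  • If worker processes touch CUDA, use the spawn (or forkserver) multiprocessing start method. macOS and Windows already default to spawn; fork, the long-standing default on Linux, cannot safely reuse an initialized CUDA context.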
  • Implement a custom collate_fn for variable-length sequences like text or time series.
  • Set pin_memory=True on the DataLoader and move batches with .to(device, non_blocking=True) to speed up host-to-GPU transfers.
  • Cache small datasets in RAM to eliminate disk I/O bottlenecks.
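
Two of the tips above, a custom collate_fn and pinned, non-blocking transfers, might look like the sketch below. The function name pad_collate and the toy samples list are assumptions made for illustration; padding with 0 is one common choice, not the only one.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def pad_collate(batch):
    """Pad variable-length 1-D sequences in a batch to a common length."""
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in sequences])
    # batch_first=True gives shape (batch, max_len); padding uses 0.
    padded = pad_sequence(sequences, batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)


# A plain list of (sequence, label) pairs works as a map-style dataset
# because it supports len() and integer indexing.
samples = [
    (torch.tensor([1, 2, 3]), 0),
    (torch.tensor([4, 5]), 1),
    (torch.tensor([6]), 0),
    (torch.tensor([7, 8, 9, 10]), 1),
]

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

loader = DataLoader(
    samples,
    batch_size=2,
    shuffle=False,
    collate_fn=pad_collate,
    pin_memory=use_cuda,  # pinning only helps host-to-GPU copies
)

batches = []
for padded, lengths, labels in loader:
    # non_blocking=True can overlap the copy with compute when the source is pinned.
    padded = padded.to(device, non_blocking=True)
    batches.append((padded.cpu(), lengths, labels))
```

Returning the original lengths alongside the padded tensor lets downstream code mask out padding or pack the sequences.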

Common issues

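  • Deadlocks or errors when DataLoader worker processes inherit or use a CUDA context created in the parent process.
  • Slow training because num_workers is set too low (the GPU waits for data) or too high (process startup and memory overhead outweigh the parallelism).
  • Shape-mismatch errors when batching variable-size inputs with the default collate function instead of a custom collate_fn.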
  • Memory leaks from accumulating tensors inside the dataset.
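
One way to check whether num_workers is the bottleneck is to time a full pass over the data for a few candidate values. The helper below is a rough sketch: the name time_one_epoch and the synthetic TensorDataset are assumptions for illustration, and real timings depend heavily on storage, transforms, and hardware.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_one_epoch(dataset, num_workers, batch_size=32):
    """Return (seconds, batches) for one full pass over the dataset."""
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    start = time.perf_counter()
    n_batches = sum(1 for _ in loader)
    return time.perf_counter() - start, n_batches


# Synthetic stand-in data; substitute your own Dataset.
dataset = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))

# num_workers=0 is the in-process baseline. To compare values such as 2 or 4,
# call the function from inside an `if __name__ == "__main__":` block,
# since worker processes may re-import the script on spawn-based platforms.
elapsed, n_batches = time_one_epoch(dataset, num_workers=0)
print(f"num_workers=0: {n_batches} batches in {elapsed:.4f}s")
```

For a dataset this small, the in-process baseline will usually win; worker processes tend to pay off only when __getitem__ does real I/O or preprocessing.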

Example

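from sklearn.metrics import classification_report

# Assumes model exposes a scikit-learn-style predict() method and that
# X_test and y_test are held-out features and labels. A plain
# torch.nn.Module has no predict(); collect its outputs over a DataLoader instead.
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

This example prints a per-class classification report (precision, recall, F1-score, and support), showing how to evaluate a trained model once the data pipeline is in place.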