How to version datasets for reproducibility

· Category: Data Science

Short answer

Dataset versioning tracks changes to training and evaluation data, enabling reproducibility, auditability, and rollback capability.

Steps

  1. Store large data files in remote storage such as S3, GCS, or Azure Blob.
  2. Use DVC to track dataset versions with lightweight metadata files committed to Git.
  3. Tag dataset versions that correspond to specific model releases.
  4. Document data collection and preprocessing steps in a README or data card.
  5. Automate version bumps in CI/CD when source data or transformation code changes.

Tips

  • Use DVC pipelines to link data versions to specific code and parameter states.
  • Share remote storage credentials securely through environment variables.
  • Keep a changelog summarizing additions, deletions, and corrections per version.
  • Compare dataset statistics across versions to detect unexpected drift.

Common issues

  • Forgetting to push DVC metadata and cache to remote storage.
  • Merge conflicts in DVC lock files when multiple contributors update data.
  • Storage costs growing unbounded without retention policies.
  • Inability to reproduce old results due to missing intermediate data versions.

Example

import pandas as pd
import numpy as np

df = pd.DataFrame({'sales': [100, 150, 200, np.nan]})
df['sales'] = df['sales'].fillna(df['sales'].median())
print(df.describe())

This snippet creates a DataFrame, handles a missing value with the median, and prints summary statistics common in exploratory analysis.