How to version datasets for reproducibility
· Category: Data Science
Short answer
Dataset versioning tracks changes to training and evaluation data, enabling reproducibility, auditability, and rollback capability.
Steps
- Store large data files in remote storage such as S3, GCS, or Azure Blob.
- Use DVC to track dataset versions with lightweight metadata files committed to Git.
- Tag dataset versions that correspond to specific model releases.
- Document data collection and preprocessing steps in a README or data card.
- Automate version bumps in CI/CD when source data or transformation code changes.
Tips
- Use DVC pipelines to link data versions to specific code and parameter states.
- Share remote storage credentials securely through environment variables.
- Keep a changelog summarizing additions, deletions, and corrections per version.
- Compare dataset statistics across versions to detect unexpected drift.
Common issues
- Forgetting to push DVC metadata and cache to remote storage.
- Merge conflicts in DVC lock files when multiple contributors update data.
- Storage costs growing unbounded without retention policies.
- Inability to reproduce old results due to missing intermediate data versions.
Example
import pandas as pd
import numpy as np
df = pd.DataFrame({'sales': [100, 150, 200, np.nan]})
df['sales'] = df['sales'].fillna(df['sales'].median())
print(df.describe())
This snippet creates a DataFrame, handles a missing value with the median, and prints summary statistics common in exploratory analysis.