How to version datasets for reproducibility

Question

QA Hub Editorial · Accepted Answer

Short answer

Dataset versioning tracks changes to training and evaluation data, enabling reproducibility, auditability, and rollback capability.

Steps

Store large data files in remote storage such as S3, GCS, or Azure Blob.
Use DVC to track dataset versions with lightweight metadata files committed to Git.
Tag dataset versions that correspond to specific model releases.
Document data collection and preprocessing steps in a README or data card.
Automate version bumps in CI/CD when source data or transformation code changes.

Tips

Use DVC pipelines to link data versions to specific code and parameter states.
Share remote storage credentials securely through environment variables.
Keep a changelog summarizing additions, deletions, and corrections per version.
Compare dataset statistics across versions to detect unexpected drift.

Common issues

Forgetting to push DVC metadata and cache to remote storage.
Merge conflicts in DVC lock files when multiple contributors update data.
Storage costs growing unbounded without retention policies.
Inability to reproduce old results due to missing intermediate data versions.

Example

import pandas as pd
import numpy as np

df = pd.DataFrame({'sales': [100, 150, 200, np.nan]})
df['sales'] = df['sales'].fillna(df['sales'].median())
print(df.describe())

This snippet creates a DataFrame, handles a missing value with the median, and prints summary statistics common in exploratory analysis.

Short answer

Steps

Tips

Common issues

Example

Related Questions

What is the difference between precision and recall

What is feature engineering and why is it important

How to preprocess data for machine learning models

How to document a data pipeline

How to optimize slow data pipelines

How to incrementally load data