How to remove duplicate records from data
· Category: Data Science
Short answer
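Duplicate records distort aggregations, inflate counts, and bias statistical estimates, so deduplication is a critical preprocessing step. In pandas, decide which columns define a duplicate, then remove the extra rows with drop_duplicates.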
Steps
- Define what constitutes a duplicate based on all columns or a subset of key fields.
- Use pandas DataFrame.drop_duplicates, with subset to choose which columns are compared and keep ('first', 'last', or False) to choose which record is retained.
- Sort data before deduplication to ensure deterministic retention of the desired record.
- Validate the reduction in row count and inspect removed records for patterns.
- Implement upstream fixes to prevent duplicate ingestion in the future.
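The steps above can be sketched as follows. This is a minimal illustration, not a definitive pipeline: the column names (customer_id, email, updated_at) and the rule "keep the most recent record per key" are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical data: customer_id and email form the duplicate key,
# updated_at decides which copy to keep.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com", "c@x.com"],
    "updated_at": pd.to_datetime(
        ["2024-01-01", "2024-03-01", "2024-02-01", "2024-01-15", "2024-01-10"]
    ),
})

key_cols = ["customer_id", "email"]

# Sort so that, within each key, the most recent record comes last.
df_sorted = df.sort_values(key_cols + ["updated_at"])

# Capture the rows that are about to be removed so they can be inspected.
removed = df_sorted[df_sorted.duplicated(subset=key_cols, keep="last")]

# Keep the last (most recent) record for each key.
deduped = df_sorted.drop_duplicates(subset=key_cols, keep="last").reset_index(drop=True)

# Validate the reduction in row count.
print(f"rows before: {len(df)}, after: {len(deduped)}, removed: {len(removed)}")
print(removed)
```

Sorting first makes the choice of retained record explicit rather than dependent on input order, and keeping the removed rows in their own frame makes it easier to spot patterns that point to an upstream cause.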
Tips
- Use hashing for fast duplicate detection in very large datasets.
- Consider fuzzy deduplication for near-duplicate text records.
- Keep a log of deduplication rules for auditing and reproducibility.
- Compare before-and-after summary statistics to catch unexpected effects.
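As a rough sketch of the first two tips, the snippet below hashes each row with pd.util.hash_pandas_object to find exact duplicates, then uses the standard library's difflib to flag near-duplicate text. The sample data, the is_near_duplicate helper, and the 0.9 similarity threshold are illustrative assumptions; a real dataset would need its own threshold and, most likely, a dedicated fuzzy-matching approach.

```python
import difflib
import pandas as pd

# Hypothetical data with one exact duplicate and one near duplicate.
df = pd.DataFrame({
    "name": ["Acme Corp", "Acme Corp", "Acme Corp.", "Globex"],
    "city": ["Berlin", "Berlin", "Berlin", "Paris"],
})

# Exact duplicates: one 64-bit hash per row, with the index excluded.
# Hash collisions are unlikely but possible, so confirm matches on the
# real columns when correctness matters.
row_hash = pd.util.hash_pandas_object(df, index=False)
exact_dupes = row_hash.duplicated(keep="first")

def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Hypothetical helper: case-insensitive similarity ratio against a threshold."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Near duplicates: compare the remaining names pairwise.
# This is quadratic in the number of rows, so it only suits small inputs.
names = df.loc[~exact_dupes, "name"].tolist()
near_pairs = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if is_near_duplicate(a, b)
]

print(exact_dupes.tolist())
print(near_pairs)
```

Storing the row hash as a column can also help when duplicates must be detected across batches, since only the hashes need to be compared.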
Common issues
- Partial duplicates, where only some columns match, leaving the definition of a duplicate ambiguous.
- Case and whitespace differences that cause false negatives in duplicate detection.
- Accidentally dropping useful data when keep is set incorrectly.
- Performance bottlenecks when deduplicating datasets with millions of rows.
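The case, whitespace, and keep issues can be illustrated with a short sketch. The email and plan columns and the email_key helper column are hypothetical names chosen for the example.

```python
import pandas as pd

# Hypothetical data: the first two emails differ only in case and whitespace.
df = pd.DataFrame({
    "email": ["Ann@Example.com", " ann@example.com ", "bob@example.com"],
    "plan": ["pro", "pro", "free"],
})

# Without normalization, the first two rows look distinct.
naive_count = int(df.duplicated(subset=["email"]).sum())

# Compare on a normalized helper column, leaving the original values untouched.
df["email_key"] = df["email"].str.strip().str.lower()
normalized_count = int(df.duplicated(subset=["email_key"]).sum())

# keep=False marks every member of a duplicate group, which is useful for
# review. Passing keep=False to drop_duplicates would delete all copies.
review = df[df.duplicated(subset=["email_key"], keep=False)]

# keep='first' retains one record per normalized key.
deduped = df.drop_duplicates(subset=["email_key"], keep="first").drop(columns="email_key")

print(naive_count, normalized_count)
print(review)
print(deduped)
```

Normalizing into a separate key column, rather than overwriting the original, preserves the raw values for auditing.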
Example
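import pandas as pd
# Small frame with one exact duplicate row (order_id 1 appears twice).
df = pd.DataFrame({'order_id': [1, 1, 2, 3], 'sales': [100, 100, 150, 200]})
rows_before = len(df)
# Drop rows that match on every column, keeping the first occurrence.
df = df.drop_duplicates(keep='first')
print(f'Removed {rows_before - len(df)} duplicate row(s)')
print(df.describe())
This snippet creates a DataFrame containing one exact duplicate row, removes it with drop_duplicates, and prints the number of rows removed along with summary statistics for the deduplicated data.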