How to remove duplicate records from data

· Category: Data Science

Short answer

Decide which columns define a duplicate, then remove the extra rows with pandas drop_duplicates. Duplicate records distort aggregations, inflate counts, and bias statistical estimates, so deduplication is a critical preprocessing step.

Steps

  1. Define what constitutes a duplicate based on all columns or a subset of key fields.
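  2. Use pandas drop_duplicates with the subset parameter to choose which columns are compared and the keep parameter ('first', 'last', or False) to choose which record is retained.
  3. Sort the data before calling drop_duplicates so that keep='first' or keep='last' deterministically retains the desired record, such as the most recent one.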
  4. Validate the reduction in row count and inspect removed records for patterns.
  5. Implement upstream fixes to prevent duplicate ingestion in the future.
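
Steps 1 to 4 can be sketched as follows. This is a minimal illustration, not a definitive implementation: the order data and the column names order_id, customer, and updated_at are assumptions made up for the example, and "keep the most recent row per key" is just one possible retention rule.

```python
import pandas as pd

# Hypothetical order data; the column names are illustrative assumptions.
df = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3],
    "customer": ["ann", "ann", "bob", "cy", "cy"],
    "updated_at": pd.to_datetime(
        ["2024-01-01", "2024-01-05", "2024-01-02", "2024-01-03", "2024-01-03"]
    ),
})

# Step 1: the fields that define a duplicate.
key_cols = ["order_id"]

# Step 3: sort so the most recent row for each key comes last.
ordered = df.sort_values(key_cols + ["updated_at"])

# Step 2: keep the last (most recent) row for each key.
deduped = ordered.drop_duplicates(subset=key_cols, keep="last")

# Step 4: validate the reduction and inspect what was removed.
removed = ordered.loc[~ordered.index.isin(deduped.index)]
print(f"rows before: {len(df)}, after: {len(deduped)}, removed: {len(removed)}")
print(removed)
```

Keeping the removed rows in their own DataFrame makes it easy to look for patterns, for example whether duplicates cluster around a particular source or date.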

Tips

  • Use hashing for fast duplicate detection in very large datasets.
  • Consider fuzzy deduplication for near-duplicate text records.
  • Keep a log of deduplication rules for auditing and reproducibility.
  • Compare before-and-after summary statistics to catch unexpected effects.
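
The hashing tip can be sketched with pandas.util.hash_pandas_object, which returns one unsigned 64-bit hash per row. The sample data and column names below are assumptions for illustration. Hash collisions are possible in principle, so treat matching hashes as duplicate candidates and confirm on the real columns when exactness matters.

```python
import pandas as pd

# Hypothetical sample data; column names are illustrative assumptions.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cy"],
    "city": ["Oslo", "Rome", "Oslo", "Lima"],
})

# One hash per row; index=False so the row label does not affect the hash.
row_hash = pd.util.hash_pandas_object(df, index=False)

# Rows whose hash has already been seen are duplicate candidates.
is_dup = row_hash.duplicated(keep="first")
deduped = df.loc[~is_dup]

print(deduped)
```

A single hash column is compact, so it can be stored alongside the data or compared across chunks when the full dataset does not fit in memory.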

Common issues

  • Partial duplicates, where only some columns match, make the definition of a duplicate ambiguous.
  • Differences in case and whitespace cause false negatives in duplicate detection.
  • Accidentally dropping useful data when keep is set incorrectly.
  • Performance bottlenecks when deduplicating datasets with millions of rows.
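
The case and whitespace issue, and the effect of the keep setting, can be illustrated with a short sketch. The email and plan columns are hypothetical, and stripping whitespace plus lowercasing is only one possible normalization rule.

```python
import pandas as pd

# Hypothetical sample data; column names are illustrative assumptions.
df = pd.DataFrame({
    "email": ["Ann@Example.com", " ann@example.com", "bob@example.com"],
    "plan": ["pro", "pro", "free"],
})

# Without normalization, the first two rows are not detected as duplicates.
raw_dup_count = int(df.duplicated(subset=["email"]).sum())

# Build a normalized comparison key; the original column is left untouched.
key = df["email"].str.strip().str.lower()

# keep="first" marks only the repeats, so one row per key survives.
deduped = df.loc[~key.duplicated(keep="first")]

# keep=False marks every member of a duplicate group; using it as a
# filter would drop both "ann" rows, which is rarely what is wanted.
all_members = key.duplicated(keep=False)

print(raw_dup_count, len(deduped), int(all_members.sum()))
```

Comparing on a separate key rather than overwriting the column preserves the original values for auditing.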

Example

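import pandas as pd

# order_id 1 appears twice; the second row is the later correction.
df = pd.DataFrame({'order_id': [1, 1, 2, 3], 'sales': [100, 150, 200, 250]})

# Treat rows with the same order_id as duplicates and keep the last one seen.
deduped = df.drop_duplicates(subset=['order_id'], keep='last')
print(len(df), len(deduped))
print(deduped)

This snippet creates a DataFrame in which order_id 1 appears twice, keeps the last row for each order_id, and prints the row counts before and after along with the deduplicated data.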