How to remove duplicate records from data

Question

QA Hub Editorial · Accepted Answer

Short answer

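Duplicate records distort aggregations, inflate counts, and bias statistical estimates, so deduplication is a critical preprocessing step. In pandas, decide which columns define a duplicate and then remove the extra rows with drop_duplicates.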

Steps

Define what constitutes a duplicate based on all columns or a subset of key fields.
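Use pandas DataFrame.drop_duplicates with the subset parameter to choose which columns are compared and the keep parameter to choose which copy is retained.
Sort the data before deduplicating (for example, by a timestamp column) so that keep='first' or keep='last' deterministically retains the desired record.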
Validate the reduction in row count and inspect removed records for patterns.
Implement upstream fixes to prevent duplicate ingestion in the future.
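
The first four steps can be sketched as follows. This is a minimal illustration, not a definitive implementation: the column names (customer_id, email, updated_at), the sample rows, and the choice to keep the most recent record are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample data: customer_id is the assumed key,
# updated_at is the assumed column that decides which copy to keep.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "email": ["a@example.com", "a@example.com", "b@example.com",
              "c@example.com", "c2@example.com"],
    "updated_at": pd.to_datetime(
        ["2024-01-01", "2024-03-01", "2024-02-01", "2024-01-15", "2024-01-10"]
    ),
})

# Step 1: the columns that define a duplicate.
key_cols = ["customer_id"]

# Step 3: sort so the most recent record for each key comes last.
ordered = df.sort_values(key_cols + ["updated_at"])

# Step 4 (inspection): the rows that will be removed,
# i.e. every copy except the last one per key.
removed = ordered[ordered.duplicated(subset=key_cols, keep="last")]

# Step 2: keep the last (most recent) record for each key.
deduped = ordered.drop_duplicates(subset=key_cols, keep="last").reset_index(drop=True)

# Step 4 (validation): kept rows plus removed rows must equal the input.
assert len(deduped) + len(removed) == len(df)
print(f"{len(df)} rows -> {len(deduped)} rows, {len(removed)} removed")
print(removed)
```

Computing the removed rows with duplicated, using the same subset and keep arguments as drop_duplicates, gives exactly the complement of the retained rows, which makes the removed records easy to inspect for patterns.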

Tips

Use hashing for fast duplicate detection in very large datasets.
Consider fuzzy deduplication for near-duplicate text records.
Keep a log of deduplication rules for auditing and reproducibility.
Compare before-and-after summary statistics to catch unexpected effects.
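
As one possible way to apply the hashing and before-and-after tips, the sketch below hashes each row with pandas.util.hash_pandas_object and drops rows whose hash has already been seen. The sample data is hypothetical. Hash collisions are very unlikely but possible, and whether hashing is actually faster than a plain drop_duplicates depends on the data, so treat this as an illustration to benchmark rather than a recommendation.

```python
import pandas as pd

# Hypothetical sample data; rows 0 and 2 are exact duplicates.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cy"],
    "city": ["Oslo", "Rome", "Oslo", "Lima"],
})

# One 64-bit hash per row; index=False so the row label does not affect the hash.
row_hash = pd.util.hash_pandas_object(df, index=False)

# Keep the first row for each distinct hash value.
deduped = df[~row_hash.duplicated(keep="first")]

# Before-and-after comparison of row counts and distinct values per column.
summary = pd.DataFrame({"before": df.nunique(), "after": deduped.nunique()})
print(f"rows: {len(df)} -> {len(deduped)}")
print(summary)
```

If the number of distinct values in a column drops after deduplication, rows were removed that were not true duplicates, which is a signal to revisit the rule.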

Common issues

Partial duplicates, where only some columns match, make the definition of a duplicate ambiguous.
Differences in case and whitespace cause false negatives in duplicate detection.
Useful data is dropped by accident when keep is set incorrectly, for example keep=False, which removes every copy of a duplicated row.
Performance bottlenecks when deduplicating datasets with millions of rows.
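
Two of these issues are easy to reproduce on a small frame. The sketch below is an illustration under assumed data: the email and plan columns and the email_key helper column are hypothetical names introduced for the example.

```python
import pandas as pd

# Hypothetical sample data: the first two emails differ only in case and whitespace.
df = pd.DataFrame({
    "email": ["ann@example.com", " Ann@Example.com ", "bob@example.com"],
    "plan": ["pro", "pro", "free"],
})

# Without normalization the first two rows are treated as different (false negative).
naive = df.drop_duplicates(subset=["email"])

# Build a normalized comparison key: trim whitespace and lowercase.
keyed = df.assign(email_key=df["email"].str.strip().str.lower())

# Deduplicate on the normalized key, then drop the helper column.
deduped = keyed.drop_duplicates(subset=["email_key"], keep="first").drop(columns="email_key")

# keep=False removes every copy of a duplicated key, not just the extras.
dropped_all = keyed.drop_duplicates(subset=["email_key"], keep=False)

print(len(naive), len(deduped), len(dropped_all))  # 3 2 1
```

Normalizing into a separate key column leaves the original values untouched, so the retained record still shows the data as it was ingested.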

Example

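import pandas as pd

# Sample data in which the record for order 2 appears twice.
df = pd.DataFrame({'order_id': [1, 2, 2, 3], 'sales': [100, 150, 150, 200]})
# Drop rows that repeat an earlier order_id, keeping the first occurrence.
deduped = df.drop_duplicates(subset=['order_id'], keep='first')
print(len(df), '->', len(deduped))
print(deduped.describe())

This snippet creates a DataFrame containing one repeated record, removes it with drop_duplicates, and prints the row counts and summary statistics as a before-and-after check.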
