How to clean messy data with pandas

· Category: Data Science

Short answer

Data cleaning transforms raw, inconsistent data into a structured format suitable for analysis by handling errors, missing values, and formatting issues.

Steps

  1. Load data and inspect the first few rows, data types, and summary statistics.
  2. Standardize column names by stripping whitespace, lowercasing, and replacing spaces with underscores.
  3. Identify and correct inconsistent entries such as mixed date formats or categorical typos.
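  4. Remove or impute missing values based on the mechanism of missingness (missing completely at random, at random, or not at random); dropping rows is generally unbiased only in the first case.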
  5. Validate the cleaned dataset by re-running summary statistics and integrity checks.
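
The steps above can be sketched end to end as follows. This is a minimal illustration, not a general recipe: the in-memory CSV, the column names (order_id, signup_date, city, amount), the typo mapping, and the choice of median imputation are all assumptions made for the example.

```python
import io
import pandas as pd

# Hypothetical raw data standing in for a real file on disk.
raw = io.StringIO(
    " Order ID,Signup Date,City,Amount\n"
    "1,2023-01-05,Boston,10.5\n"
    "2,2023-01-06,boston ,\n"
    "3,not a date,Bostn,7.0\n"
    "3,not a date,Bostn,7.0\n"
)

# Step 1: load and inspect rows, dtypes, and summary statistics.
df = pd.read_csv(raw)
print(df.head())
print(df.dtypes)
print(df.describe(include="all"))

# Step 2: standardize column names (strip, lowercase, underscores).
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Step 3: fix inconsistent entries.
# The typo mapping is an assumption; build it from the values you actually see.
df["city"] = df["city"].str.strip().str.title().replace({"Bostn": "Boston"})
# Unparseable dates become NaT so they can be counted rather than crash the run.
df["signup_date"] = pd.to_datetime(
    df["signup_date"], format="%Y-%m-%d", errors="coerce"
)

# Step 4: handle duplicates and missing values.
df = df.drop_duplicates()
# Median imputation is only one option; the right choice depends on why
# the values are missing.
df["amount"] = df["amount"].fillna(df["amount"].median())

# Step 5: validate with integrity checks and a fresh summary.
assert df["order_id"].is_unique
assert df["amount"].notna().all()
print("Unparseable dates:", df["signup_date"].isna().sum())
print(df.describe(include="all"))
```

Coercing bad dates to NaT and then counting them keeps the problem visible instead of hiding it, which fits the validation step.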

Tips

  • Never modify raw data in place; keep a copy of the original file.
  • Use regular expressions to clean text fields with complex patterns.
  • Document every transformation step for reproducibility and audit trails.
  • Apply type coercion carefully to avoid silent data loss.
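
Three of these tips can be combined in one short sketch: working on a copy, cleaning text with regular expressions, and checking what a type coercion discarded. The phone and price columns and their values are hypothetical.

```python
import pandas as pd

# Hypothetical raw values; in practice these come from the original file.
raw = pd.DataFrame({
    "phone": ["(555) 123-4567", "555.123.4567", "5551234567"],
    "price": ["19.99", "$5", "n/a"],
})

# Work on a copy so the raw frame stays untouched.
clean = raw.copy()

# Regex: drop every non-digit character from the phone numbers.
clean["phone"] = clean["phone"].str.replace(r"\D", "", regex=True)

# Remove currency symbols and thousands separators, then coerce to numbers.
price_num = pd.to_numeric(
    clean["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# Report values that coercion turned into NaN instead of losing them silently.
lost = clean.loc[price_num.isna() & clean["price"].notna(), "price"]
print("Values lost to coercion:", lost.tolist())

clean["price"] = price_num
print(clean)
```

Comparing the column before and after coercion is a cheap way to see exactly which entries were lost, and it gives you something concrete to record in the transformation log.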

Common issues

  • Hidden whitespace or non-printable characters causing merge failures.
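  • Implicit type conversions turning numeric columns into non-numeric (object) columns, for example when a single stray string such as 'unknown' appears among the numbers.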
  • Inconsistent encoding when reading files from different sources.
  • Over-cleaning that removes valid outliers or rare but legitimate values.
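
The first two issues are easy to reproduce and diagnose. The sketch below uses made-up frames and a made-up 'unknown' placeholder; the merge diagnostic relies on the indicator parameter of DataFrame.merge, and the dtype fix on the na_values parameter of read_csv.

```python
import io
import pandas as pd

# Hidden whitespace: "b " has a trailing space, "c\u00a0" a non-breaking space.
left = pd.DataFrame({"key": ["a", "b ", "c\u00a0"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "c"], "y": [10, 20, 30]})

# indicator=True adds a _merge column showing which rows found a match.
before = left.merge(right, on="key", how="left", indicator=True)
print(before["_merge"].value_counts())

# Replace non-breaking spaces with plain spaces, then strip and merge again.
left["key"] = left["key"].str.replace("\u00a0", " ", regex=False).str.strip()
after = left.merge(right, on="key", how="left", indicator=True)
print(after["_merge"].value_counts())

# Implicit type conversion: one stray token makes the column non-numeric.
csv_text = "amount\n1\n2\nunknown\n"
bad = pd.read_csv(io.StringIO(csv_text))
print(bad["amount"].dtype)

# Declaring the token as a missing-value marker restores a numeric dtype.
good = pd.read_csv(io.StringIO(csv_text), na_values=["unknown"])
print(good["amount"].dtype)
```

Checking the _merge column after every join is a quick way to catch keys that look identical on screen but do not match.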

Example

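import pandas as pd

# Load the raw file ('data.csv' and the 'date' column are placeholders).
df = pd.read_csv('data.csv')
# Drop rows containing any missing value, then drop exact duplicate rows.
df = df.dropna().drop_duplicates()
# Parse the date column; this raises if a value cannot be parsed.
df['date'] = pd.to_datetime(df['date'])
# Inspect the first rows of the cleaned frame.
print(df.head())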

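This example shows how to load a CSV, drop rows with missing values, remove duplicate rows, and parse dates in a few lines of pandas code. Note that dropna() with no arguments discards every row that has at least one missing value, so on real data it is usually better to handle missing values column by column, as described in the steps.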