How to clean messy data with pandas
Category: Data Science
Short answer
Data cleaning transforms raw, inconsistent data into a structured format suitable for analysis by handling errors, missing values, and formatting issues.
Steps
- Load data and inspect the first few rows, data types, and summary statistics.
- Standardize column names by stripping whitespace, lowercasing, and replacing spaces with underscores.
- Identify and correct inconsistent entries such as mixed date formats or categorical typos.
- Remove or impute missing values depending on why they are missing (completely at random, at random, or not at random).
- Validate the cleaned dataset by re-running summary statistics and integrity checks.
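The steps above can be sketched on a small in-memory table. This is a minimal illustration, not a definitive pipeline: the column names, the sample values, and the choice of median imputation are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw data showing the problems the steps describe.
raw = pd.DataFrame({
    " Customer Name ": ["Ann", "Bob", "Cid", "Dee"],
    "Signup Date": ["2024-01-05", "2024-02-10", "not a date", "2024-03-15"],
    "Plan": ["basic", "Basic ", "premium", "PREMIUM"],
    "Monthly Spend": [10.0, None, 30.0, 50.0],
})

df = raw.copy()  # work on a copy so the raw frame stays untouched

# Inspect: first rows and data types
print(df.head())
print(df.dtypes)

# Standardize column names: strip, lowercase, spaces -> underscores
df.columns = (
    df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)

# Correct inconsistent categorical entries (case and stray whitespace)
df["plan"] = df["plan"].str.strip().str.lower()

# Parse dates; strings that do not match the format become NaT
df["signup_date"] = pd.to_datetime(
    df["signup_date"], format="%Y-%m-%d", errors="coerce"
)

# Impute missing spend with the median (an assumption for this sketch;
# the right choice depends on why the values are missing)
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())

# Validate: simple integrity checks on the cleaned frame
assert df["plan"].isin(["basic", "premium"]).all()
assert df["monthly_spend"].notna().all()
print(df.describe(include="all"))
```

The unparseable date is kept as NaT rather than dropped, so a later step can decide whether that row should be removed or investigated.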
Tips
- Never modify raw data in place; keep a copy of the original file.
- Use regular expressions to clean text fields with complex patterns.
- Document every transformation step for reproducibility and audit trails.
- Apply type coercion carefully to avoid silent data loss; for example, count how many values become NaN after a conversion.
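One way these tips might fit together is shown below. It is a rough sketch under stated assumptions: the price column, its sample values, and the list-based log are hypothetical, and a real project would likely use a proper logging or pipeline tool.

```python
import pandas as pd

# Hypothetical raw column with currency symbols, separators and a placeholder.
raw = pd.DataFrame({"price": ["$1,200.50", " 300 ", "N/A", "45"]})
df = raw.copy()   # keep the original values untouched
log = []          # simple record of each transformation applied

# Regex: remove every character except digits, the decimal point and minus
cleaned = df["price"].str.replace(r"[^0-9.\-]", "", regex=True)
log.append("price: stripped non-numeric characters")

# Coerce to numbers; values that cannot be parsed become NaN
numeric = pd.to_numeric(cleaned, errors="coerce")

# Count values that were present before but are NaN now
lost = int(numeric.isna().sum() - df["price"].isna().sum())
log.append(f"price: coerced to numeric, {lost} value(s) became NaN")

df["price"] = numeric
print(df)
print("\n".join(log))
```

Comparing the missing-value count before and after coercion makes the loss visible instead of silent, and the log gives a minimal audit trail of what was done.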
Common issues
- Hidden whitespace or non-printable characters causing merge failures.
- Implicit type conversions, such as a single stray string causing a numeric column to be read as text (object dtype).
- Inconsistent encoding when reading files from different sources.
- Over-cleaning that removes valid outliers or rare but legitimate values.
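The first two issues can be reproduced and diagnosed with a short sketch. The frames, key values, and column names here are invented for illustration; the zero-width space (U+200B) stands in for any non-printable character that ordinary whitespace stripping does not remove.

```python
import pandas as pd

# Hypothetical tables: two keys on the left carry hidden characters.
left = pd.DataFrame({
    "id": ["A1", "B2 ", "C3\u200b"],   # trailing space, zero-width space
    "qty": ["1", "2", "three"],        # one entry is not a number
})
right = pd.DataFrame({
    "id": ["A1", "B2", "C3"],
    "region": ["north", "south", "east"],
})

# indicator=True adds a _merge column showing which rows found a match
before = left.merge(right, on="id", how="left", indicator=True)
unmatched_before = int((before["_merge"] == "left_only").sum())

# Remove zero-width spaces, then strip ordinary surrounding whitespace
left["id"] = left["id"].str.replace("\u200b", "", regex=False).str.strip()

after = left.merge(right, on="id", how="left", indicator=True)
unmatched_after = int((after["_merge"] == "left_only").sum())

# qty holds strings, not numbers; find the entries that block conversion
bad_rows = left[pd.to_numeric(left["qty"], errors="coerce").isna()]

print(unmatched_before, unmatched_after)
print(bad_rows)
```

Checking the merge indicator before and after cleaning confirms that the unmatched rows were caused by the keys rather than by genuinely missing records.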
Example
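import pandas as pd

# Load the raw CSV into a DataFrame
df = pd.read_csv('data.csv')
# Drop rows with a missing value in any column, then exact duplicate rows
df = df.dropna().drop_duplicates()
# Convert the date column from strings to datetime values
df['date'] = pd.to_datetime(df['date'])
# Preview the first rows of the cleaned data
print(df.head())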
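This example loads a CSV, drops every row that contains a missing value, removes exact duplicate rows, and parses the date column in a few lines of pandas code. Because dropna() with no arguments discards a row if any column is empty, restrict it with the subset parameter when only some columns are required.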
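A more conservative variant might look like the following sketch. It assumes a hypothetical file layout with date, amount, and note columns, where only date and amount are required; the inline CSV text stands in for data.csv so the snippet runs on its own.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for data.csv
csv_text = """date,amount,note
2024-01-05,10.5,
2024-01-05,10.5,
2024-01-09,7.0,late entry
not-a-date,3.0,
2024-02-01,,refund
"""

df = pd.read_csv(io.StringIO(csv_text))

# Remove exact duplicate rows first
df = df.drop_duplicates()

# Parse dates; strings that do not match the format become NaT
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")

# Drop rows only when a required column is missing; note may stay empty
df = df.dropna(subset=["date", "amount"])

print(df)
```

Here rows with an empty note survive, while the row with an unparseable date and the row with no amount are removed.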