How to handle missing values in a dataset
Category: Data Science
Short answer
Missing values can bias models and reduce statistical power if not handled thoughtfully through deletion, imputation, or modeling strategies.
Steps
- Identify missingness patterns using heatmaps and summary counts per column.
- Determine whether data is missing completely at random (MCAR), at random (MAR), or not at random (MNAR), since the mechanism determines which handling methods are valid.
- Delete rows or columns only when missingness is minimal and completely at random.
- Impute with mean, median, or mode for simple cases; use k-NN or iterative imputation for complex relationships.
- Flag imputed values with indicator variables so models can distinguish imputed values from observed ones.
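The steps above can be sketched on a small table. This is a minimal illustration, not a complete workflow: the DataFrame, its column names (age, income, city), and the "_was_missing" suffix are all hypothetical, and simple median/mode imputation stands in for whichever strategy suits the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31, np.nan],
    "income": [50000, 62000, np.nan, 58000, 61000],
    "city": ["Oslo", "Rome", None, "Rome", "Rome"],
})

# Summary counts and share of missing values per column.
missing_counts = df.isna().sum()
missing_share = df.isna().mean()

# Record which cells were missing BEFORE imputing, one indicator per column.
for col in ["age", "income", "city"]:
    df[f"{col}_was_missing"] = df[col].isna().astype(int)

# Simple imputation: median for numeric columns, mode for the categorical one.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])

print(missing_counts)
print(missing_share)
print(df)
```

Creating the indicator columns before calling fillna matters: once the gaps are filled, the information about which values were originally missing is gone.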
Tips
- Compare model performance across multiple imputation strategies.
- Use domain knowledge to infer reasonable values when possible.
- Avoid mean imputation for highly skewed distributions.
- Consider missingness as a feature if it carries predictive information.
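The tip about skewed distributions can be illustrated with a tiny example. The values below are made up for illustration; the point is only that a single large outlier pulls the mean far from the typical value, so the mean is a poor fill value while the median stays close to the bulk of the data.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values: one large outlier dominates the mean.
income = pd.Series([30, 32, 35, 38, 40, 1000, np.nan], dtype=float)

# Both statistics ignore the NaN by default.
mean_value = income.mean()
median_value = income.median()

mean_fill = income.fillna(mean_value)
median_fill = income.fillna(median_value)

print("value imputed with mean:  ", mean_fill.iloc[-1])
print("value imputed with median:", median_fill.iloc[-1])
```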
Common issues
- Deleting too many rows and losing statistical power.
- Mean imputation reducing variance and distorting correlations.
- Ignoring missingness indicators leading to undetected data quality problems.
- Advanced imputation methods overfitting to training data patterns.
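The effect of mean imputation on variance and correlation can be checked with a short simulation. This is a sketch under stated assumptions: the data is synthetic (y is a noisy linear function of x), 40% of y is removed completely at random, and the seed and sample size are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical data: y depends linearly on x plus noise.
x = rng.normal(size=500)
y = 2 * x + rng.normal(size=500)
df = pd.DataFrame({"x": x, "y": y})

# Remove roughly 40% of y completely at random.
mask = rng.random(500) < 0.4
df.loc[mask, "y"] = np.nan

# Statistics on the observed values only (pandas skips NaN pairs).
var_observed = df["y"].var()
corr_observed = df["x"].corr(df["y"])

# Statistics after filling every gap with the observed mean.
imputed = df["y"].fillna(df["y"].mean())
var_imputed = imputed.var()
corr_imputed = df["x"].corr(imputed)

print(f"variance:    observed={var_observed:.3f}  imputed={var_imputed:.3f}")
print(f"correlation: observed={corr_observed:.3f}  imputed={corr_imputed:.3f}")
```

Every imputed value sits exactly at the mean, so it adds no spread and no co-variation with x; the variance and the correlation with x both shrink compared with the observed data.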
Example
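import numpy as np
import pandas as pd

# One missing sales value, represented as NaN.
df = pd.DataFrame({'sales': [100, 150, 200, np.nan]})
# Fill the gap with the median of the observed values (150 here).
df['sales'] = df['sales'].fillna(df['sales'].median())
print(df.describe())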
This snippet creates a DataFrame, handles a missing value with the median, and prints summary statistics common in exploratory analysis.