How to handle missing values in a dataset

Question

QA Hub Editorial · Accepted Answer

Short answer

Missing values can bias estimates and reduce statistical power. Handle them deliberately, through deletion, imputation, or a modeling strategy that accommodates gaps, and choose the approach based on why the values are missing.

Steps

Identify missingness patterns using heatmaps and summary counts per column.
Determine whether data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR).
Delete rows or columns only when missingness is minimal and random.
Impute with mean, median, or mode for simple cases; use k-NN or iterative imputation for complex relationships.
Flag imputed values with indicator variables so models can distinguish observed values from filled-in ones.
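
The steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not a full workflow: the column names and values are made up, the `_was_missing` suffix is just a naming choice, and only the simple median/mode case is shown.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 41, 33, np.nan],
    "income": [52000, 61000, np.nan, 48000, 75000],
    "city": ["Oslo", "Lima", None, "Lima", "Lima"],
})

# Summary counts and share of missing values per column.
missing_counts = df.isna().sum()
missing_share = df.isna().mean()
print(missing_counts)

# Flag missing entries BEFORE imputing, so the information is not lost.
for col in ["age", "income", "city"]:
    df[f"{col}_was_missing"] = df[col].isna()

# Simple imputation: median for numeric columns, mode for the categorical one.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])

print(df)
```

Creating the indicator columns first matters: once the gaps are filled, there is no way to tell from the data alone which values were observed.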

Tips

Compare model performance across multiple imputation strategies.
Use domain knowledge to infer reasonable values when possible.
Avoid mean imputation for highly skewed distributions.
Consider missingness as a feature if it carries predictive information.
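
One lightweight way to compare imputation strategies, short of retraining a model for each one, is to hide some values you already know, impute them, and score the result. The sketch below does this for mean versus median on a made-up right-skewed column; the data, the hidden positions, and the helper `imputation_mae` are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed column (one very large value).
observed = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 500.0])

# Hide a few known values so the imputation can be scored against the truth.
hidden_idx = [1, 3, 5]
truth = observed.loc[hidden_idx]
masked = observed.copy()
masked.loc[hidden_idx] = np.nan

def imputation_mae(fill_value):
    """Mean absolute error when every hidden entry is filled with fill_value."""
    return float((truth - fill_value).abs().mean())

# mean() and median() skip NaN by default, so they use the remaining values only.
results = {
    "mean": imputation_mae(masked.mean()),
    "median": imputation_mae(masked.median()),
}
print(results)
```

On this toy column the single outlier pulls the mean far from typical values, so the median fill scores much better, which is the reason to avoid mean imputation on skewed data.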

Common issues

Deleting too many rows and losing statistical power.
Mean imputation reducing variance and distorting correlations.
Ignoring missingness indicators leading to undetected data quality problems.
Advanced imputation methods overfitting to training data patterns.
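
The variance and correlation problem with mean imputation is easy to see on a small example. The sketch below uses invented data in which y is exactly 2 * x, hides four y values at arbitrarily chosen positions, and compares statistics before and after filling with the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y is exactly 2 * x, so the true correlation is 1.
full = pd.DataFrame({"x": np.arange(1.0, 11.0)})
full["y"] = 2 * full["x"]

# Hide four y values to simulate missingness (positions chosen arbitrarily).
masked = full.copy()
masked.loc[[1, 4, 6, 9], "y"] = np.nan

# Mean imputation: every gap receives the same value.
imputed = masked.copy()
imputed["y"] = imputed["y"].fillna(masked["y"].mean())

var_observed = masked["y"].var()                # observed values only
var_imputed = imputed["y"].var()                # after filling with the mean
corr_observed = masked["x"].corr(masked["y"])   # uses complete pairs only
corr_imputed = imputed["x"].corr(imputed["y"])

print(f"variance: {var_observed:.2f} -> {var_imputed:.2f}")
print(f"correlation with x: {corr_observed:.2f} -> {corr_imputed:.2f}")
```

Because the filled-in points all sit at the mean regardless of x, they add no spread and flatten the relationship, so both the variance and the correlation drop.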

Example

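import pandas as pd
import numpy as np

# Small example with one missing value in 'sales'.
df = pd.DataFrame({'sales': [100, 150, 200, np.nan]})
# Fill the gap with the median of the observed values.
df['sales'] = df['sales'].fillna(df['sales'].median())
# Summary statistics now cover all four rows.
print(df.describe())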

This snippet creates a DataFrame, fills the single missing value with the median of the observed values (150), and prints the kind of summary statistics used in exploratory analysis.
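
When the missing column is related to other columns, a neighbor-based fill can do better than a single global median. The sketch below is a hand-rolled k-NN imputation written only to show the idea; it is not a library API. The helper `knn_impute_column`, the `store_size` column, and the data are all hypothetical, and the feature columns are assumed to be fully observed. In practice a maintained implementation such as scikit-learn's `KNNImputer` is the usual choice.

```python
import numpy as np
import pandas as pd

def knn_impute_column(df, target, features, k=2):
    """Fill NaNs in `target` with the mean of the k nearest complete rows.

    Distance is Euclidean over `features`, which are assumed fully observed.
    """
    out = df.copy()
    known = out[out[target].notna()]
    known_features = known[features].to_numpy(dtype=float)
    for idx in out.index[out[target].isna()]:
        row = out.loc[idx, features].to_numpy(dtype=float)
        dist = np.sqrt(((known_features - row) ** 2).sum(axis=1))
        nearest = known.iloc[np.argsort(dist, kind="stable")[:k]]
        out.loc[idx, target] = nearest[target].mean()
    return out

# Hypothetical data: sales track store size, and one sales value is missing.
df = pd.DataFrame({
    "store_size": [10.0, 12.0, 30.0, 32.0, 31.0],
    "sales": [100.0, 110.0, 300.0, 320.0, np.nan],
})

filled = knn_impute_column(df, target="sales", features=["store_size"], k=2)
print(filled)
```

Here the missing row is a large store, so its two nearest neighbors are the other large stores and the fill reflects their sales, whereas a global median would land between the small-store and large-store groups.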
