How to handle missing values in a dataset

· Category: Data Science

Short answer

Missing values can bias models and reduce statistical power unless they are handled deliberately, through deletion, imputation, or modeling strategies that account for missingness.

Steps

  1. Identify missingness patterns using heatmaps and summary counts per column.
  2. Determine whether data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR).
  3. Delete rows or columns only when missingness is minimal and random.
  4. Impute with mean, median, or mode for simple cases; use k-NN or iterative imputation for complex relationships.
  5. Flag imputed values with indicator variables so models can account for uncertainty.
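
The steps above can be sketched with pandas alone. This is a minimal illustration, not a full workflow: the toy data and the column names (age, city, score, age_was_missing) are hypothetical, the heatmap from step 1 is omitted because it needs a plotting library, and the row deletion assumes the missing score is both rare and random.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({
    "age": [34, np.nan, 29, 41, np.nan, 52],
    "city": ["Oslo", "Lima", None, "Lima", "Oslo", "Lima"],
    "score": [0.7, 0.4, 0.9, np.nan, 0.5, 0.6],
})

# Step 1: count and share of missing values per column.
missing_summary = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": df.isna().mean() * 100,
})
print(missing_summary)

# Step 5 (done before imputing): record which values were missing.
df["age_was_missing"] = df["age"].isna()

# Step 4: simple imputation, median for numeric and mode for categorical.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])

# Step 3: drop the few rows still missing a score
# (assumes this missingness is minimal and random).
df = df.dropna(subset=["score"])
print(df)
```

The indicator column is created before imputation because the information about which values were missing is lost once the gaps are filled.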

Tips

  • Compare model performance across multiple imputation strategies.
  • Use domain knowledge to infer reasonable values when possible.
  • Avoid mean imputation for highly skewed distributions.
  • Consider missingness as a feature if it carries predictive information.
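
One rough way to check whether missingness carries predictive information is to compare the target between rows with and without a missing value. The sketch below uses made-up data; the columns income, defaulted, and income_missing are hypothetical names chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: income is sometimes missing, defaulted is the target.
df = pd.DataFrame({
    "income": [50, np.nan, 62, np.nan, 48, 75, np.nan, 58],
    "defaulted": [0, 1, 0, 1, 0, 0, 1, 1],
})

# Indicator: 1 where income was missing, 0 otherwise.
df["income_missing"] = df["income"].isna().astype(int)

# Mean of the target within each group of the indicator.
rate_by_missing = df.groupby("income_missing")["defaulted"].mean()
print(rate_by_missing)
```

A large gap between the two groups suggests the indicator may be worth keeping as a feature. On a sample this small the difference proves nothing, so treat it as a prompt for further checking.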

Common issues

  • Deleting too many rows and losing statistical power.
  • Mean imputation reducing variance and distorting correlations.
  • Ignoring missingness indicators leading to undetected data quality problems.
  • Advanced imputation methods overfitting to training data patterns.
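
The effect of mean imputation on variance and correlation can be seen on synthetic data. The sketch below assumes values go missing completely at random; the variables x and y, the 30% missing rate, and the seed are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x, plus noise.
x = rng.normal(loc=10.0, scale=2.0, size=200)
y = 2.0 * x + rng.normal(scale=1.0, size=200)
df = pd.DataFrame({"x": x, "y": y})

# Remove roughly 30% of x completely at random.
mask = rng.random(200) < 0.3
df.loc[mask, "x"] = np.nan

# Statistics on the observed values only (pandas skips NaN here).
std_observed = df["x"].std()
corr_observed = df["x"].corr(df["y"])

# Same statistics after filling the gaps with the mean of x.
filled = df.assign(x=df["x"].fillna(df["x"].mean()))
std_filled = filled["x"].std()
corr_filled = filled["x"].corr(filled["y"])

print(f"std of x:   observed={std_observed:.3f}  mean-imputed={std_filled:.3f}")
print(f"corr(x, y): observed={corr_observed:.3f}  mean-imputed={corr_filled:.3f}")
```

Every imputed value sits exactly at the mean, so it adds rows without adding spread. The standard deviation of x shrinks and its correlation with y is pulled toward zero.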

Example

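import numpy as np
import pandas as pd

# One missing value in an otherwise complete column.
df = pd.DataFrame({'sales': [100, 150, 200, np.nan]})

# Fill the gap with the median of the observed values (150).
df['sales'] = df['sales'].fillna(df['sales'].median())

print(df.describe())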

This snippet creates a DataFrame with one missing value, fills it with the median of the observed values (150), and prints the summary statistics typically reviewed in exploratory analysis.