How to detect outliers in data

QA Hub Editorial · Accepted Answer

Short answer

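Outliers are observations that deviate markedly from the rest of the data and can distort models, summaries, and visualizations. Detect them by combining visual inspection with statistical rules (IQR, z-scores) or model-based methods, then investigate each flagged point before deciding what to do with it.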

Steps

Visualize distributions with box plots, histograms, and scatter plots.
Apply the IQR method to flag points below Q1 minus 1.5 times IQR or above Q3 plus 1.5 times IQR.
Calculate z-scores and flag points whose absolute z-score exceeds a threshold such as 3, that is, points more than three standard deviations from the mean.
Use isolation forests or local outlier factor for multivariate outlier detection.
Investigate flagged points to determine whether they represent errors or genuine anomalies.
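
The IQR and z-score steps above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the sales values are made up for the example, and the cutoffs (1.5 times IQR, absolute z-score above 3) are the conventional defaults named in the steps, not universal rules.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 19 typical values plus one extreme value (500).
sales = pd.Series([100, 102, 98, 105, 97, 101, 99, 103, 104, 96,
                   100, 102, 98, 101, 99, 103, 97, 100, 102, 500])

# IQR rule: flag points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
q1, q3 = sales.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = sales[(sales < lower) | (sales > upper)]

# z-score rule: flag points more than 3 standard deviations from the mean.
z = (sales - sales.mean()) / sales.std(ddof=0)
z_outliers = sales[z.abs() > 3]

print("IQR bounds:", lower, upper)
print("Flagged by IQR rule:", iqr_outliers.tolist())
print("Flagged by z-score rule:", z_outliers.tolist())
```

Both rules only flag candidates; the final step in the list, investigating each flagged point, still applies. Note that in very small samples a single extreme value inflates the standard deviation enough that no point can reach a z-score of 3, which is why the sample here has 20 observations.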

Tips

Winsorize rather than delete outliers when they contain valid extreme values.
Use robust scalers, which center on the median and scale by the IQR, because they are less sensitive to outliers than standard scaling.
Consider the context; what is an outlier in one domain may be normal in another.
Log-transform right-skewed, positive-valued data to reduce the influence of extreme values.
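
The winsorizing, robust-scaling, and log-transform tips can be sketched with plain NumPy. The data and the 5th/95th percentile limits are assumptions chosen for illustration; libraries such as SciPy (scipy.stats.mstats.winsorize) and scikit-learn (RobustScaler) offer ready-made versions, and their defaults may differ from this hand-rolled sketch.

```python
import numpy as np

# Hypothetical right-skewed sample with one valid but extreme value.
x = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 8.0, 20.0, 1000.0])

# Winsorize by clipping to the 5th and 95th percentiles (assumed limits).
# Every observation is kept; only the tails are pulled in.
low, high = np.percentile(x, [5, 95])
winsorized = np.clip(x, low, high)

# Robust scaling: center on the median and divide by the IQR.
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust_scaled = (x - median) / (q3 - q1)

# Standard scaling for comparison: the extreme value inflates the mean
# and standard deviation, squeezing the typical values into a narrow band.
standard_scaled = (x - x.mean()) / x.std()

# Log transform: log1p computes log(1 + x), so zeros are also handled.
logged = np.log1p(x)

print("winsorized:     ", np.round(winsorized, 2))
print("robust scaled:  ", np.round(robust_scaled, 2))
print("standard scaled:", np.round(standard_scaled, 2))
print("log1p:          ", np.round(logged, 2))
```

Comparing the robust-scaled and standard-scaled rows shows how much of the spread among the typical values survives each method.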

Common issues

Removing outliers automatically, which can discard important signals.
Using mean-based thresholds on heavily skewed distributions, where the mean and standard deviation are themselves pulled toward the extreme values.
Failing to investigate the root cause of detected outliers.
Confusing high-leverage points with influential outliers in regression.
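
The problem with mean-based thresholds can be made concrete with a small comparison. This sketch contrasts the ordinary z-score with a median-based alternative, the modified z-score built on the median absolute deviation (MAD). The data are made up, and the 0.6745 constant and 3.5 cutoff are a commonly cited convention, assumed here rather than taken from the answer above.

```python
import numpy as np

# Hypothetical small sample with one obvious extreme value (400).
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 11.0, 13.0, 12.0, 400.0])

# Ordinary z-score: the extreme value inflates both the mean and the
# standard deviation, so its own z-score stays just under the cutoff.
z = (x - x.mean()) / x.std()
z_flags = np.abs(z) > 3

# Modified z-score: based on the median and the MAD, which the extreme
# value barely affects.
median = np.median(x)
mad = np.median(np.abs(x - median))
modified_z = 0.6745 * (x - median) / mad
mad_flags = np.abs(modified_z) > 3.5

print("Flagged by z-score:         ", x[z_flags].tolist())
print("Flagged by modified z-score:", x[mad_flags].tolist())
```

The sketch assumes the MAD is non-zero; if more than half of the values are identical, the MAD is 0 and the modified z-score is undefined, so a different spread estimate would be needed.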

Example

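import numpy as np
import pandas as pd

# Small example column with one missing value.
df = pd.DataFrame({'sales': [100, 150, 200, np.nan]})
# Fill the missing value with the column median (NaN is skipped when computing it).
df['sales'] = df['sales'].fillna(df['sales'].median())
# Summary statistics: count, mean, std, min, quartiles, and max.
print(df.describe())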

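This snippet creates a DataFrame, fills the missing value with the column median, and prints summary statistics. It is a preparatory step rather than a detection method in itself: the quartiles reported by describe() are the inputs to the IQR rule, and the mean and standard deviation are the inputs to z-scores.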
