How to detect outliers in data

· Category: Data Science

Short answer

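Outliers are observations that deviate markedly from the rest of the data and can distort models, summaries, and visualizations. Detect them by combining visual inspection with statistical rules such as IQR fences and z-scores, or with model-based methods for multivariate data, then investigate each flagged point before deciding what to do with it.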

Steps

  1. Visualize distributions with box plots, histograms, and scatter plots.
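  2. Apply the IQR method, where IQR = Q3 - Q1, to flag points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
  3. Calculate z-scores and mark points whose absolute z-score exceeds a threshold such as 3, that is, more than three standard deviations from the mean. This works best when the data is roughly normal.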
  4. Use isolation forests or local outlier factor for multivariate outlier detection.
  5. Investigate flagged points to determine whether they represent errors or genuine anomalies.
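
Steps 2 and 3 can be sketched in a few lines of pandas. This is a minimal illustration rather than a definitive implementation: the sample values, the helper names iqr_outliers and zscore_outliers, and the default thresholds are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample: 19 evenly spaced sales values plus one extreme value.
sales = pd.Series(list(range(100, 290, 10)) + [950], name="sales")

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

def zscore_outliers(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Boolean mask: True where the absolute z-score exceeds the threshold."""
    z = (s - s.mean()) / s.std(ddof=0)  # population standard deviation
    return z.abs() > threshold

iqr_mask = iqr_outliers(sales)
z_mask = zscore_outliers(sales)

# Print the flagged values so they can be inspected, not silently dropped.
print("IQR rule flags:", sales[iqr_mask].tolist())
print("z-score rule flags:", sales[z_mask].tolist())
```

On this sample both rules flag only the value 950. Returning a boolean mask instead of a filtered Series keeps the original data intact, which makes the investigation in step 5 easier.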

Tips

  • Winsorize rather than delete outliers when they contain valid extreme values.
  • Use robust scalers that are less sensitive to outliers than standard scaling.
  • Consider the context; what is an outlier in one domain may be normal in another.
  • Log-transform skewed data to reduce the influence of extreme values.
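
The tips above can be sketched as follows. The sample data and the 5th/95th percentile caps are assumptions chosen for illustration, and the robust scaling is computed by hand as (value - median) / IQR rather than with any particular library scaler.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 19 evenly spaced values plus one extreme value.
sales = pd.Series(list(range(100, 290, 10)) + [950], dtype=float)

# Winsorize: cap values at the 5th and 95th percentiles instead of deleting them.
lower, upper = sales.quantile(0.05), sales.quantile(0.95)
winsorized = sales.clip(lower=lower, upper=upper)

# Log-transform: log1p compresses the right tail (inputs must be greater than -1).
logged = np.log1p(sales)

# Robust scaling: center on the median and divide by the IQR,
# so the extreme value has little effect on the scale of the other points.
iqr = sales.quantile(0.75) - sales.quantile(0.25)
robust_scaled = (sales - sales.median()) / iqr

print(winsorized.max(), round(logged.max(), 3), robust_scaled.median())
```

All three transformations keep every row, so the extreme observation stays available for later inspection while its influence on downstream statistics shrinks.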

Common issues

  • Removing outliers automatically, which can discard important signals.
  • Using mean-based thresholds on heavily skewed distributions, where extreme values inflate both the mean and the standard deviation.
  • Failing to investigate the root cause of detected outliers.
  • Confusing high-leverage points with influential outliers in regression.
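
The mean-based threshold pitfall can be seen on a small sample. The sketch below compares an ordinary z-score with a median/MAD-based "modified z-score", a robust alternative that is not one of the methods listed in this article. The data, the 0.6745 scaling constant, and the 3.5 cutoff are assumptions following a common convention, not fixed rules.

```python
import numpy as np

# Hypothetical small sample with one extreme value.
values = np.array([100, 120, 130, 150, 160, 170, 180, 200, 210, 950], dtype=float)

# Mean-based z-score: the extreme value inflates both the mean and the
# standard deviation, so its own z-score stays below the threshold of 3.
z = (values - values.mean()) / values.std()
mean_flags = np.abs(z) > 3

# Modified z-score: uses the median and the median absolute deviation (MAD),
# which are barely affected by a single extreme value.
median = np.median(values)
mad = np.median(np.abs(values - median))
modified_z = 0.6745 * (values - median) / mad
robust_flags = np.abs(modified_z) > 3.5

print("mean-based flags:", values[mean_flags].tolist())
print("median/MAD flags:", values[robust_flags].tolist())
```

Here the mean-based rule flags nothing, while the median/MAD rule flags 950. With only ten points and a population standard deviation, no single value can reach an absolute z-score above 3, which is one reason to prefer robust statistics or the IQR rule on small or skewed samples.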

Example

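import numpy as np
import pandas as pd

# Small sample with one missing value
df = pd.DataFrame({'sales': [100, 150, 200, np.nan]})

# Fill the gap with the median, which is less sensitive to extreme values than the mean
df['sales'] = df['sales'].fillna(df['sales'].median())

# Summary statistics: count, mean, std, min, quartiles, max
print(df.describe())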

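This snippet creates a DataFrame, fills a missing value with the median, and prints summary statistics. It is a preparation step rather than outlier detection itself: the min, max, and quartiles reported by describe() give a first hint of extreme values and supply the Q1 and Q3 needed for the IQR method.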