How to detect outliers in data
Category: Data Science
Short answer
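Outliers are observations that deviate markedly from the rest of the data and can distort models, summaries, and visualizations. To detect them, visualize the distribution, apply a rule such as the IQR fences or a z-score threshold, and then investigate each flagged point before deciding what to do with it.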
Steps
- Visualize distributions with box plots, histograms, and scatter plots.
- Apply the IQR method, where IQR = Q3 - Q1, to flag points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
- Calculate z-scores and flag points whose absolute z-score exceeds a threshold such as 3, that is, points more than three standard deviations from the mean.
- Use an isolation forest or the local outlier factor (LOF) for multivariate outlier detection.
- Investigate flagged points to determine whether they represent errors or genuine anomalies.
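The IQR and z-score steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a definitive implementation: the function names `iqr_outliers` and `zscore_outliers` and the sample readings are made up for this example, and the z-score uses the population standard deviation (NumPy's default). The multivariate methods are not covered here.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

def zscore_outliers(values, threshold=3.0):
    """Boolean mask of points whose absolute z-score exceeds the threshold."""
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# Made-up sample: 15 typical readings and one extreme value.
readings = np.array(
    [12, 13, 12, 14, 13, 15, 14, 13, 12, 14, 13, 15, 14, 13, 12, 95],
    dtype=float,
)

print(readings[iqr_outliers(readings)])     # points flagged by the IQR rule
print(readings[zscore_outliers(readings)])  # points flagged by the z-score rule
```

On this sample both rules flag only the value 95. The flagged points are candidates for investigation, not automatic deletions.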
Tips
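- Winsorize outliers (cap them at chosen percentiles) rather than delete them when they are valid extreme values.
- Use robust scalers, which center on the median and scale by the IQR, because they are less sensitive to outliers than standard scaling with the mean and standard deviation.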
- Consider the context; what is an outlier in one domain may be normal in another.
- Log-transform right-skewed, non-negative data to reduce the influence of extreme values.
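The winsorizing, robust scaling, and log-transform tips can be sketched with plain NumPy. The sample values and the 10th/90th percentile limits are assumptions chosen for illustration; suitable limits depend on the data, and libraries such as scikit-learn offer a ready-made `RobustScaler` for the scaling step.

```python
import numpy as np

# Made-up sample: nine ordinary values and one extreme value.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0])

# Winsorize: cap values at the 10th and 90th percentiles instead of dropping them.
low, high = np.percentile(values, [10, 90])
winsorized = np.clip(values, low, high)

# Robust scaling: center on the median and divide by the IQR.
q1, q3 = np.percentile(values, [25, 75])
robust_scaled = (values - np.median(values)) / (q3 - q1)

# Log transform: log1p computes log(1 + x), so zeros are handled safely.
logged = np.log1p(values)

print(winsorized.max(), robust_scaled.round(2), logged.round(2), sep="\n")
```

All three keep every row in the dataset. They differ in what they change: winsorizing alters only the extreme values, robust scaling rescales every value, and the log transform compresses the whole upper tail.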
Common issues
- Removing outliers automatically, which can discard important signals.
- Using mean-based thresholds on heavily skewed distributions, where the mean and standard deviation are themselves pulled by the extreme values.
- Failing to investigate the root cause of detected outliers.
- Confusing high-leverage points (unusual predictor values) with influential points (those that materially change the fitted model) in regression.
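The problem with mean-based thresholds can be shown on a small sample. The sketch below uses made-up data with one extreme value rather than a truly skewed distribution, but the mechanism is the same: the extreme value inflates the mean and standard deviation, so its own z-score stays below 3. As one possible alternative, it also computes a modified z-score based on the median and the median absolute deviation (MAD), with the commonly cited cutoff of 3.5.

```python
import numpy as np

# Made-up sample: five similar values and one obvious extreme.
values = np.array([10.0, 11.0, 12.0, 10.0, 11.0, 1000.0])

# Classic z-score: the extreme value inflates both the mean and the std.
z = (values - values.mean()) / values.std()
z_flags = np.abs(z) > 3

# Modified z-score: median and MAD are barely affected by the extreme value.
median = np.median(values)
mad = np.median(np.abs(values - median))
# Note: this divides by MAD, so it assumes MAD is not zero.
modified_z = 0.6745 * (values - median) / mad
mad_flags = np.abs(modified_z) > 3.5

print(values[z_flags])    # flagged by the mean-based rule
print(values[mad_flags])  # flagged by the median-based rule
```

Here the mean-based rule flags nothing, while the median-based rule flags 1000.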
Example
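import numpy as np
import pandas as pd
# Small sales column with one missing value
df = pd.DataFrame({'sales': [100, 150, 200, np.nan]})
# Fill the gap with the median, which is less sensitive to extreme values than the mean
df['sales'] = df['sales'].fillna(df['sales'].median())
# Summary statistics: count, mean, std, min, quartiles, and max
print(df.describe())
This snippet creates a DataFrame, fills the missing value with the median (150), and prints summary statistics. The min, quartiles, and max in that output are a quick first check for suspicious extremes during exploratory analysis.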