How to check assumptions for statistical tests
Category: Data Science
Short answer
Most parametric tests assume a particular distribution (often normality), equal variances across groups, and independent observations. Verify each assumption before trusting the resulting p-values and confidence intervals.
Steps
- Test normality of the residuals (or of each group) with Shapiro-Wilk or Anderson-Darling, and inspect a Q-Q plot.
- Assess homoscedasticity with Levene's test or by plotting residuals versus fitted values.
- Check independence by examining data collection methods and autocorrelation functions.
- Review sample size to determine whether asymptotic approximations apply, for example whether the central limit theorem makes the test robust to mild non-normality.
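- If assumptions are violated, apply a transformation or switch to a robust or non-parametric alternative (for example, Welch's t-test for unequal variances or the Mann-Whitney U test for non-normal data).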
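The steps above can be sketched for a two-group comparison as follows. This is a minimal illustration, assuming SciPy is available; the helper names (`lag1_autocorr`, `check_two_sample_assumptions`, `choose_test`), the 0.05 threshold, and the synthetic data are assumptions made for the example, and the decision rule is only one possible convention, not a definitive procedure.

```python
import numpy as np
from scipy import stats


def lag1_autocorr(x):
    """Lag-1 autocorrelation; values far from 0 hint at dependence in ordered data."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.sum(centered[1:] * centered[:-1]) / np.sum(centered ** 2))


def check_two_sample_assumptions(a, b, alpha=0.05):
    """Shapiro-Wilk on each group, Levene's test (median-centered) across groups."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    _, p_norm_a = stats.shapiro(a)
    _, p_norm_b = stats.shapiro(b)
    _, p_var = stats.levene(a, b, center="median")
    return {
        "normal_a": bool(p_norm_a > alpha),
        "normal_b": bool(p_norm_b > alpha),
        "equal_var": bool(p_var > alpha),
    }


def choose_test(a, b, alpha=0.05):
    """Hypothetical decision rule: t-test if both groups look normal, else Mann-Whitney U."""
    checks = check_two_sample_assumptions(a, b, alpha)
    if checks["normal_a"] and checks["normal_b"]:
        name = "Student t-test" if checks["equal_var"] else "Welch t-test"
        result = stats.ttest_ind(a, b, equal_var=checks["equal_var"])
    else:
        name = "Mann-Whitney U"
        result = stats.mannwhitneyu(a, b, alternative="two-sided")
    return name, float(result.pvalue)


# Synthetic, strongly right-skewed groups (illustrative data only).
rng = np.random.default_rng(0)
group_a = rng.exponential(scale=1.0, size=200)
group_b = rng.exponential(scale=1.0, size=200) + 0.5

test_name, p_value = choose_test(group_a, group_b)
print(test_name, p_value)
print("lag-1 autocorrelation of group_a:", lag1_autocorr(group_a))
```

The formal checks here complement, rather than replace, a Q-Q plot and a residuals-versus-fitted plot.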
Tips
- Use graphical checks in addition to formal tests: with large samples, formal tests often reject for trivial deviations, and with small samples they have little power to detect real ones.
- Log-transform right-skewed, strictly positive data to improve normality and stabilize variance.
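- Use bootstrap confidence intervals when distributional assumptions are doubtful; note that the standard bootstrap still assumes independent observations.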
- Report assumption checks transparently to strengthen credibility.
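The log-transform and bootstrap tips can be illustrated with NumPy alone. The sketch below is an assumption-laden example: `sample_skewness` and `bootstrap_ci` are hypothetical helper names, the data are synthetic log-normal values, and the interval is a simple percentile bootstrap rather than a more refined variant.

```python
import numpy as np


def sample_skewness(x):
    """Moment-based skewness; roughly 0 for symmetric data, positive for right skew."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)


def bootstrap_ci(x, stat=np.median, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval; `stat` must accept an `axis` argument."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample with replacement: one row of indices per bootstrap replicate.
    idx = rng.integers(0, len(x), size=(n_resamples, len(x)))
    estimates = stat(x[idx], axis=1)
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(estimates, [tail, 1.0 - tail])
    return float(lower), float(upper)


# Synthetic right-skewed, strictly positive data (illustrative only).
rng = np.random.default_rng(42)
sales = rng.lognormal(mean=5.0, sigma=1.0, size=500)
log_sales = np.log(sales)

print("skewness before log:", sample_skewness(sales))
print("skewness after log: ", sample_skewness(log_sales))
print("95% bootstrap CI for the median:", bootstrap_ci(sales))
```

A result on the log scale describes ratios rather than differences, so back-transform or state the scale when reporting.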
Common issues
- Over-reliance on formal normality tests with large sample sizes.
- Ignoring clustered or hierarchical structures that violate independence.
- Applying transformations that complicate coefficient interpretation.
- Continuing with parametric tests after clear assumption violations.
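One way to spot the clustering issue listed above is the intraclass correlation (ICC): values well above 0 mean observations within a cluster resemble each other, so treating them as independent overstates the effective sample size. The sketch below computes the one-way ANOVA estimator ICC(1); the function name `icc_oneway`, the balanced-cluster restriction, and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def icc_oneway(values, groups):
    """ICC(1) from a one-way ANOVA; this sketch assumes equal-size clusters."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    sizes = np.array([np.sum(groups == g) for g in labels])
    if np.unique(sizes).size != 1:
        raise ValueError("this sketch only handles balanced clusters")
    k = int(sizes[0])   # observations per cluster
    n = len(labels)     # number of clusters
    means = np.array([values[groups == g].mean() for g in labels])
    grand_mean = values.mean()
    # Between-cluster and within-cluster mean squares.
    msb = k * np.sum((means - grand_mean) ** 2) / (n - 1)
    ss_within = sum(
        np.sum((values[groups == g] - m) ** 2) for g, m in zip(labels, means)
    )
    msw = ss_within / (n * (k - 1))
    return float((msb - msw) / (msb + (k - 1) * msw))


# Synthetic example: 20 clusters of 10 observations each.
rng = np.random.default_rng(1)
n_clusters, per_cluster = 20, 10
groups = np.repeat(np.arange(n_clusters), per_cluster)
cluster_effect = rng.normal(0.0, 5.0, size=n_clusters)

clustered = cluster_effect[groups] + rng.normal(0.0, 1.0, size=groups.size)
independent = rng.normal(0.0, 1.0, size=groups.size)

print("ICC, clustered data:  ", icc_oneway(clustered, groups))
print("ICC, independent data:", icc_oneway(independent, groups))
```

When the ICC is clearly non-zero, a mixed-effects model or cluster-robust standard errors is usually a better fit than a test that assumes independent observations.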
Example
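import numpy as np
import pandas as pd

# Small sales series with one missing value.
df = pd.DataFrame({'sales': [100, 150, 200, np.nan]})
# Fill the gap with the median of the observed values (150).
df['sales'] = df['sales'].fillna(df['sales'].median())
# Summary statistics: count, mean, std, min, quartiles, max.
print(df.describe())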
This snippet creates a DataFrame, fills the missing value with the median of the observed values (150), and prints summary statistics, a common exploratory step before running assumption checks.