How to check assumptions for statistical tests

· Category: Data Science

Short answer

Statistical tests rely on assumptions about data distribution, variance, and independence that must be verified to ensure valid conclusions.

Steps

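  1. Test normality of the residuals (or of the values within each group, not the pooled data) with Shapiro-Wilk, Anderson-Darling, or visual Q-Q plots.
  2. Assess homoscedasticity with Levene's test when comparing groups, or by plotting residuals versus fitted values for regression models.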
  3. Check independence by examining data collection methods and autocorrelation functions.
  4. Review sample size to determine whether asymptotic approximations apply.
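  5. If assumptions are violated, apply a transformation or switch to a robust or non-parametric alternative (for example Welch's t-test for unequal variances, or Mann-Whitney U and Kruskal-Wallis for non-normal data).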
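
The steps above can be sketched for a two-group comparison as follows. This is a minimal illustration rather than a prescribed workflow: the data are simulated, the names group_a, group_b, and alpha are hypothetical, and the 0.05 cutoff is an assumption. Choosing a test from the outcome of pre-tests is itself debated, so treat the branching at the end as one possible convention.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: two independent groups, stored in collection order
group_a = rng.normal(loc=10.0, scale=2.0, size=40)
group_b = rng.normal(loc=11.0, scale=2.0, size=40)

alpha = 0.05  # assumed significance threshold for the checks

# Step 1: Shapiro-Wilk normality test within each group
shapiro_a = stats.shapiro(group_a)
shapiro_b = stats.shapiro(group_b)
normal_ok = bool(shapiro_a.pvalue > alpha and shapiro_b.pvalue > alpha)

# Step 2: Levene's test for equal variances (median-centered variant)
levene_res = stats.levene(group_a, group_b, center="median")
equal_var_ok = bool(levene_res.pvalue > alpha)

# Step 3: rough independence check via lag-1 autocorrelation.
# Only meaningful if the array order reflects the order of collection.
lag1 = np.corrcoef(group_a[:-1], group_a[1:])[0, 1]

# Step 5: pick a test based on the checks
if normal_ok:
    # equal_var=False gives Welch's t-test
    result = stats.ttest_ind(group_a, group_b, equal_var=equal_var_ok)
    chosen = "Student t-test" if equal_var_ok else "Welch t-test"
else:
    result = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
    chosen = "Mann-Whitney U"

print(f"Shapiro p-values: {shapiro_a.pvalue:.3f}, {shapiro_b.pvalue:.3f}")
print(f"Levene p-value: {levene_res.pvalue:.3f}")
print(f"Lag-1 autocorrelation (group_a): {lag1:.3f}")
print(f"Chosen test: {chosen}, p = {result.pvalue:.4f}")
```

A design decision about independence (step 3) usually cannot be made from the numbers alone; the lag-1 value is only a quick screen and does not replace a review of how the data were collected.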

Tips

  • Use graphical checks in addition to formal tests, since formal tests on large samples often reject trivial deviations and on small samples often miss real ones.
  • Log-transform right-skewed, strictly positive data to improve normality and stabilize variance.
  • Use bootstrap methods when assumptions are uncertain.
  • Report assumption checks transparently to strengthen credibility.
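
As a rough illustration of the log-transform and bootstrap tips, the sketch below compares skewness before and after a log transform and builds a percentile bootstrap confidence interval for the median. The lognormal sample, the variable names, and the choice of 2000 resamples are all assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible

# Hypothetical right-skewed, strictly positive sample
x = rng.lognormal(mean=3.0, sigma=0.8, size=200)

# Skewness before and after the log transform (log requires x > 0)
skew_raw = stats.skew(x)
skew_log = stats.skew(np.log(x))

# Percentile bootstrap: resample with replacement and recompute the median
n_boot = 2000  # assumed number of resamples
boot_medians = np.empty(n_boot)
for i in range(n_boot):
    resample = rng.choice(x, size=x.size, replace=True)
    boot_medians[i] = np.median(resample)

# 95% interval from the 2.5th and 97.5th percentiles of the bootstrap medians
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])

print(f"Skewness raw: {skew_raw:.2f}, after log: {skew_log:.2f}")
print(f"Sample median: {np.median(x):.2f}")
print(f"95% bootstrap CI for the median: [{ci_low:.2f}, {ci_high:.2f}]")
```

The bootstrap avoids a distributional assumption about the statistic, but it still assumes the observations are independent, so it does not fix clustered or autocorrelated data.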

Common issues

  • Over-reliance on formal normality tests with large sample sizes.
  • Ignoring clustered or hierarchical structures that violate independence.
  • Applying transformations that complicate coefficient interpretation.
  • Continuing with parametric tests after clear assumption violations.
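
The first issue can be explored with a small simulation. The sketch below draws a large sample from a distribution that is close to normal (a t distribution with 20 degrees of freedom, chosen here purely as an assumption) and reports both the Shapiro-Wilk p-value and the correlation coefficient of the normal Q-Q plot. With samples this large the formal test may flag a deviation even when the Q-Q correlation stays very close to 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # fixed seed so the sketch is reproducible

# Hypothetical large sample with slightly heavier tails than a normal distribution
x = rng.standard_t(df=20, size=5000)

# Formal test: sensitive to small deviations when n is large
shapiro_res = stats.shapiro(x)

# Graphical check summarized as a number: probplot returns the Q-Q points
# and the least-squares fit (slope, intercept, correlation r)
(theoretical_q, ordered_x), (slope, intercept, r) = stats.probplot(x, dist="norm")

print(f"Shapiro-Wilk p-value: {shapiro_res.pvalue:.4f}")
print(f"Q-Q plot correlation r: {r:.4f}")
```

Plotting theoretical_q against ordered_x gives the Q-Q plot itself, which shows where any deviation sits (for example in the tails) rather than just whether one exists.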

Example

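import pandas as pd
import numpy as np

# Small example DataFrame with one missing value
df = pd.DataFrame({'sales': [100, 150, 200, np.nan]})
# Fill the missing value with the column median (NaN is skipped when computing it)
df['sales'] = df['sales'].fillna(df['sales'].median())
# Summary statistics: count, mean, std, min, quartiles, max
print(df.describe())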

This snippet creates a DataFrame, fills the missing value with the column median (150 here), and prints summary statistics, a common exploratory step before checking assumptions.