How to apply the central limit theorem

· Category: Data Science

Short answer

The central limit theorem states that the sampling distribution of the mean of independent observations approaches a normal distribution as the sample size grows, regardless of the shape of the population distribution, provided the population has a finite variance.
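
Stated a little more formally, as a sketch under the usual textbook assumptions: the observations X_1, ..., X_n are independent and identically distributed with mean μ and finite variance σ² > 0. These symbols are notation introduced here for illustration, not taken from the text above.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
\frac{\sqrt{n}\,\bigl(\bar{X}_n - \mu\bigr)}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1)
\quad \text{as } n \to \infty .
```

Informally, for large n the sample mean is approximately normal with mean μ and variance σ²/n.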

Steps

  1. Collect many independent random samples, each of size n, from the population.
  2. Compute the mean of each sample.
  3. Plot the distribution of these sample means (for example, as a histogram).
  4. Observe that as n increases, the distribution looks increasingly normal and its variance shrinks in proportion to 1/n.
  5. Use this normality to construct confidence intervals and perform hypothesis tests on the mean.
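
The steps above can be sketched as a small simulation. This is a minimal illustration, not a definitive recipe: the exponential population with mean 2, the sample sizes, the helper name sample_means, and the 1.96 multiplier for an approximate 95% interval are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_means(n, n_samples=5_000):
    """Draw n_samples independent samples of size n and return their means."""
    # Hypothetical population: exponential with mean 2 and standard deviation 2.
    draws = rng.exponential(scale=2.0, size=(n_samples, n))
    return draws.mean(axis=1)

# Steps 1-4: the spread of the sample means should track sigma / sqrt(n).
spread = {}
for n in (5, 30, 200):
    means = sample_means(n)
    spread[n] = means.std(ddof=1)
    print(f"n={n:>3}  sd of means={spread[n]:.3f}  sigma/sqrt(n)={2.0 / np.sqrt(n):.3f}")

# Step 5: approximate 95% confidence interval for the mean from one sample.
one_sample = rng.exponential(scale=2.0, size=200)
half_width = 1.96 * one_sample.std(ddof=1) / np.sqrt(one_sample.size)
ci = (one_sample.mean() - half_width, one_sample.mean() + half_width)
print(f"95% CI for the mean: ({ci[0]:.3f}, {ci[1]:.3f})")
```

In practice only one sample is usually available, so step 5 relies on the theorem rather than on repeating the sampling; the loop is there only to make the shrinking spread visible.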

Tips

  • The theorem applies to sums and means, not necessarily to other statistics.
  • Larger samples are needed for highly skewed or heavy-tailed distributions.
  • Independence of observations is crucial; violations can invalidate the result.
  • The theorem justifies using z-tests and t-tests for large samples even when the underlying data is non-normal.
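
To illustrate the second tip, the sketch below measures how skewed the sample means still are at several sample sizes. The lognormal population, the sample sizes, and the helper names skewness and skew_of_means are assumptions made for this example; a skewness near 0 suggests a roughly symmetric, more normal-looking distribution.

```python
import numpy as np

rng = np.random.default_rng(7)

def skewness(x):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

def skew_of_means(n, n_samples=5_000):
    # Hypothetical population: lognormal(0, 1), which is strongly right-skewed.
    draws = rng.lognormal(mean=0.0, sigma=1.0, size=(n_samples, n))
    return skewness(draws.mean(axis=1))

skews = {n: skew_of_means(n) for n in (5, 50, 500)}
for n, s in skews.items():
    print(f"n={n:>3}  skewness of sample means={s:.2f}")
```

For a population this skewed, the sample means at small n remain visibly asymmetric, which is why a fixed rule of thumb for "large enough" n can mislead.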

Common issues

  • Applying the theorem to small samples from highly non-normal populations.
  • Violating independence through clustered or time-series sampling.
  • Confusing the distribution of sample means with the distribution of individual observations.
  • Ignoring the approximation error that remains when n is only moderate, especially in the tails of the distribution.
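
The second issue in the list above can be made concrete with a small coverage check. This is a sketch under stated assumptions: the data follow a hypothetical AR(1) time series with coefficient phi (phi = 0 gives independent data), and the function names ar1_series and naive_coverage are invented for the example. It counts how often a naive 95% interval, which assumes independence, contains the true mean of 0.

```python
import numpy as np

rng = np.random.default_rng(3)

def ar1_series(n, phi):
    """Simulate a mean-zero AR(1) series x[t] = phi * x[t-1] + noise (phi=0 is i.i.d.)."""
    noise = rng.normal(size=n)
    x = np.empty(n)
    x[0] = noise[0]
    for t in range(1, n):
        x[t] = phi * x[t - 1] + noise[t]
    return x

def naive_coverage(phi, n=200, trials=2_000):
    """Fraction of naive 95% intervals (assuming independence) that contain the true mean, 0."""
    hits = 0
    for _ in range(trials):
        x = ar1_series(n, phi)
        half_width = 1.96 * x.std(ddof=1) / np.sqrt(n)
        hits += abs(x.mean()) <= half_width
    return hits / trials

coverage_iid = naive_coverage(phi=0.0)
coverage_ar1 = naive_coverage(phi=0.7)
print(f"independent data: {coverage_iid:.3f}")
print(f"AR(1), phi=0.7:   {coverage_ar1:.3f}")
```

With positively correlated observations the naive standard error understates the true variability of the mean, so the interval covers the true value noticeably less often than the nominal 95%.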

Example

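import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 10,000 samples of size 50 from a right-skewed exponential population (mean 100, sd 100)
samples = rng.exponential(scale=100, size=(10_000, 50))
df = pd.DataFrame({'mean_sales': samples.mean(axis=1)})
print(df.describe())

This snippet simulates 10,000 samples of 50 sales values each from a skewed population, computes the mean of every sample, and prints summary statistics of those sample means. Their average should be close to the population mean of 100 and their standard deviation close to 100/sqrt(50), about 14.1, with the median near the mean because the distribution of sample means is roughly symmetric.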