How to understand and use p-values correctly

· Category: Data Science

Short answer

A p-value is the probability of observing data at least as extreme as the sample data, assuming the null hypothesis is true. It is not the probability that the null hypothesis itself is true.
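
As a minimal sketch of this definition, the exact two-sided p-value for a hypothetical coin-flip experiment (60 heads in 100 flips, with a fair coin as the null hypothesis) can be computed from the binomial distribution using only the Python standard library. The numbers and the function name `binomial_two_sided_p` are illustrative assumptions, not part of the original text.

```python
from math import comb

def binomial_two_sided_p(heads, flips):
    """Exact two-sided p-value for H0: the coin is fair (p = 0.5)."""
    expected = flips / 2
    observed_distance = abs(heads - expected)
    # Count every outcome at least as far from the expected number of
    # heads as the observed one; under H0 each sequence of flips is
    # equally likely, so dividing by 2**flips gives the probability.
    extreme_outcomes = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_distance
    )
    return extreme_outcomes / 2 ** flips

# Hypothetical data: 60 heads in 100 flips.
p_value = binomial_two_sided_p(heads=60, flips=100)
print(f"p = {p_value:.4f}")
```

The sum runs over both tails because "as extreme" here means "as far from 50 heads in either direction"; a one-sided question would sum only one tail.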

Steps

  1. Formulate a clear null hypothesis before conducting the analysis.
  2. Choose an appropriate statistical test based on data type and distribution.
  3. Compute the p-value from the test statistic using the relevant distribution.
  4. Compare the p-value to a predetermined significance level alpha.
  5. Report the exact p-value, effect size, and confidence interval rather than just significance.
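
The steps above can be sketched end to end as follows. This example swaps in a permutation test for the difference in means (rather than a parametric test such as a t-test) so that it runs on the standard library alone; the two samples, `alpha`, and all function names are hypothetical. It reports the p-value together with an effect size (Cohen's d) and a percentile bootstrap confidence interval.

```python
import random
import statistics

def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation p-value for H0: both groups share one distribution."""
    rng = random.Random(seed)
    observed = statistics.fmean(a) - statistics.fmean(b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # relabel the observations at random
        diff = statistics.fmean(pooled[:len(a)]) - statistics.fmean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value above zero.
    return observed, (extreme + 1) / (n_resamples + 1)

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_var ** 0.5

def bootstrap_ci(a, b, n_resamples=5_000, level=0.95, seed=0):
    """Percentile bootstrap interval for the difference in means."""
    rng = random.Random(seed)
    diffs = sorted(
        statistics.fmean(rng.choices(a, k=len(a)))
        - statistics.fmean(rng.choices(b, k=len(b)))
        for _ in range(n_resamples)
    )
    lower = diffs[int((1 - level) / 2 * n_resamples)]
    upper = diffs[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper

# Hypothetical measurements; alpha is fixed before looking at the data.
alpha = 0.05
treatment = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0]
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]

diff, p_value = permutation_p_value(treatment, control)
effect = cohens_d(treatment, control)
ci_low, ci_high = bootstrap_ci(treatment, control)

print(f"difference in means = {diff:.2f}")
print(f"p = {p_value:.4f} (alpha = {alpha})")
print(f"Cohen's d = {effect:.2f}")
print(f"95% bootstrap CI = ({ci_low:.2f}, {ci_high:.2f})")
```

With samples this small the bootstrap interval is only a rough guide; the point of the sketch is the reporting pattern, not the specific method.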

Tips

  • Treat p-values as continuous measures of evidence, not binary switches.
  • A non-significant p-value does not prove the null hypothesis is true.
  • Replicate studies to confirm findings rather than relying on a single p-value.
  • Combine p-values with domain knowledge and study design quality.

Common issues

  • Treating p less than 0.05 as automatic proof of a real effect.
  • P-hacking by selectively reporting analyses that achieved significance.
  • Ignoring practical significance when effects are statistically significant but tiny.
  • Misreading the p-value as the probability that the null hypothesis is true.
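
To illustrate why selective reporting is a problem, the following simulation is a rough sketch under simple assumptions: every simulated study measures 20 outcomes with no real effect, each outcome is tested with a two-sided z-test (known standard deviation of 1), and a study is counted as a "finding" if any outcome reaches p < 0.05. All names and parameter values are hypothetical.

```python
import random
from statistics import NormalDist, fmean

def z_test_p_value(sample):
    """Two-sided z-test p-value for H0: mean = 0, with known sd = 1."""
    z = fmean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

def share_with_false_positive(n_studies=1000, n_outcomes=20, n=20,
                              alpha=0.05, seed=1):
    """Fraction of null-only studies with at least one p-value below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Every outcome is pure noise, so any "significant" result is false.
        p_values = [
            z_test_p_value([rng.gauss(0, 1) for _ in range(n)])
            for _ in range(n_outcomes)
        ]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_studies

false_positive_share = share_with_false_positive()
# In expectation this is 1 - 0.95**20, about 0.64, far above alpha.
print(f"studies with at least one p < 0.05: {false_positive_share:.2f}")
```

Reporting only the smallest of many p-values therefore produces a "significant" result most of the time even when nothing is there, which is why the number of analyses run should be disclosed alongside the results.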

Example

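import pandas as pd
import numpy as np

# Small example frame with one missing sales value
df = pd.DataFrame({'sales': [100, 150, 200, np.nan]})
# Fill the missing value with the column median (150; NaN is skipped by default)
df['sales'] = df['sales'].fillna(df['sales'].median())
# Summary statistics: count, mean, std, min, quartiles, max
print(df.describe())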

This snippet creates a DataFrame, handles a missing value with the median, and prints summary statistics common in exploratory analysis.