How to understand and use p-values correctly
Category: Data Science
Short answer
A p-value is the probability of observing data at least as extreme as the sample data, assuming the null hypothesis is true. It is a statement about the data given the hypothesis, not about the hypothesis given the data.
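Written as a formula, the definition might look like the following. The symbols are assumptions introduced here for illustration: T is the test statistic, t_obs is its observed value, and H_0 is the null hypothesis.

```latex
% One-sided p-value: probability, under H_0, of a statistic at least as large as the one observed
p = P\left(T \ge t_{\mathrm{obs}} \mid H_0\right)

% Two-sided p-value: extremeness is measured in both directions
p = P\left(|T| \ge |t_{\mathrm{obs}}| \mid H_0\right)
```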
Steps
- Formulate a clear null hypothesis before conducting the analysis.
- Choose an appropriate statistical test based on data type and distribution.
- Compute the p-value from the test statistic using the relevant distribution.
- Compare the p-value to a predetermined significance level alpha.
- Report the exact p-value, effect size, and confidence interval rather than just significance.
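The steps above can be sketched in Python as follows. This is a minimal illustration, not a prescribed workflow: the two groups are made-up numbers, Welch's t-test is just one reasonable choice of test for two independent samples, and Cohen's d with a pooled standard deviation is one of several possible effect-size measures.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for two independent groups (illustrative data only).
control = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 11.9, 12.2])
treatment = np.array([12.6, 12.9, 12.4, 13.1, 12.7, 12.8, 12.3, 13.0])

# Significance level, fixed before looking at the data.
alpha = 0.05

# Null hypothesis: both groups have the same mean.
# Welch's t-test (equal_var=False) does not assume equal variances.
result = stats.ttest_ind(treatment, control, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

# Effect size: Cohen's d using the pooled standard deviation.
n1, n2 = len(treatment), len(control)
v1, v2 = treatment.var(ddof=1), control.var(ddof=1)
diff = treatment.mean() - control.mean()
pooled_sd = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
cohens_d = diff / pooled_sd

# Confidence interval for the mean difference,
# using the Welch-Satterthwaite degrees of freedom.
se = np.sqrt(v1 / n1 + v2 / n2)
dof = se**4 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
margin = stats.t.ppf(1 - alpha / 2, dof) * se
ci_low, ci_high = diff - margin, diff + margin

# Report the exact p-value alongside the effect size and interval.
print(f"t = {t_stat:.3f}, p = {p_value:.4g}")
print(f"mean difference = {diff:.3f}, Cohen's d = {cohens_d:.2f}")
print(f"{100 * (1 - alpha):.0f}% CI for the difference: [{ci_low:.3f}, {ci_high:.3f}]")
```

Reporting all three numbers lets a reader judge both how surprising the data are under the null hypothesis and how large the difference plausibly is.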
Tips
- Treat p-values as continuous measures of evidence, not binary switches.
- A non-significant p-value does not prove the null hypothesis is true.
- Replicate studies to confirm findings rather than relying on a single p-value.
- Combine p-values with domain knowledge and study design quality.
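To see why a non-significant result does not prove the null hypothesis, and why replication matters, one can simulate many small studies in which a real effect exists. The sketch below uses assumed values throughout (sample size, effect size, and number of studies are arbitrary choices for illustration).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# All of these settings are assumptions chosen for illustration.
n_studies = 5000    # number of simulated studies
n_per_group = 15    # small sample in each group
true_shift = 0.3    # real difference in means, in standard-deviation units
alpha = 0.05

# Each row is one study; the null hypothesis is false in every one of them.
control = rng.normal(0.0, 1.0, size=(n_studies, n_per_group))
treatment = rng.normal(true_shift, 1.0, size=(n_studies, n_per_group))

# One Welch t-test per study (row).
p_values = stats.ttest_ind(treatment, control, axis=1, equal_var=False).pvalue

# Share of studies that reach significance, i.e. the empirical power.
share_significant = np.mean(p_values < alpha)
print(f"Studies with p < {alpha}: {share_significant:.1%}")
print(f"Studies with p >= {alpha}: {1 - share_significant:.1%}")
```

With samples this small, most simulated studies fail to reach significance even though the effect is real, so a single large p-value says little on its own.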
Common issues
- Treating p less than 0.05 as automatic proof of a real effect.
- P-hacking by selectively reporting analyses that achieved significance.
- Ignoring practical significance when effects are statistically significant but tiny.
- Misreading the p-value as the probability that the null hypothesis is true.
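Two of these issues are easy to demonstrate by simulation. The sketch below is illustrative only, with assumed sample sizes and effect sizes. Part 1 runs many tests where no effect exists and reports only the smallest p-value, which mimics selective reporting. Part 2 tests a tiny effect in a very large sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable
alpha = 0.05

# Part 1: selective reporting when the null hypothesis is true everywhere.
# Assumed setup: 2000 projects, each running 20 independent tests on 30 points per group.
n_projects, n_tests, n = 2000, 20, 30
a = rng.normal(size=(n_projects, n_tests, n))
b = rng.normal(size=(n_projects, n_tests, n))
p_null = stats.ttest_ind(a, b, axis=-1).pvalue  # shape: (n_projects, n_tests)

# False-positive rate of one honest test versus "report the best of 20".
single_test_rate = np.mean(p_null < alpha)
best_of_many_rate = np.mean(p_null.min(axis=1) < alpha)
print(f"False positives, single test: {single_test_rate:.1%}")
print(f"False positives, best of {n_tests} tests: {best_of_many_rate:.1%}")

# Part 2: a tiny effect becomes "significant" with enough data.
# Assumed setup: true mean difference of 0.02 standard deviations.
n_large = 200_000
x = rng.normal(0.00, 1.0, size=n_large)
y = rng.normal(0.02, 1.0, size=n_large)
p_tiny = stats.ttest_ind(y, x).pvalue
d_tiny = (y.mean() - x.mean()) / np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
print(f"Tiny effect: p = {p_tiny:.2g}, Cohen's d = {d_tiny:.3f}")
```

A single honest test stays near the nominal 5% error rate, while picking the best of 20 tests produces a "significant" result in most projects. In Part 2 the p-value is very small even though the standardized effect is negligible, which is why effect size belongs next to the p-value.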
Example
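import pandas as pd
import numpy as np
# Small illustrative dataset with one missing value.
df = pd.DataFrame({'sales': [100, 150, 200, np.nan]})
# Fill the gap with the median of the observed values (150 here; NaN is skipped).
df['sales'] = df['sales'].fillna(df['sales'].median())
# Summary statistics: count, mean, std, min, quartiles, max.
print(df.describe())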
This snippet creates a DataFrame, fills the missing value with the column median, and prints summary statistics. It is a typical exploratory step before a hypothesis test and does not compute a p-value itself.