Statistical Foundations
& Probability Theory
The bedrock of everything. Most analytics errors — wrong conclusions, over-confident predictions, misleading dashboards — trace back to misunderstood statistics. This chapter builds an unshakeable foundation.
Why Statistics Is Not What You Think It Is
Most people learn statistics backwards. They learn formulas first, intuition last — if ever. The result is analysts who can run a t-test but can't tell you what it actually means. This chapter fixes that.
Statistics is not a collection of tests. It is a framework for reasoning under uncertainty. Every statistical procedure answers the same fundamental question: "Given that I've only seen a sample of reality, what can I conclude about all of reality?"
You want to know something about a population (all customers, all users, all possible outcomes), but you can only observe a sample (the data you actually have). Statistics is the machinery for making rigorous claims about populations from samples — while being honest about the uncertainty that gap creates.
1.1 Probability — The Language of Uncertainty
Two Interpretations That Lead to Different Statistics
There are two fundamentally different ways to interpret probability, and they lead to entirely different statistical frameworks. Understanding the difference is not academic — it determines how you should analyze data.
Frequentist Probability
- Definition: Long-run frequency of an event over infinite repetitions
- P(heads) = 0.5 means: flip this coin infinitely, ~50% are heads
- Parameters are fixed (unknown but not random)
- Data is random; parameters are not
- Gives us: p-values, confidence intervals, hypothesis tests
- Problem: Can't make probability statements about parameters themselves
Bayesian Probability
- Definition: Degree of belief or plausibility of a statement
- P(heads) = 0.5 means: I believe heads and tails are equally plausible
- Parameters are random variables with distributions
- Both data and parameters are random
- Gives us: posterior distributions, credible intervals, model comparison
- Advantage: Can say "there's a 94% probability the effect is positive"
A frequentist meteorologist says: "70% chance of rain" means that on days with conditions like today, it rains 70% of the time historically. The probability is about the repeatability of the setup.
A Bayesian meteorologist says: "70% chance of rain" means I have 70% confidence it will rain today, given all the evidence I've incorporated (satellite data, models, past patterns). The probability is about belief, updated by evidence.
Practical takeaway: Most business analytics lives in frequentist-land (p-values, confidence intervals). Most modern machine learning lives in Bayesian-land (posterior distributions, uncertainty quantification). You need fluency in both.
The Probability Axioms — The Foundation of Everything
All of probability theory rests on three axioms introduced by Kolmogorov in 1933. These are not empirical observations — they are definitions from which everything else is derived.
From these three axioms, every probability rule follows — the complement rule, conditional probability, Bayes' theorem, the law of total probability. Understanding this hierarchy means you can always derive what you need rather than memorizing formulas.
Conditional Probability & Bayes' Theorem
Conditional probability is how we update beliefs when new evidence arrives. It is the engine of Bayesian reasoning and the mathematical foundation of most machine learning.
Bayes' theorem tells us how to rationally update beliefs when we see evidence. In analytics: θ is whatever you want to know (conversion rate, effect size, whether a customer will churn), data is what you've observed, prior is what you believed before seeing the data, and posterior is your updated belief. Every time you build a model, run an experiment, or form a business conclusion, you are — whether you know it or not — doing Bayesian reasoning.
1.2 Distributions — The Shape of Reality
A probability distribution describes all possible values a random variable can take and how likely each is. Choosing the right distribution to model your data is one of the most important and most neglected skills in analytics.
Discrete Distributions You Must Know
| Distribution | Parameters | When to Use | Real Example |
|---|---|---|---|
| Bernoulli | p |
Single binary outcome | Did a user click? Did a customer churn this month? |
| Binomial | n, p |
Count of successes in n trials | Number of sales in 100 calls; number of defects in a batch |
| Poisson | λ |
Count of rare events per unit time/space | Customer support tickets per hour; server errors per day |
| Geometric | p |
Number of trials until first success | How many emails until first reply; retries until connection |
| Negative Binomial | r, p |
Overdispersed count data (variance > mean) | Purchases per customer (most buy 0, few buy many) |
Continuous Distributions You Must Know
| Distribution | Parameters | When to Use | Real Example |
|---|---|---|---|
| Normal (Gaussian) | μ, σ |
Sums of independent variables, measurement error | Heights, IQ scores, prediction errors after log transform |
| Log-Normal | μ, σ |
Positive data with right skew; multiplicative processes | Income, website session duration, stock prices, latency |
| Exponential | λ |
Time between Poisson events | Time between purchases; time to failure; customer lifetime |
| Beta | α, β |
Probabilities and proportions (0 to 1) | Conversion rate uncertainty; click-through rate priors |
| Gamma | α, β |
Positive skewed data, waiting times | Insurance claim amounts; time to complete a task |
| Pareto (Power Law) | α, x_m |
Heavy-tailed phenomena, "80-20 rule" data | Revenue by customer; social media followers; city sizes |
The most common mistake in analytics is assuming everything is normally distributed. Revenue, session duration, purchase amounts, customer lifetime value, social engagement — virtually all business metrics are log-normal, exponential, or power-law distributed. When you apply techniques that assume normality to these variables (like standard means and CIs for revenue), you get systematically wrong answers.
Rule of thumb: If a metric can't be negative and has a long right tail, it's probably log-normal. Take the log, analyze that, then exponentiate back.
1.3 The Central Limit Theorem — Why Normal Appears Everywhere
The Central Limit Theorem (CLT) is arguably the most important theorem in all of statistics. It explains why the normal distribution appears so often and justifies most frequentist inferential procedures.
You're measuring something that follows a bizarre, jagged, unknown distribution. You don't know its true mean μ. You take samples of size n=100 repeatedly and compute each sample's mean. Those sample means — despite coming from a weird distribution — will form a normal distribution centered at the true μ. This is what lets us construct confidence intervals and run hypothesis tests without knowing the underlying distribution of our data.
The catch: The CLT applies to means and sums. Not to the individual observations, not to percentiles, not to maxima. A common mistake is applying CLT-based tests to raw skewed data rather than to means.
1.4 Hypothesis Testing — The Most Misused Tool in Analytics
Hypothesis testing is the procedure for deciding, based on data, whether an observed effect is "real" or could plausibly be explained by random chance. It is also the most abused tool in all of analytics, responsible for a replication crisis across multiple fields.
The Formal Framework
H₀ is the "boring" hypothesis — no effect, no difference, nothing interesting. Example: "The new checkout flow has no effect on conversion rate vs the old one." Always try to reject H₀, never accept it.
What you believe is actually true. "The new checkout flow has a different conversion rate." Can be one-tailed (direction specified) or two-tailed (either direction).
α is the false positive rate you're willing to accept. Convention is α=0.05 (5% chance of claiming an effect that doesn't exist). This must be chosen before you see the data — otherwise you're p-hacking.
The test statistic measures how far your observed data is from what H₀ predicts, in units of standard error. The p-value is the probability of seeing data this extreme (or more) if H₀ were true.
If p < α: reject H₀ (result is "statistically significant"). If p ≥ α: fail to reject H₀. Note: you never "accept" H₀ — absence of evidence is not evidence of absence.
What the p-value Actually Means
A p-value is NOT:
- The probability that H₀ is true
- The probability that your result is due to chance
- The probability that you're wrong
- A measure of the size or importance of an effect
A p-value IS: P(seeing data this extreme | H₀ is true). It assumes H₀ is true and asks how surprising your data would be under that assumption. A p-value of 0.03 means: "If there truly were no effect, there would be only a 3% chance of seeing an effect this large in a study of this size."
Type I & Type II Errors — The Error Tradeoff
Statistical Power — The Forgotten Half of Testing
Power is the probability of detecting a real effect when one exists (1 - β). It depends on three things: effect size, sample size, and α. Under-powered studies are a major source of irreproducible results.
1.5 Confidence Intervals — Better Than p-values
A 95% confidence interval (CI) contains the true parameter value in 95% of intervals computed using this procedure. Not that there's a 95% chance the specific interval you computed contains the true value — that's the Bayesian credible interval.
Imagine you build nets that catch fish 95% of the time. You cast your net and either catch or don't catch the fish (the true parameter). Any specific cast either catches it or doesn't — there's no probability about this cast. But the procedure (the type of net) is reliable 95% of the time. That's what "95% confidence" means — it's a statement about the procedure, not about your specific interval.
The Bayesian alternative — a credible interval — does give you what you intuitively want: "there's a 95% probability the true value is in this range."
1.6 Effect Size — What Actually Matters Beyond Significance
Statistical significance tells you whether an effect is distinguishable from zero. Effect size tells you whether it's large enough to matter. With enough data, every trivial difference becomes statistically significant. Always report effect sizes.
| Effect Size Measure | Formula | Use Case | Small / Medium / Large |
|---|---|---|---|
| Cohen's d | (μ₁ - μ₂) / σ_pooled |
Comparing two group means | 0.2 / 0.5 / 0.8 |
| Pearson's r | Cov(X,Y) / (σ_X × σ_Y) |
Linear correlation strength | 0.1 / 0.3 / 0.5 |
| η² (Eta-squared) | SS_effect / SS_total |
ANOVA: variance explained | 0.01 / 0.06 / 0.14 |
| Odds Ratio | (a/b) / (c/d) |
Binary outcomes, 2×2 tables | Context-dependent |
| Relative Risk | P(outcome|exposed) / P(outcome|unexposed) |
Clinical/business experiments | Context-dependent |
1.7 Multiple Testing — The Silent Killer of Analytics
If you run 20 hypothesis tests with α=0.05, you expect one false positive purely by chance, even if no effects are real. This is the multiple testing problem, and it is rampant in business analytics — dashboards with 50 metrics, A/B tests that look at many segments, exploratory analyses that test dozens of features.
Corrections for Multiple Testing
| Method | Controls | How | When to Use |
|---|---|---|---|
| Bonferroni | FWER | α_adj = α/m | Few tests, need strict control (e.g. safety trials) |
| Holm-Bonferroni | FWER | Sequential Bonferroni | Slightly more powerful than Bonferroni |
| Benjamini-Hochberg | FDR | Sort p-values, adjust threshold | Many tests, exploratory analysis, genomics |
| Benjamini-Yekutieli | FDR | BH with dependence correction | When tests are correlated (e.g. correlated features) |
In business analytics, use Benjamini-Hochberg (BH) by default for any analysis involving multiple tests. It controls the False Discovery Rate (FDR) — the expected proportion of your "significant" findings that are actually false positives. Setting FDR at 10% means that among your significant results, you expect 10% to be wrong — usually acceptable for business decisions.
1.8 The Replication Crisis and What It Means for Analytics
Between 2011 and 2020, large-scale replication attempts found that fewer than 50% of published psychology findings and fewer than 60% of cancer biology findings could be successfully replicated. The causes are relevant to every analyst:
- P-hacking: Running many analyses and reporting only p < 0.05 results
- HARKing: Hypothesizing After Results are Known — treating exploratory findings as confirmatory
- Small samples: Underpowered studies produce inflated effect size estimates even when they get lucky and find p < 0.05
- Publication bias: Only significant results get published (or shared in business settings, acted on)
- Researcher degrees of freedom: Many undisclosed choices in analysis (which outliers to remove, which covariates to include) inflate Type I error
Pre-register your analyses: Write down your hypothesis, sample size, test statistic, and α before collecting data. Report effect sizes and CIs, not just p-values. Distinguish exploration from confirmation: EDA and hypothesis generation must be separated from hypothesis testing on the same data. Use holdout sets: Any pattern found on training data must be validated on data the analysis hasn't seen.
Industry Applications of Statistical Foundations
2.1 E-Commerce — Conversion Rate Optimization
An e-commerce company wants to test whether a new checkout button color improves conversion rate from 3.2% to 3.5% or better.
Why statistics matters here, precisely:
- With 10,000 visitors, a 0.3pp difference is indistinguishable from noise — a test this size has less than 20% power to detect it
- Running the test for only 2 days and peeking at results inflates false positive rate from 5% to >40%
- Testing 10 button colors simultaneously requires multiple testing correction or you'll ship a "winner" that was random
- Conversion rate is Binomial(n, p) distributed — CLT applies well for large n, but not for rare events (e.g. checkout completion = 0.2%)
These companies use sequential testing (see Research section) instead of fixed-horizon tests. This allows valid early stopping when an effect is large, without inflating Type I error. They also use CUPED (Controlled-experiment Using Pre-Experiment Data) to reduce variance by controlling for pre-experiment behavior, increasing power by 30-50% at no extra sample cost.
2.2 Healthcare Analytics — Drug Efficacy Studies
A pharmaceutical company analyzes whether a new drug reduces recovery time. This is the domain where misunderstood statistics has the highest stakes.
Critical statistical choices:
- Primary endpoint pre-registration: Must specify "reduction in hospital stay ≥ 2 days" before seeing data
- Intention-to-treat analysis: Include all randomized patients, even dropouts, to avoid selection bias
- Adaptive trial design: Modern trials use Bayesian adaptive designs to update sample allocation as evidence accumulates
- NNT (Number Needed to Treat): More interpretable than p-values — "treat 8 patients to prevent 1 additional death"
2.3 Financial Services — Risk Modeling
A bank models the probability of loan default to set credit limits and pricing.
Distribution selection is critical:
- Default rates are Beta-distributed (bounded 0-1), not normally distributed
- Loss amounts follow power-law or extreme value distributions — the tails are everything for risk
- VaR (Value at Risk) at 99% relies on tail behavior, which normal distribution assumptions severely underestimate (see 2008 financial crisis)
- Copulas are used to model joint distributions of correlated defaults — ignoring correlation was a primary cause of CDO mispricing in 2007
2.4 Technology — Product Analytics
A SaaS company monitors 150 product metrics on a weekly dashboard and wants to detect meaningful changes.
The multiple testing trap in action:
- With 150 metrics at α=0.05, expect 7-8 false alarms per week even if nothing changed
- Most companies' dashboards with "red/green" alerts are generating massive false positive noise
- Solution: Benjamini-Hochberg correction across all monitored metrics, or Bayesian anomaly detection that gives posterior probability rather than binary alert
2.5 Manufacturing — Quality Control
A factory monitors product dimensions to detect when a machine drifts out of calibration.
Classic application of control charts (Shewhart charts):
- UCL/LCL (Upper/Lower Control Limits) are set at μ ± 3σ — based on CLT, only 0.27% of in-control samples fall outside these limits
- But this assumes measurements are normal and independent — both fail for autocorrelated process data (use ARIMA-based SPC instead)
- CUSUM charts detect small persistent shifts faster than Shewhart charts by accumulating evidence over time
When NOT to Use Classic Frequentist Statistics
- Observational data: p-values from observational studies do not establish causation — confounders invalidate the inference
- Peeking at ongoing experiments: Sequential testing requires corrections — vanilla t-tests on peeked data inflate Type I error dramatically
- Very large samples: Everything becomes significant — effect size matters more than significance
- Non-exchangeable units: Users in a social network influence each other — independence assumption fails (use cluster-based variance estimation)
- Rare events: Binomial test breaks down for p < 0.01 — use exact tests or Poisson approximations
Building a Statistical Analysis Toolkit from Scratch
3.1 Fitting & Diagnosing Distributions
The first step in any analysis is understanding what distribution your data actually follows. Never assume — fit and test.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
from scipy.stats import (norm, lognorm, expon, gamma,
kstest, anderson, shapiro)
# ─── Generate sample business data (replace with your data) ───
np.random.seed(42)
# Revenue data: highly right-skewed, like real purchase amounts
revenue_data = np.random.lognormal(mean=4.5, sigma=1.2, size=1000)
def fit_and_compare_distributions(data, distributions=None):
"""
Fit multiple distributions to data and rank them by goodness-of-fit.
Uses Kolmogorov-Smirnov test: measures maximum distance between
empirical CDF and fitted theoretical CDF.
Smaller KS statistic = better fit.
"""
if distributions is None:
distributions = [
('Normal', norm),
('Log-Normal', lognorm),
('Exponential', expon),
('Gamma', gamma),
]
results = []
for name, dist in distributions:
# Fit distribution — MLE by default
params = dist.fit(data)
# KS test: H0 = data comes from this distribution
ks_stat, p_value = kstest(data, dist.cdf, args=params)
# AIC for model selection (penalizes extra parameters)
log_likelihood = np.sum(dist.logpdf(data, *params))
k = len(params)
aic = 2*k - 2*log_likelihood
results.append({
'distribution': name,
'params': params,
'ks_statistic': ks_stat,
'p_value': p_value,
'aic': aic,
'fits_well': p_value > 0.05 # fail to reject = good fit
})
# Sort by AIC (lower = better)
df = pd.DataFrame(results).sort_values('aic')
return df
result_df = fit_and_compare_distributions(revenue_data)
print(result_df[['distribution', 'ks_statistic', 'p_value', 'aic', 'fits_well']])
# Expected output:
# distribution ks_statistic p_value aic fits_well
# Log-Normal 0.018 0.89 8823 True ← best fit
# Gamma 0.031 0.28 8967 True
# Exponential 0.118 0.000 9801 False
# Normal 0.209 0.000 10234 False ← terrible fit!
def plot_distribution_diagnostics(data, dist_name='lognorm'):
"""4-panel diagnostic plot for distribution assessment."""
dist = getattr(stats, dist_name)
params = dist.fit(data)
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle(f'Distribution Diagnostics: {dist_name}', fontsize=14)
# 1. Histogram with fitted PDF
ax = axes[0,0]
ax.hist(data, bins=50, density=True, alpha=0.6, color='steelblue', label='Data')
x = np.linspace(data.min(), data.max(), 200)
ax.plot(x, dist.pdf(x, *params), 'r-', lw=2, label=f'Fitted {dist_name}')
ax.set_title('Histogram + Fitted PDF'); ax.legend()
# 2. QQ Plot — if points fall on diagonal, distribution fits well
ax = axes[0,1]
stats.probplot(data, dist=dist, sparams=params[:-2], plot=ax)
ax.set_title('Q-Q Plot (points on line = good fit)')
# 3. Empirical vs Theoretical CDF
ax = axes[1,0]
sorted_data = np.sort(data)
empirical_cdf = np.arange(1, len(data)+1) / len(data)
theoretical_cdf = dist.cdf(sorted_data, *params)
ax.plot(sorted_data, empirical_cdf, label='Empirical CDF')
ax.plot(sorted_data, theoretical_cdf, 'r--', label='Theoretical CDF')
ax.set_title('CDF Comparison'); ax.legend()
# 4. Log scale histogram (reveals tail behavior)
ax = axes[1,1]
ax.hist(np.log(data[data > 0]), bins=50, density=True,
color='purple', alpha=0.6)
ax.set_title('Log-transformed Data (should look normal for lognormal)')
plt.tight_layout()
return fig
3.2 Hypothesis Testing — The Complete Toolkit
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import (ttest_ind, ttest_rel, mannwhitneyu,
chi2_contingency, fisher_exact,
levene, bartlett, shapiro)
import warnings
def choose_and_run_test(group_a: np.ndarray, group_b: np.ndarray,
paired: bool = False,
alpha: float = 0.05) -> dict:
"""
Automatically selects the appropriate two-group comparison test
based on data properties, then runs it.
Decision tree:
1. Are groups paired? → Paired t-test or Wilcoxon signed-rank
2. Are groups normally distributed? (Shapiro-Wilk)
3. Are variances equal? (Levene's test)
4. Based on above: select t-test variant or Mann-Whitney U
"""
result = {'test_chosen': None, 'statistic': None,
'p_value': None, 'reject_h0': None,
'effect_size': None, 'notes': []}
n_a, n_b = len(group_a), len(group_b)
# Step 1: Check normality (only reliable for n < 5000)
if n_a < 5000 and n_b < 5000:
_, p_norm_a = shapiro(group_a[:5000])
_, p_norm_b = shapiro(group_b[:5000])
is_normal = (p_norm_a > 0.05) and (p_norm_b > 0.05)
else:
# CLT kicks in for large n, normality assumption holds
is_normal = True
result['notes'].append('Large n: invoking CLT for normality')
# Step 2: Check equal variances (Levene's is robust to non-normality)
_, p_var = levene(group_a, group_b)
equal_var = p_var > 0.05
# Step 3: Choose test
if paired:
if is_normal:
stat, p = ttest_rel(group_a, group_b)
result['test_chosen'] = 'Paired t-test'
else:
stat, p = stats.wilcoxon(group_a, group_b)
result['test_chosen'] = 'Wilcoxon Signed-Rank'
elif is_normal:
if equal_var:
stat, p = ttest_ind(group_a, group_b, equal_var=True)
result['test_chosen'] = "Student's t-test"
else:
# Welch's t-test: does NOT assume equal variance
stat, p = ttest_ind(group_a, group_b, equal_var=False)
result['test_chosen'] = "Welch's t-test (unequal variance)"
else:
# Non-parametric alternative: tests median difference
stat, p = mannwhitneyu(group_a, group_b, alternative='two-sided')
result['test_chosen'] = 'Mann-Whitney U (non-parametric)'
# Cohen's d effect size (for t-tests)
pooled_std = np.sqrt((np.std(group_a, ddof=1)**2 +
np.std(group_b, ddof=1)**2) / 2)
cohens_d = (np.mean(group_a) - np.mean(group_b)) / pooled_std
result.update({
'statistic': stat,
'p_value': p,
'reject_h0': p < alpha,
'effect_size': cohens_d,
'effect_interpretation': (
'negligible' if abs(cohens_d) < 0.2 else
'small' if abs(cohens_d) < 0.5 else
'medium' if abs(cohens_d) < 0.8 else
'large'
),
'equal_variance': equal_var,
'normal_data': is_normal,
})
return result
# ─── Example: A/B test on conversion value ────────────────────
np.random.seed(42)
control = np.random.lognormal(4.0, 1.0, 500) # control group
treatment = np.random.lognormal(4.1, 1.0, 500) # 10% higher mean
test_result = choose_and_run_test(control, treatment)
for k, v in test_result.items():
if v is not None: print(f" {k}: {v}")
# Output:
# test_chosen: Welch's t-test (unequal variance)
# p_value: 0.031
# reject_h0: True
# effect_size: 0.138
# effect_interpretation: small
#
# Conclusion: Statistically significant (p=0.031) but SMALL effect.
# The new treatment adds ~10% to revenue BUT the practical question is:
# Is a Cohen's d of 0.138 worth the implementation cost?
3.3 Power Analysis — Design Your Study Right
from statsmodels.stats.power import TTestIndPower, NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize
import numpy as np
import matplotlib.pyplot as plt
def ab_test_sample_size(
baseline_rate: float, # current conversion rate, e.g. 0.05
minimum_effect: float, # smallest effect worth detecting, e.g. 0.01
alpha: float = 0.05,
power: float = 0.80,
two_tailed: bool = True
) -> dict:
"""
Calculate required sample size for a proportion A/B test.
This is the FIRST thing you should do before running any experiment.
Running without power analysis is how you waste months on
underpowered tests that give you garbage results.
"""
treatment_rate = baseline_rate + minimum_effect
# Cohen's h for proportions
effect_size = proportion_effectsize(treatment_rate, baseline_rate)
analysis = NormalIndPower()
n_per_group = analysis.solve_power(
effect_size = effect_size,
power = power,
alpha = alpha,
alternative = 'two-sided' if two_tailed else 'larger'
)
n_per_group = int(np.ceil(n_per_group))
return {
'n_per_group': n_per_group,
'total_n': n_per_group * 2,
'effect_size_h': effect_size,
'baseline': baseline_rate,
'treatment': treatment_rate,
'relative_lift': minimum_effect / baseline_rate,
'days_needed': None # fill in: n / daily_traffic
}
# ─── Real scenarios ────────────────────────────────────────────
scenarios = [
("Checkout CTA color", 0.032, 0.003), # 3.2% → 3.5%
("Pricing page layout", 0.050, 0.010), # 5.0% → 6.0%
("Email subject line", 0.220, 0.020), # 22% → 24% open rate
]
print(f"{'Experiment':<25} {'N/group':>8} {'Total N':>8} {'Relative lift':>14}")
print("-"*60)
for name, base, lift in scenarios:
r = ab_test_sample_size(base, lift)
print(f"{name:<25} {r['n_per_group']:>8,} {r['total_n']:>8,} {r['relative_lift']:>13.1%}")
# Output:
# Experiment N/group Total N Relative lift
# ------------------------------------------------------------
# Checkout CTA color 29,140 58,280 9.4%
# Pricing page layout 4,414 8,828 20.0%
# Email subject line 3,055 6,110 9.1%
#
# Key insight: Detecting small absolute lifts on low base rates
# requires MASSIVE sample sizes. Most companies run A/B tests
# 10-50x underpowered. Their "winners" are mostly noise.
# ─── Power curve: visualize power across sample sizes ─────────
def plot_power_curve(effect_sizes=[0.1, 0.2, 0.3, 0.5]):
analysis = TTestIndPower()
n_range = np.arange(10, 500, 10)
plt.figure(figsize=(10, 6))
for d in effect_sizes:
powers = [analysis.power(d, n, alpha=0.05) for n in n_range]
plt.plot(n_range, powers, label=f"Cohen's d = {d}", lw=2)
plt.axhline(0.8, color='red', linestyle='--', label='80% power threshold')
plt.xlabel('Sample size per group'); plt.ylabel('Statistical Power')
plt.title('Power Curves: How much data do you need?')
plt.legend(); plt.grid(alpha=0.3)
return plt.gcf()
3.4 Multiple Testing Correction — Protecting Against False Discoveries
import numpy as np
import pandas as pd
from statsmodels.stats.multitest import multipletests
def analyze_multiple_metrics(
p_values: list,
metric_names: list,
alpha: float = 0.05,
method: str = 'fdr_bh' # Benjamini-Hochberg FDR
) -> pd.DataFrame:
"""
Apply multiple testing correction to a batch of p-values.
Methods available:
'bonferroni' - FWER control, very conservative
'holm' - FWER, step-down, less conservative than Bonferroni
'fdr_bh' - Benjamini-Hochberg FDR, best for exploratory analysis
'fdr_by' - Benjamini-Yekutieli, FDR with dependence correction
"""
reject, p_corrected, _, _ = multipletests(
p_values, alpha=alpha, method=method
)
df = pd.DataFrame({
'metric': metric_names,
'p_raw': p_values,
'p_corrected': p_corrected,
'significant': reject
}).sort_values('p_raw')
df['p_raw'] = df['p_raw'].round(4)
df['p_corrected'] = df['p_corrected'].round(4)
return df
# ─── Simulate a product dashboard with 20 metrics ─────────────
np.random.seed(42)
# 18 null effects + 2 real effects
p_values = (
list(np.random.uniform(0, 1, 18)) + # noise
[0.003, 0.018] # real signals
)
metric_names = [f"metric_{i}" for i in range(18)] + ["dau", "revenue_per_user"]
# Without correction: naive α=0.05
naive_significant = sum(1 for p in p_values if p < 0.05)
# With BH correction
result = analyze_multiple_metrics(p_values, metric_names)
bh_significant = result['significant'].sum()
print(f"Without correction: {naive_significant} 'significant' results")
print(f"With BH correction: {bh_significant} significant results")
print("\nTop findings after correction:")
print(result[result['significant']][['metric','p_raw','p_corrected']])
# Output:
# Without correction: 3 'significant' results ← 1 false positive!
# With BH correction: 2 significant results ← correct!
#
# metric p_raw p_corrected
# revenue_per_user 0.003 0.060 ← still significant
# dau 0.018 0.180 ← still significant
#
# The correction eliminates the noise and keeps the real signals.
3.5 The Bootstrap — When Formulas Don't Exist
The bootstrap is perhaps the most powerful and underused tool in the statistician's toolkit. When the distribution of a statistic is unknown or hard to derive analytically (median, correlation, custom metrics), the bootstrap estimates it empirically.
import numpy as np
from scipy.stats import bootstrap as scipy_bootstrap
def bootstrap_ci(
data: np.ndarray,
statistic: callable,
n_boot: int = 10_000,
ci: float = 0.95,
method: str = 'percentile'
) -> dict:
"""
Bootstrap confidence interval for ANY statistic.
Works for: mean, median, std, Gini coefficient,
AUC, custom revenue metrics — anything you can compute.
method options:
'percentile' - simple, robust for symmetric distributions
'bca' - bias-corrected and accelerated, more accurate
"""
boot_stats = np.empty(n_boot)
n = len(data)
for i in range(n_boot):
# Resample WITH replacement — the core of bootstrap
sample = data[np.random.randint(0, n, n)]
boot_stats[i] = statistic(sample)
alpha = 1 - ci
lower = np.percentile(boot_stats, 100 * alpha/2)
upper = np.percentile(boot_stats, 100 * (1 - alpha/2))
return {
'point_estimate': statistic(data),
'ci_lower': lower,
'ci_upper': upper,
'ci_width': upper - lower,
'n_bootstrap': n_boot,
'confidence_level': ci
}
# ─── Use cases ─────────────────────────────────────────────────
np.random.seed(42)
revenue = np.random.lognormal(5, 1.5, 1000) # realistic revenue data
# 1. CI for median (no closed-form formula)
median_ci = bootstrap_ci(revenue, np.median)
print(f"Median revenue: ${median_ci['point_estimate']:.0f}")
print(f"95% CI: (${median_ci['ci_lower']:.0f}, ${median_ci['ci_upper']:.0f})")
# 2. CI for 90th percentile (extremely skewed, bootstrap essential)
p90_ci = bootstrap_ci(revenue, lambda x: np.percentile(x, 90))
print(f"\n90th percentile revenue: ${p90_ci['point_estimate']:.0f}")
print(f"95% CI: (${p90_ci['ci_lower']:.0f}, ${p90_ci['ci_upper']:.0f})")
# 3. CI for Gini coefficient (inequality measure)
# No simple formula exists — bootstrap is the only practical option
def gini(x):
n = len(x)
s = np.sort(x)
return (2 * np.sum(np.arange(1, n+1) * s) / (n * s.sum())) - (n+1)/n
gini_ci = bootstrap_ci(revenue, gini)
print(f"\nGini coefficient: {gini_ci['point_estimate']:.3f}")
print(f"95% CI: ({gini_ci['ci_lower']:.3f}, {gini_ci['ci_upper']:.3f})")
- You have purchase amount data that's right-skewed (most users spend $0–50, a few spend $500+). You want to compare average purchase amounts between two versions of a pricing page. What test would
choose_and_run_test()select and why? What would you do differently? - An A/B test on a checkout page with baseline conversion 1.5% needs to detect a 0.3pp lift. Calculate the required sample size using the code above. If your site gets 5,000 daily visitors (split equally), how many days must the test run?
- A product dashboard monitors 50 metrics. After a feature launch, 4 metrics show p < 0.05. After applying BH correction, only 1 remains significant. What does this tell you about the launch? What's the next step?
The Cutting Edge of Statistical Practice
4.1 Sequential Testing & Always-Valid Inference
Classical A/B testing requires you to fix sample size in advance and look at results only once. In practice, businesses peek at results continuously — which inflates Type I error dramatically. Sequential testing solves this by designing tests that are valid at any stopping time.
What This Means for Your Work
- Replace fixed-horizon tests with sequential tests (libraries:
scipy.statsupcoming, orstreamlit-multivariate-testing) - If you must peek: pre-register the peek times and use Bonferroni correction for the number of peeks
- Bayesian testing is naturally sequential — posterior probability updates are always valid to examine
4.2 The p-value Reform Movement
In 2019, a letter signed by over 800 statisticians in Nature called for retiring the concept of "statistical significance." This is not fringe — it reflects a genuine crisis in how statistics is practiced.
4.3 Conformal Prediction — Distribution-Free Uncertainty
Classical confidence intervals make distributional assumptions (usually normality). Conformal prediction produces valid prediction intervals with no distributional assumptions whatsoever — intervals that are guaranteed to cover the true value with at least (1-α) probability, regardless of the data distribution.
Most prediction intervals in business analytics assume the errors are normally distributed — they're not (revenue, clicks, durations are all non-normal). Conformal prediction gives you valid intervals for any model on any data. The Python library MAPIE implements this and wraps any sklearn-compatible model. This is rapidly becoming the standard for uncertainty quantification in production ML.
4.4 Bootstrap Variants & Modern Resampling
4.5 Open Research Problems
These are areas where the current methods are known to be inadequate and active research is ongoing:
- Network interference in A/B tests: When units (users) influence each other through a network, the independence assumption fails. Current solutions (cluster randomization, ego-network experiments) are imperfect. Active research at LinkedIn, Twitter, Meta.
- Long-term effects: Most A/B tests run for 1-2 weeks and measure short-term metrics. The correlation between short-term and long-term effects is often low or negative. No consensus solution exists.
- Heterogeneous treatment effects: Average treatment effects hide that different user segments respond differently. CATE (Conditional Average Treatment Effect) estimation is a growing area using causal forests and meta-learners.
- Valid inference after model selection: Using the same data to select a model and estimate its performance is known to produce over-confident estimates, but the right correction is still debated.
- Bandit vs A/B testing: Bayesian multi-armed bandits adapt assignment probabilities as evidence accumulates, but their frequentist properties are not always well understood. When to use bandits vs fixed A/B tests is not fully settled.
4.6 Key Tools & Libraries to Know in 2026
| Library | Purpose | When to Use |
|---|---|---|
| scipy.stats | Core statistical tests | t-tests, KS-tests, distribution fitting — your daily workhorse |
| statsmodels | Regression, time series, stats models | OLS, logistic regression, ARIMA, survival analysis |
| pingouin | Statistical tests with effect sizes | ANOVA, pairwise tests — reports effect sizes automatically |
| MAPIE | Conformal prediction | Valid prediction intervals for any sklearn model |
| bootstrapped | Facebook's bootstrap library | A/B test bootstrap CIs for business metrics |
| PyMC | Bayesian modeling | Hierarchical models, Bayesian A/B testing (Ch 8) |
| causalml | Causal inference & HTE | Uplift modeling, CATE estimation, meta-learners (Ch 7) |
- Your company runs 200 A/B tests per year, each at α=0.05 and 80% power. Assume 20% of the tested hypotheses are actually true. How many false positives and false negatives do you expect per year? What does this imply about your product decisions?
- A colleague reports: "Our new feature significantly increased DAU (p=0.038, n=50,000)." What three questions would you ask before acting on this finding?
- You need a 95% CI for the 95th percentile of customer lifetime value. Why can't you use a normal approximation, and how would you compute it correctly?
- Read the abstract of "Always Valid Inference" (Johari et al., 2022). In one paragraph, explain why classical t-tests become invalid when you peek at results, and how anytime-valid inference fixes this.