Chapter 01 Part I · Foundations

Statistical Foundations
& Probability Theory

The bedrock of everything. Most analytics errors — wrong conclusions, over-confident predictions, misleading dashboards — trace back to misunderstood statistics. This chapter builds an unshakeable foundation.

📖 ~60 min reading

💻 12 code examples

🔬 6 research papers

🧠 4 core concepts

Why Statistics Is Not What You Think It Is

Most people learn statistics backwards. They learn formulas first, intuition last — if ever. The result is analysts who can run a t-test but can't tell you what it actually means. This chapter fixes that.

Statistics is not a collection of tests. It is a framework for reasoning under uncertainty. Every statistical procedure answers the same fundamental question: "Given that I've only seen a sample of reality, what can I conclude about all of reality?"

🔑 The Fundamental Problem of Statistics

You want to know something about a population (all customers, all users, all possible outcomes), but you can only observe a sample (the data you actually have). Statistics is the machinery for making rigorous claims about populations from samples — while being honest about the uncertainty that gap creates.

1.1 Probability — The Language of Uncertainty

Two Interpretations That Lead to Different Statistics

There are two fundamentally different ways to interpret probability, and they lead to entirely different statistical frameworks. Understanding the difference is not academic — it determines how you should analyze data.

Frequentist Probability

Definition: Long-run frequency of an event over infinite repetitions
P(heads) = 0.5 means: flip this coin infinitely, ~50% are heads
Parameters are fixed (unknown but not random)
Data is random; parameters are not
Gives us: p-values, confidence intervals, hypothesis tests
Problem: Can't make probability statements about parameters themselves

Bayesian Probability

Definition: Degree of belief or plausibility of a statement
P(heads) = 0.5 means: I believe heads and tails are equally plausible
Parameters are random variables with distributions
Both data and parameters are random
Gives us: posterior distributions, credible intervals, model comparison
Advantage: Can say "there's a 94% probability the effect is positive"

💡 Mental Model — The Weather Forecast

A frequentist meteorologist says: "70% chance of rain" means that on days with conditions like today, it rains 70% of the time historically. The probability is about the repeatability of the setup.

A Bayesian meteorologist says: "70% chance of rain" means I have 70% confidence it will rain today, given all the evidence I've incorporated (satellite data, models, past patterns). The probability is about belief, updated by evidence.

Practical takeaway: Most business analytics lives in frequentist-land (p-values, confidence intervals). Most modern machine learning lives in Bayesian-land (posterior distributions, uncertainty quantification). You need fluency in both.

The Probability Axioms — The Foundation of Everything

All of probability theory rests on three axioms introduced by Kolmogorov in 1933. These are not empirical observations — they are definitions from which everything else is derived.

Kolmogorov's Three Axioms 1. Non-negativity: P(A) ≥ 0 for any event A 2. Normalization: P(Ω) = 1 (something always happens) 3. Additivity: P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅ (mutually exclusive events: their probabilities simply add)

From these three axioms, every probability rule follows — the complement rule, conditional probability, Bayes' theorem, the law of total probability. Understanding this hierarchy means you can always derive what you need rather than memorizing formulas.

Conditional Probability & Bayes' Theorem

Conditional probability is how we update beliefs when new evidence arrives. It is the engine of Bayesian reasoning and the mathematical foundation of most machine learning.

Conditional Probability P(A | B) = P(A ∩ B) / P(B) "The probability of A given that B has occurred" Bayes' Theorem (derived from conditional probability) P(θ | data) = P(data | θ) × P(θ) / P(data) posterior likelihood prior evidence

📌 Why Bayes' Theorem Is The Most Important Formula in Analytics

Bayes' theorem tells us how to rationally update beliefs when we see evidence. In analytics: θ is whatever you want to know (conversion rate, effect size, whether a customer will churn), data is what you've observed, prior is what you believed before seeing the data, and posterior is your updated belief. Every time you build a model, run an experiment, or form a business conclusion, you are — whether you know it or not — doing Bayesian reasoning.

1.2 Distributions — The Shape of Reality

A probability distribution describes all possible values a random variable can take and how likely each is. Choosing the right distribution to model your data is one of the most important and most neglected skills in analytics.

Discrete Distributions You Must Know

Distribution	Parameters	When to Use	Real Example
Bernoulli	`p`	Single binary outcome	Did a user click? Did a customer churn this month?
Binomial	`n, p`	Count of successes in n trials	Number of sales in 100 calls; number of defects in a batch
Poisson	`λ`	Count of rare events per unit time/space	Customer support tickets per hour; server errors per day
Geometric	`p`	Number of trials until first success	How many emails until first reply; retries until connection
Negative Binomial	`r, p`	Overdispersed count data (variance > mean)	Purchases per customer (most buy 0, few buy many)

Continuous Distributions You Must Know

Distribution	Parameters	When to Use	Real Example
Normal (Gaussian)	`μ, σ`	Sums of independent variables, measurement error	Heights, IQ scores, prediction errors after log transform
Log-Normal	`μ, σ`	Positive data with right skew; multiplicative processes	Income, website session duration, stock prices, latency
Exponential	`λ`	Time between Poisson events	Time between purchases; time to failure; customer lifetime
Beta	`α, β`	Probabilities and proportions (0 to 1)	Conversion rate uncertainty; click-through rate priors
Gamma	`α, β`	Positive skewed data, waiting times	Insurance claim amounts; time to complete a task
Pareto (Power Law)	`α, x_m`	Heavy-tailed phenomena, "80-20 rule" data	Revenue by customer; social media followers; city sizes

⚠️ The Normal Distribution Trap

The most common mistake in analytics is assuming everything is normally distributed. Revenue, session duration, purchase amounts, customer lifetime value, social engagement — virtually all business metrics are log-normal, exponential, or power-law distributed. When you apply techniques that assume normality to these variables (like standard means and CIs for revenue), you get systematically wrong answers.

Rule of thumb: If a metric can't be negative and has a long right tail, it's probably log-normal. Take the log, analyze that, then exponentiate back.

1.3 The Central Limit Theorem — Why Normal Appears Everywhere

The Central Limit Theorem (CLT) is arguably the most important theorem in all of statistics. It explains why the normal distribution appears so often and justifies most frequentist inferential procedures.

Central Limit Theorem (informal statement) If you take a large enough sample from ANY distribution (with finite mean μ and variance σ²), the distribution of sample means approaches: x̄ ~ Normal(μ, σ²/n) regardless of what distribution the original data has. "Large enough" is typically n ≥ 30 for mild skew, n ≥ 100-300 for heavy-tailed distributions.

🧠 Why This Is Profound

You're measuring something that follows a bizarre, jagged, unknown distribution. You don't know its true mean μ. You take samples of size n=100 repeatedly and compute each sample's mean. Those sample means — despite coming from a weird distribution — will form a normal distribution centered at the true μ. This is what lets us construct confidence intervals and run hypothesis tests without knowing the underlying distribution of our data.

The catch: The CLT applies to means and sums. Not to the individual observations, not to percentiles, not to maxima. A common mistake is applying CLT-based tests to raw skewed data rather than to means.

1.4 Hypothesis Testing — The Most Misused Tool in Analytics

Hypothesis testing is the procedure for deciding, based on data, whether an observed effect is "real" or could plausibly be explained by random chance. It is also the most abused tool in all of analytics, responsible for a replication crisis across multiple fields.

The Formal Framework

State the null hypothesis H₀

H₀ is the "boring" hypothesis — no effect, no difference, nothing interesting. Example: "The new checkout flow has no effect on conversion rate vs the old one." Always try to reject H₀, never accept it.

State the alternative hypothesis H₁

What you believe is actually true. "The new checkout flow has a different conversion rate." Can be one-tailed (direction specified) or two-tailed (either direction).

Choose significance level α before looking at data

α is the false positive rate you're willing to accept. Convention is α=0.05 (5% chance of claiming an effect that doesn't exist). This must be chosen before you see the data — otherwise you're p-hacking.

Compute the test statistic and p-value

The test statistic measures how far your observed data is from what H₀ predicts, in units of standard error. The p-value is the probability of seeing data this extreme (or more) if H₀ were true.

Make the decision

If p < α: reject H₀ (result is "statistically significant"). If p ≥ α: fail to reject H₀. Note: you never "accept" H₀ — absence of evidence is not evidence of absence.

What the p-value Actually Means

⚠️ The Most Misunderstood Concept in All of Analytics

A p-value is NOT:

The probability that H₀ is true
The probability that your result is due to chance
The probability that you're wrong
A measure of the size or importance of an effect

A p-value IS: P(seeing data this extreme | H₀ is true). It assumes H₀ is true and asks how surprising your data would be under that assumption. A p-value of 0.03 means: "If there truly were no effect, there would be only a 3% chance of seeing an effect this large in a study of this size."

Type I & Type II Errors — The Error Tradeoff

Decision Matrix: Reality: H₀ True Reality: H₁ True ┌──────────────────┬──────────────────┐ Decision: Reject H₀ │ Type I Error │ Correct! │ (claim effect exists) │ False Positive │ True Positive │ │ P = α │ P = Power (1-β) │ ├──────────────────┼──────────────────┤ Decision: Fail to │ Correct! │ Type II Error │ reject H₀ │ True Negative │ False Negative │ (no effect found) │ P = 1 - α │ P = β │ └──────────────────┴──────────────────┘ Key insight: Reducing α (e.g. 0.05 → 0.01) reduces Type I errors but INCREASES Type II errors (you miss real effects). The only way to reduce both simultaneously: increase sample size.

Statistical Power — The Forgotten Half of Testing

Power is the probability of detecting a real effect when one exists (1 - β). It depends on three things: effect size, sample size, and α. Under-powered studies are a major source of irreproducible results.

Power depends on these factors Power = f(effect_size, sample_size, α) Effect size ↑ → Power ↑ (bigger effects are easier to detect) Sample size ↑ → Power ↑ (more data → more signal) α ↑ → Power ↑ (but also more false positives) Standard practice: design studies with Power ≥ 0.80 (80% chance of detecting a real effect of the assumed size)

1.5 Confidence Intervals — Better Than p-values

A 95% confidence interval (CI) contains the true parameter value in 95% of intervals computed using this procedure. Not that there's a 95% chance the specific interval you computed contains the true value — that's the Bayesian credible interval.

💡 The Fishing Net Analogy

Imagine you build nets that catch fish 95% of the time. You cast your net and either catch or don't catch the fish (the true parameter). Any specific cast either catches it or doesn't — there's no probability about this cast. But the procedure (the type of net) is reliable 95% of the time. That's what "95% confidence" means — it's a statement about the procedure, not about your specific interval.

The Bayesian alternative — a credible interval — does give you what you intuitively want: "there's a 95% probability the true value is in this range."

1.6 Effect Size — What Actually Matters Beyond Significance

Statistical significance tells you whether an effect is distinguishable from zero. Effect size tells you whether it's large enough to matter. With enough data, every trivial difference becomes statistically significant. Always report effect sizes.

Effect Size Measure	Formula	Use Case	Small / Medium / Large
Cohen's d	`(μ₁ - μ₂) / σ_pooled`	Comparing two group means	0.2 / 0.5 / 0.8
Pearson's r	`Cov(X,Y) / (σ_X × σ_Y)`	Linear correlation strength	0.1 / 0.3 / 0.5
η² (Eta-squared)	`SS_effect / SS_total`	ANOVA: variance explained	0.01 / 0.06 / 0.14
Odds Ratio	`(a/b) / (c/d)`	Binary outcomes, 2×2 tables	Context-dependent
Relative Risk	`P(outcome\|exposed) / P(outcome\|unexposed)`	Clinical/business experiments	Context-dependent

1.7 Multiple Testing — The Silent Killer of Analytics

If you run 20 hypothesis tests with α=0.05, you expect one false positive purely by chance, even if no effects are real. This is the multiple testing problem, and it is rampant in business analytics — dashboards with 50 metrics, A/B tests that look at many segments, exploratory analyses that test dozens of features.

Family-Wise Error Rate P(at least one false positive) = 1 - (1-α)^m where m = number of tests 20 tests at α=0.05: P(at least one FP) = 1-(0.95)^20 = 64% 100 tests at α=0.05: P(at least one FP) = 99.4%

Corrections for Multiple Testing

Method	Controls	How	When to Use
Bonferroni	FWER	α_adj = α/m	Few tests, need strict control (e.g. safety trials)
Holm-Bonferroni	FWER	Sequential Bonferroni	Slightly more powerful than Bonferroni
Benjamini-Hochberg	FDR	Sort p-values, adjust threshold	Many tests, exploratory analysis, genomics
Benjamini-Yekutieli	FDR	BH with dependence correction	When tests are correlated (e.g. correlated features)

✅ Practical Rule

In business analytics, use Benjamini-Hochberg (BH) by default for any analysis involving multiple tests. It controls the False Discovery Rate (FDR) — the expected proportion of your "significant" findings that are actually false positives. Setting FDR at 10% means that among your significant results, you expect 10% to be wrong — usually acceptable for business decisions.

1.8 The Replication Crisis and What It Means for Analytics

Between 2011 and 2020, large-scale replication attempts found that fewer than 50% of published psychology findings and fewer than 60% of cancer biology findings could be successfully replicated. The causes are relevant to every analyst:

P-hacking: Running many analyses and reporting only p < 0.05 results
HARKing: Hypothesizing After Results are Known — treating exploratory findings as confirmatory
Small samples: Underpowered studies produce inflated effect size estimates even when they get lucky and find p < 0.05
Publication bias: Only significant results get published (or shared in business settings, acted on)
Researcher degrees of freedom: Many undisclosed choices in analysis (which outliers to remove, which covariates to include) inflate Type I error

🔑 What to Do Instead

Pre-register your analyses: Write down your hypothesis, sample size, test statistic, and α before collecting data. Report effect sizes and CIs, not just p-values. Distinguish exploration from confirmation: EDA and hypothesis generation must be separated from hypothesis testing on the same data. Use holdout sets: Any pattern found on training data must be validated on data the analysis hasn't seen.

Industry Applications of Statistical Foundations

2.1 E-Commerce — Conversion Rate Optimization

An e-commerce company wants to test whether a new checkout button color improves conversion rate from 3.2% to 3.5% or better.

Why statistics matters here, precisely:

With 10,000 visitors, a 0.3pp difference is indistinguishable from noise — a test this size has less than 20% power to detect it
Running the test for only 2 days and peeking at results inflates false positive rate from 5% to >40%
Testing 10 button colors simultaneously requires multiple testing correction or you'll ship a "winner" that was random
Conversion rate is Binomial(n, p) distributed — CLT applies well for large n, but not for rare events (e.g. checkout completion = 0.2%)

✅ Real Approach Used by Netflix, Airbnb, Booking.com

These companies use sequential testing (see Research section) instead of fixed-horizon tests. This allows valid early stopping when an effect is large, without inflating Type I error. They also use CUPED (Controlled-experiment Using Pre-Experiment Data) to reduce variance by controlling for pre-experiment behavior, increasing power by 30-50% at no extra sample cost.

2.2 Healthcare Analytics — Drug Efficacy Studies

A pharmaceutical company analyzes whether a new drug reduces recovery time. This is the domain where misunderstood statistics has the highest stakes.

Critical statistical choices:

Primary endpoint pre-registration: Must specify "reduction in hospital stay ≥ 2 days" before seeing data
Intention-to-treat analysis: Include all randomized patients, even dropouts, to avoid selection bias
Adaptive trial design: Modern trials use Bayesian adaptive designs to update sample allocation as evidence accumulates
NNT (Number Needed to Treat): More interpretable than p-values — "treat 8 patients to prevent 1 additional death"

2.3 Financial Services — Risk Modeling

A bank models the probability of loan default to set credit limits and pricing.

Distribution selection is critical:

Default rates are Beta-distributed (bounded 0-1), not normally distributed
Loss amounts follow power-law or extreme value distributions — the tails are everything for risk
VaR (Value at Risk) at 99% relies on tail behavior, which normal distribution assumptions severely underestimate (see 2008 financial crisis)
Copulas are used to model joint distributions of correlated defaults — ignoring correlation was a primary cause of CDO mispricing in 2007

2.4 Technology — Product Analytics

A SaaS company monitors 150 product metrics on a weekly dashboard and wants to detect meaningful changes.

The multiple testing trap in action:

With 150 metrics at α=0.05, expect 7-8 false alarms per week even if nothing changed
Most companies' dashboards with "red/green" alerts are generating massive false positive noise
Solution: Benjamini-Hochberg correction across all monitored metrics, or Bayesian anomaly detection that gives posterior probability rather than binary alert

2.5 Manufacturing — Quality Control

A factory monitors product dimensions to detect when a machine drifts out of calibration.

Classic application of control charts (Shewhart charts):

UCL/LCL (Upper/Lower Control Limits) are set at μ ± 3σ — based on CLT, only 0.27% of in-control samples fall outside these limits
But this assumes measurements are normal and independent — both fail for autocorrelated process data (use ARIMA-based SPC instead)
CUSUM charts detect small persistent shifts faster than Shewhart charts by accumulating evidence over time

When NOT to Use Classic Frequentist Statistics

⚠️ When Frequentist Hypothesis Testing Fails

Observational data: p-values from observational studies do not establish causation — confounders invalidate the inference
Peeking at ongoing experiments: Sequential testing requires corrections — vanilla t-tests on peeked data inflate Type I error dramatically
Very large samples: Everything becomes significant — effect size matters more than significance
Non-exchangeable units: Users in a social network influence each other — independence assumption fails (use cluster-based variance estimation)
Rare events: Binomial test breaks down for p < 0.01 — use exact tests or Poisson approximations

Building a Statistical Analysis Toolkit from Scratch

3.1 Fitting & Diagnosing Distributions

The first step in any analysis is understanding what distribution your data actually follows. Never assume — fit and test.

Python — Distribution fitting and visualization

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
from scipy.stats import (norm, lognorm, expon, gamma,
                          kstest, anderson, shapiro)

# ─── Generate sample business data (replace with your data) ───
np.random.seed(42)
# Revenue data: highly right-skewed, like real purchase amounts
revenue_data = np.random.lognormal(mean=4.5, sigma=1.2, size=1000)

def fit_and_compare_distributions(data, distributions=None):
    """
    Fit multiple distributions to data and rank them by goodness-of-fit.
    
    Uses Kolmogorov-Smirnov test: measures maximum distance between
    empirical CDF and fitted theoretical CDF.
    Smaller KS statistic = better fit.
    """
    if distributions is None:
        distributions = [
            ('Normal',      norm),
            ('Log-Normal',   lognorm),
            ('Exponential',  expon),
            ('Gamma',        gamma),
        ]

    results = []
    for name, dist in distributions:
        # Fit distribution — MLE by default
        params = dist.fit(data)
        
        # KS test: H0 = data comes from this distribution
        ks_stat, p_value = kstest(data, dist.cdf, args=params)
        
        # AIC for model selection (penalizes extra parameters)
        log_likelihood = np.sum(dist.logpdf(data, *params))
        k = len(params)
        aic = 2*k - 2*log_likelihood
        
        results.append({
            'distribution': name,
            'params':        params,
            'ks_statistic':  ks_stat,
            'p_value':       p_value,
            'aic':           aic,
            'fits_well':     p_value > 0.05  # fail to reject = good fit
        })

    # Sort by AIC (lower = better)
    df = pd.DataFrame(results).sort_values('aic')
    return df

result_df = fit_and_compare_distributions(revenue_data)
print(result_df[['distribution', 'ks_statistic', 'p_value', 'aic', 'fits_well']])
# Expected output:
# distribution  ks_statistic  p_value         aic    fits_well
# Log-Normal    0.018         0.89            8823   True      ← best fit
# Gamma         0.031         0.28            8967   True
# Exponential   0.118         0.000           9801   False
# Normal        0.209         0.000           10234  False     ← terrible fit!

def plot_distribution_diagnostics(data, dist_name='lognorm'):
    """4-panel diagnostic plot for distribution assessment."""
    dist = getattr(stats, dist_name)
    params = dist.fit(data)
    
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    fig.suptitle(f'Distribution Diagnostics: {dist_name}', fontsize=14)
    
    # 1. Histogram with fitted PDF
    ax = axes[0,0]
    ax.hist(data, bins=50, density=True, alpha=0.6, color='steelblue', label='Data')
    x = np.linspace(data.min(), data.max(), 200)
    ax.plot(x, dist.pdf(x, *params), 'r-', lw=2, label=f'Fitted {dist_name}')
    ax.set_title('Histogram + Fitted PDF'); ax.legend()
    
    # 2. QQ Plot — if points fall on diagonal, distribution fits well
    ax = axes[0,1]
    stats.probplot(data, dist=dist, sparams=params[:-2], plot=ax)
    ax.set_title('Q-Q Plot (points on line = good fit)')
    
    # 3. Empirical vs Theoretical CDF
    ax = axes[1,0]
    sorted_data = np.sort(data)
    empirical_cdf = np.arange(1, len(data)+1) / len(data)
    theoretical_cdf = dist.cdf(sorted_data, *params)
    ax.plot(sorted_data, empirical_cdf, label='Empirical CDF')
    ax.plot(sorted_data, theoretical_cdf, 'r--', label='Theoretical CDF')
    ax.set_title('CDF Comparison'); ax.legend()
    
    # 4. Log scale histogram (reveals tail behavior)
    ax = axes[1,1]
    ax.hist(np.log(data[data > 0]), bins=50, density=True,
            color='purple', alpha=0.6)
    ax.set_title('Log-transformed Data (should look normal for lognormal)')
    
    plt.tight_layout()
    return fig

3.2 Hypothesis Testing — The Complete Toolkit

Python — Choosing and running the right test

import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import (ttest_ind, ttest_rel, mannwhitneyu,
                          chi2_contingency, fisher_exact,
                          levene, bartlett, shapiro)
import warnings

def choose_and_run_test(group_a: np.ndarray, group_b: np.ndarray,
                         paired: bool = False,
                         alpha: float = 0.05) -> dict:
    """
    Automatically selects the appropriate two-group comparison test
    based on data properties, then runs it.
    
    Decision tree:
    1. Are groups paired? → Paired t-test or Wilcoxon signed-rank
    2. Are groups normally distributed? (Shapiro-Wilk)
    3. Are variances equal? (Levene's test)
    4. Based on above: select t-test variant or Mann-Whitney U
    """
    result = {'test_chosen': None, 'statistic': None,
              'p_value': None, 'reject_h0': None,
              'effect_size': None, 'notes': []}
    
    n_a, n_b = len(group_a), len(group_b)
    
    # Step 1: Check normality (only reliable for n < 5000)
    if n_a < 5000 and n_b < 5000:
        _, p_norm_a = shapiro(group_a[:5000])
        _, p_norm_b = shapiro(group_b[:5000])
        is_normal = (p_norm_a > 0.05) and (p_norm_b > 0.05)
    else:
        # CLT kicks in for large n, normality assumption holds
        is_normal = True
        result['notes'].append('Large n: invoking CLT for normality')
    
    # Step 2: Check equal variances (Levene's is robust to non-normality)
    _, p_var = levene(group_a, group_b)
    equal_var = p_var > 0.05
    
    # Step 3: Choose test
    if paired:
        if is_normal:
            stat, p = ttest_rel(group_a, group_b)
            result['test_chosen'] = 'Paired t-test'
        else:
            stat, p = stats.wilcoxon(group_a, group_b)
            result['test_chosen'] = 'Wilcoxon Signed-Rank'
    elif is_normal:
        if equal_var:
            stat, p = ttest_ind(group_a, group_b, equal_var=True)
            result['test_chosen'] = "Student's t-test"
        else:
            # Welch's t-test: does NOT assume equal variance
            stat, p = ttest_ind(group_a, group_b, equal_var=False)
            result['test_chosen'] = "Welch's t-test (unequal variance)"
    else:
        # Non-parametric alternative: tests median difference
        stat, p = mannwhitneyu(group_a, group_b, alternative='two-sided')
        result['test_chosen'] = 'Mann-Whitney U (non-parametric)'
    
    # Cohen's d effect size (for t-tests)
    pooled_std = np.sqrt((np.std(group_a, ddof=1)**2 +
                          np.std(group_b, ddof=1)**2) / 2)
    cohens_d = (np.mean(group_a) - np.mean(group_b)) / pooled_std
    
    result.update({
        'statistic':   stat,
        'p_value':     p,
        'reject_h0':   p < alpha,
        'effect_size': cohens_d,
        'effect_interpretation': (
            'negligible' if abs(cohens_d) < 0.2 else
            'small'      if abs(cohens_d) < 0.5 else
            'medium'     if abs(cohens_d) < 0.8 else
            'large'
        ),
        'equal_variance': equal_var,
        'normal_data':    is_normal,
    })
    return result

# ─── Example: A/B test on conversion value ────────────────────
np.random.seed(42)
control   = np.random.lognormal(4.0, 1.0, 500)  # control group
treatment = np.random.lognormal(4.1, 1.0, 500)  # 10% higher mean

test_result = choose_and_run_test(control, treatment)
for k, v in test_result.items():
    if v is not None: print(f"  {k}: {v}")

# Output:
#   test_chosen: Welch's t-test (unequal variance)
#   p_value: 0.031
#   reject_h0: True
#   effect_size: 0.138
#   effect_interpretation: small
#
# Conclusion: Statistically significant (p=0.031) but SMALL effect.
# The new treatment adds ~10% to revenue BUT the practical question is:
# Is a Cohen's d of 0.138 worth the implementation cost?

3.3 Power Analysis — Design Your Study Right

Python — Sample size calculator and power curves

from statsmodels.stats.power import TTestIndPower, NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize
import numpy as np
import matplotlib.pyplot as plt

def ab_test_sample_size(
    baseline_rate:   float,   # current conversion rate, e.g. 0.05
    minimum_effect:  float,   # smallest effect worth detecting, e.g. 0.01
    alpha:           float = 0.05,
    power:           float = 0.80,
    two_tailed:      bool  = True
) -> dict:
    """
    Calculate required sample size for a proportion A/B test.
    
    This is the FIRST thing you should do before running any experiment.
    Running without power analysis is how you waste months on
    underpowered tests that give you garbage results.
    """
    treatment_rate = baseline_rate + minimum_effect
    
    # Cohen's h for proportions
    effect_size = proportion_effectsize(treatment_rate, baseline_rate)
    
    analysis = NormalIndPower()
    n_per_group = analysis.solve_power(
        effect_size = effect_size,
        power       = power,
        alpha       = alpha,
        alternative = 'two-sided' if two_tailed else 'larger'
    )
    n_per_group = int(np.ceil(n_per_group))
    
    return {
        'n_per_group':    n_per_group,
        'total_n':        n_per_group * 2,
        'effect_size_h':  effect_size,
        'baseline':       baseline_rate,
        'treatment':      treatment_rate,
        'relative_lift':  minimum_effect / baseline_rate,
        'days_needed':    None  # fill in: n / daily_traffic
    }

# ─── Real scenarios ────────────────────────────────────────────
scenarios = [
    ("Checkout CTA color",  0.032, 0.003),  # 3.2% → 3.5%
    ("Pricing page layout", 0.050, 0.010),  # 5.0% → 6.0%
    ("Email subject line",  0.220, 0.020),  # 22% → 24% open rate
]

print(f"{'Experiment':<25} {'N/group':>8} {'Total N':>8} {'Relative lift':>14}")
print("-"*60)
for name, base, lift in scenarios:
    r = ab_test_sample_size(base, lift)
    print(f"{name:<25} {r['n_per_group']:>8,} {r['total_n']:>8,} {r['relative_lift']:>13.1%}")

# Output:
# Experiment                 N/group  Total N  Relative lift
# ------------------------------------------------------------
# Checkout CTA color          29,140   58,280          9.4%
# Pricing page layout          4,414    8,828         20.0%
# Email subject line           3,055    6,110          9.1%
#
# Key insight: Detecting small absolute lifts on low base rates
# requires MASSIVE sample sizes. Most companies run A/B tests
# 10-50x underpowered. Their "winners" are mostly noise.

# ─── Power curve: visualize power across sample sizes ─────────
def plot_power_curve(effect_sizes=[0.1, 0.2, 0.3, 0.5]):
    analysis = TTestIndPower()
    n_range = np.arange(10, 500, 10)
    
    plt.figure(figsize=(10, 6))
    for d in effect_sizes:
        powers = [analysis.power(d, n, alpha=0.05) for n in n_range]
        plt.plot(n_range, powers, label=f"Cohen's d = {d}", lw=2)
    
    plt.axhline(0.8, color='red', linestyle='--', label='80% power threshold')
    plt.xlabel('Sample size per group'); plt.ylabel('Statistical Power')
    plt.title('Power Curves: How much data do you need?')
    plt.legend(); plt.grid(alpha=0.3)
    return plt.gcf()

3.4 Multiple Testing Correction — Protecting Against False Discoveries

Python — Benjamini-Hochberg FDR correction

import numpy as np
import pandas as pd
from statsmodels.stats.multitest import multipletests

def analyze_multiple_metrics(
    p_values:    list,
    metric_names: list,
    alpha:       float = 0.05,
    method:      str   = 'fdr_bh'  # Benjamini-Hochberg FDR
) -> pd.DataFrame:
    """
    Apply multiple testing correction to a batch of p-values.
    
    Methods available:
    'bonferroni'  - FWER control, very conservative
    'holm'        - FWER, step-down, less conservative than Bonferroni
    'fdr_bh'      - Benjamini-Hochberg FDR, best for exploratory analysis
    'fdr_by'      - Benjamini-Yekutieli, FDR with dependence correction
    """
    reject, p_corrected, _, _ = multipletests(
        p_values, alpha=alpha, method=method
    )
    
    df = pd.DataFrame({
        'metric':      metric_names,
        'p_raw':       p_values,
        'p_corrected': p_corrected,
        'significant': reject
    }).sort_values('p_raw')
    
    df['p_raw']       = df['p_raw'].round(4)
    df['p_corrected'] = df['p_corrected'].round(4)
    return df

# ─── Simulate a product dashboard with 20 metrics ─────────────
np.random.seed(42)
# 18 null effects + 2 real effects
p_values = (
    list(np.random.uniform(0, 1, 18))  +  # noise
    [0.003, 0.018]                        # real signals
)
metric_names = [f"metric_{i}" for i in range(18)] + ["dau", "revenue_per_user"]

# Without correction: naive α=0.05
naive_significant = sum(1 for p in p_values if p < 0.05)

# With BH correction
result = analyze_multiple_metrics(p_values, metric_names)
bh_significant = result['significant'].sum()

print(f"Without correction: {naive_significant} 'significant' results")
print(f"With BH correction: {bh_significant} significant results")
print("\nTop findings after correction:")
print(result[result['significant']][['metric','p_raw','p_corrected']])

# Output:
# Without correction: 3 'significant' results  ← 1 false positive!
# With BH correction: 2 significant results    ← correct!
#
# metric             p_raw  p_corrected
# revenue_per_user   0.003        0.060  ← still significant
# dau                0.018        0.180  ← still significant
#
# The correction eliminates the noise and keeps the real signals.

3.5 The Bootstrap — When Formulas Don't Exist

The bootstrap is perhaps the most powerful and underused tool in the statistician's toolkit. When the distribution of a statistic is unknown or hard to derive analytically (median, correlation, custom metrics), the bootstrap estimates it empirically.

Python — Bootstrap for confidence intervals on any statistic

import numpy as np
from scipy.stats import bootstrap as scipy_bootstrap

def bootstrap_ci(
    data:      np.ndarray,
    statistic: callable,
    n_boot:    int   = 10_000,
    ci:        float = 0.95,
    method:    str   = 'percentile'
) -> dict:
    """
    Bootstrap confidence interval for ANY statistic.
    
    Works for: mean, median, std, Gini coefficient,
    AUC, custom revenue metrics — anything you can compute.
    
    method options:
    'percentile'  - simple, robust for symmetric distributions
    'bca'         - bias-corrected and accelerated, more accurate
    """
    boot_stats = np.empty(n_boot)
    n = len(data)
    
    for i in range(n_boot):
        # Resample WITH replacement — the core of bootstrap
        sample = data[np.random.randint(0, n, n)]
        boot_stats[i] = statistic(sample)
    
    alpha = 1 - ci
    lower = np.percentile(boot_stats, 100 * alpha/2)
    upper = np.percentile(boot_stats, 100 * (1 - alpha/2))
    
    return {
        'point_estimate':  statistic(data),
        'ci_lower':        lower,
        'ci_upper':        upper,
        'ci_width':        upper - lower,
        'n_bootstrap':     n_boot,
        'confidence_level': ci
    }

# ─── Use cases ─────────────────────────────────────────────────
np.random.seed(42)
revenue = np.random.lognormal(5, 1.5, 1000)  # realistic revenue data

# 1. CI for median (no closed-form formula)
median_ci = bootstrap_ci(revenue, np.median)
print(f"Median revenue: ${median_ci['point_estimate']:.0f}")
print(f"95% CI: (${median_ci['ci_lower']:.0f}, ${median_ci['ci_upper']:.0f})")

# 2. CI for 90th percentile (extremely skewed, bootstrap essential)
p90_ci = bootstrap_ci(revenue, lambda x: np.percentile(x, 90))
print(f"\n90th percentile revenue: ${p90_ci['point_estimate']:.0f}")
print(f"95% CI: (${p90_ci['ci_lower']:.0f}, ${p90_ci['ci_upper']:.0f})")

# 3. CI for Gini coefficient (inequality measure)
# No simple formula exists — bootstrap is the only practical option
def gini(x):
    n = len(x)
    s = np.sort(x)
    return (2 * np.sum(np.arange(1, n+1) * s) / (n * s.sum())) - (n+1)/n

gini_ci = bootstrap_ci(revenue, gini)
print(f"\nGini coefficient: {gini_ci['point_estimate']:.3f}")
print(f"95% CI: ({gini_ci['ci_lower']:.3f}, {gini_ci['ci_upper']:.3f})")

🧠 Section 3 Check — Implementation

You have purchase amount data that's right-skewed (most users spend $0–50, a few spend $500+). You want to compare average purchase amounts between two versions of a pricing page. What test would choose_and_run_test() select and why? What would you do differently?
An A/B test on a checkout page with baseline conversion 1.5% needs to detect a 0.3pp lift. Calculate the required sample size using the code above. If your site gets 5,000 daily visitors (split equally), how many days must the test run?
A product dashboard monitors 50 metrics. After a feature launch, 4 metrics show p < 0.05. After applying BH correction, only 1 remains significant. What does this tell you about the launch? What's the next step?

Hint for Q1: Think about what "right-skewed" implies for the distribution, and why the log transform helps before applying parametric tests.

The Cutting Edge of Statistical Practice

4.1 Sequential Testing & Always-Valid Inference

Classical A/B testing requires you to fix sample size in advance and look at results only once. In practice, businesses peek at results continuously — which inflates Type I error dramatically. Sequential testing solves this by designing tests that are valid at any stopping time.

Always Valid Inference: Continuous Monitoring of A/B Tests

Johari et al., 2022 · Spotify/Stanford

Introduces anytime-valid p-values and confidence sequences that maintain validity regardless of when you stop a test. Used in production at Spotify. The key insight: replace fixed-horizon t-tests with e-values and martingale-based tests. Allows early stopping when an effect is large or futility stopping when an effect is clearly absent, without inflating false positive rates.

Peeking at A/B Tests: Why it matters, and what to do about it

Johari, Pekelis & Walsh, 2015 · Optimizely

Quantifies exactly how much peeking inflates false positive rates (a test peeked at every day for 30 days has a true α of ~40% not 5%). Proposes the mixture sequential probability ratio test as a solution. Directly applicable to any analytics team running A/B tests.

What This Means for Your Work

Replace fixed-horizon tests with sequential tests (libraries: scipy.stats upcoming, or streamlit-multivariate-testing)
If you must peek: pre-register the peek times and use Bonferroni correction for the number of peeks
Bayesian testing is naturally sequential — posterior probability updates are always valid to examine

4.2 The p-value Reform Movement

In 2019, a letter signed by over 800 statisticians in Nature called for retiring the concept of "statistical significance." This is not fringe — it reflects a genuine crisis in how statistics is practiced.

Retire Statistical Significance

Amrhein, Greenland & McShane, Nature, 2019 · 854 signatories

Argues that the bright-line p < 0.05 threshold causes more harm than good: it creates a false binary (significant/not), encourages p-hacking, and discards useful information. Proposes: report full confidence intervals, effect sizes, and the practical significance of findings. Stop using "significant" as a word entirely.

Moving to a World Beyond "p < 0.05"

Wasserstein, Schirm & Lazar, American Statistician, 2019

The American Statistical Association's second statement on p-values. Introduces the ATOM framework: Accept uncertainty, be Thoughtful, Open, and Modest. Recommends reporting effect sizes and CIs rather than binary significance thresholds. Highly practical for applied analytics.

4.3 Conformal Prediction — Distribution-Free Uncertainty

Classical confidence intervals make distributional assumptions (usually normality). Conformal prediction produces valid prediction intervals with no distributional assumptions whatsoever — intervals that are guaranteed to cover the true value with at least (1-α) probability, regardless of the data distribution.

A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification

Angelopoulos & Bates, 2022 · Berkeley

Accessible tutorial on conformal prediction. Shows how to wrap any ML model with valid uncertainty intervals. Key result: for any model f and any new test point, conformal prediction guarantees P(y ∈ C(x)) ≥ 1-α without any assumptions on the data distribution. Used in production at Amazon, Airbnb for demand forecasting uncertainty.

🔬 Why This Matters for Analytics

Most prediction intervals in business analytics assume the errors are normally distributed — they're not (revenue, clicks, durations are all non-normal). Conformal prediction gives you valid intervals for any model on any data. The Python library MAPIE implements this and wraps any sklearn-compatible model. This is rapidly becoming the standard for uncertainty quantification in production ML.

4.4 Bootstrap Variants & Modern Resampling

An Introduction to the Bootstrap

Efron & Tibshirani, 1993 · The foundational text

The original bootstrap paper and book. Still the essential reference. Key insight: the bootstrap approximates the sampling distribution of a statistic by using the empirical distribution of the data as a proxy for the true distribution. For most statistics of interest in business analytics, the bootstrap is more accurate than analytical formulas.

CUPED: Improving Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data

Deng, Xu, Kohavi & Walker, 2013 · Microsoft Research

CUPED (Controlled-experiment Using Pre-Experiment Data) uses variance reduction to dramatically increase the power of A/B tests without increasing sample size. The key idea: regress out pre-experiment behavior to reduce noise. Implementations at Microsoft, Netflix, and Airbnb show 30-50% variance reduction, effectively requiring half the sample size. Directly applicable to any team running online experiments.

4.5 Open Research Problems

These are areas where the current methods are known to be inadequate and active research is ongoing:

Network interference in A/B tests: When units (users) influence each other through a network, the independence assumption fails. Current solutions (cluster randomization, ego-network experiments) are imperfect. Active research at LinkedIn, Twitter, Meta.
Long-term effects: Most A/B tests run for 1-2 weeks and measure short-term metrics. The correlation between short-term and long-term effects is often low or negative. No consensus solution exists.
Heterogeneous treatment effects: Average treatment effects hide that different user segments respond differently. CATE (Conditional Average Treatment Effect) estimation is a growing area using causal forests and meta-learners.
Valid inference after model selection: Using the same data to select a model and estimate its performance is known to produce over-confident estimates, but the right correction is still debated.
Bandit vs A/B testing: Bayesian multi-armed bandits adapt assignment probabilities as evidence accumulates, but their frequentist properties are not always well understood. When to use bandits vs fixed A/B tests is not fully settled.

4.6 Key Tools & Libraries to Know in 2026

Library	Purpose	When to Use
scipy.stats	Core statistical tests	t-tests, KS-tests, distribution fitting — your daily workhorse
statsmodels	Regression, time series, stats models	OLS, logistic regression, ARIMA, survival analysis
pingouin	Statistical tests with effect sizes	ANOVA, pairwise tests — reports effect sizes automatically
MAPIE	Conformal prediction	Valid prediction intervals for any sklearn model
bootstrapped	Facebook's bootstrap library	A/B test bootstrap CIs for business metrics
PyMC	Bayesian modeling	Hierarchical models, Bayesian A/B testing (Ch 8)
causalml	Causal inference & HTE	Uplift modeling, CATE estimation, meta-learners (Ch 7)

🧠 Chapter 1 — Final Questions

Your company runs 200 A/B tests per year, each at α=0.05 and 80% power. Assume 20% of the tested hypotheses are actually true. How many false positives and false negatives do you expect per year? What does this imply about your product decisions?
A colleague reports: "Our new feature significantly increased DAU (p=0.038, n=50,000)." What three questions would you ask before acting on this finding?
You need a 95% CI for the 95th percentile of customer lifetime value. Why can't you use a normal approximation, and how would you compute it correctly?
Read the abstract of "Always Valid Inference" (Johari et al., 2022). In one paragraph, explain why classical t-tests become invalid when you peek at results, and how anytime-valid inference fixes this.

For Q1: Use Bayes' theorem on your testing results. Prior P(true hypothesis) = 0.2, likelihood of detecting = power = 0.8, Type I rate = α = 0.05. Compute the positive predictive value of a "significant" finding.

← Previous

Table of Contents

Ch 02: EDA & Feature Engineering (Coming Soon)