Chapter 03 Part I · Foundations

Regression & Predictive Modeling
Beyond the Basics

Regression is the workhorse of predictive analytics. Most practitioners know how to fit a line. Far fewer understand when that line is wrong, why it fails, and how to fix it. This chapter covers the complete theory, every assumption, all major violations, and production-grade implementation across five regression families.

📖 ~70 min reading 💻 14 code examples 🔬 6 research papers 🧠 6 core concepts

What Regression Is Actually Doing

Regression finds the function f such that E[Y | X] = f(X) — the expected value of Y given X. That is all regression is. Every technique we add (regularization, polynomial terms, link functions) is in service of estimating this conditional expectation more accurately or under weaker assumptions.

Most practitioners treat regression as a black box. They call LinearRegression().fit(X, y), look at R², and move on. This produces models that appear to work on training data but fail silently in production — because the assumptions that make OLS the best possible estimator were never checked, and in real business data they are routinely violated.

🔑 The Core Mental Model

Linear regression makes a specific claim: the relationship between X and Y is linear, additive, and errors are random noise with constant variance. When that claim is approximately true, OLS is the Best Linear Unbiased Estimator (Gauss-Markov theorem). When it is false, OLS gives systematically wrong answers with no warning. Your job is to verify the claim holds — or to use a method that makes weaker claims.

3.1 OLS — The Foundation

OLS minimizes the sum of squared residuals. Why squared? It penalizes large errors disproportionately, produces a unique closed-form solution, and equals the maximum likelihood estimator under Gaussian errors.

OLS closed-form solution β = (X'X)^-1 X'y X = design matrix (n observations × p features) y = target vector (n × 1) β = coefficient vector (p × 1) Geometrically: projects y onto the column space of X. Residuals e = y - Xβ are orthogonal to every column of X.

The Gauss-Markov Theorem — Five Assumptions

Under these five conditions, OLS is BLUE: the Best Linear Unbiased Estimator. Violating any condition degrades the estimator in a specific, predictable way.

#	Assumption	What It Means	Consequence of Violation	How to Test
GM1	Linearity	E[Y\|X] is linear in X	Systematic bias; non-random residual patterns	Residual vs fitted plot; RESET test
GM2	No perfect multicollinearity	No predictor is exact linear combo of others	X'X is singular; coefficients undefined	VIF per feature
GM3	Strict exogeneity	E[ε\|X] = 0 (no omitted variable bias)	Biased and inconsistent coefficients	Domain knowledge; instrumental variables
GM4	Homoskedasticity	Var(ε\|X) = σ² constant	Wrong standard errors; invalid CIs and t-tests	Breusch-Pagan test; scale-location plot
GM5	No autocorrelation	Cov(ε_i, ε_j) = 0	SEs underestimated; spurious significance	Durbin-Watson; ACF plot of residuals

3.2 The Four Diagnostic Plots — Your Most Important Tool

Every regression analysis must include residual diagnostics. These four plots tell you exactly which assumption is failing and how to fix it. Never skip them.

Plot 1: Residuals vs Fitted Values Good pattern: random scatter around zero, no structure Nonlinearity: systematic curve (S or U shape) Heteroskedas.: funnel shape (variance grows with fitted value) Plot 2: Q-Q Plot of Residuals Good pattern: points fall on the diagonal line Heavy tails: S-curve shape at both ends Right skew: points curve above the line on the right Note: non-normal residuals affect CIs only, not coefficient estimates Plot 3: Scale-Location (sqrt|residuals| vs fitted) Good pattern: flat horizontal red line Heteroskedas.: upward slope (variance increases with fitted value) Plot 4: Residuals vs Leverage (Cook's Distance) Leverage: how far an observation is from the center of X-space Cook's D > 4/n: observation is unduly influencing the fit High lev + high resid: dangerous influential point, investigate

3.3 Multicollinearity — The Hidden Problem

Multicollinearity occurs when predictors are highly correlated. It does not bias coefficient estimates but inflates their variance dramatically — making them unstable, sensitive to small data changes, and statistically insignificant even when the feature is genuinely important.

⚠️ Why This Matters in Practice

Suppose you have age and years_of_experience (r=0.95). The model cannot separate which one deserves the coefficient — it can put all weight on age and none on experience, or vice versa, and get roughly the same predictions. Both coefficients become unstable and may flip signs with one additional observation. The predictions may be fine. The coefficients are meaningless. When interpretability is required (credit scoring, healthcare, regulatory compliance) this is a serious problem.

Variance Inflation Factor (VIF) VIF(j) = 1 / (1 - R²_j) R²_j = R² from regressing feature j on all other features. VIF = 1: No correlation with other features (ideal) VIF 1-5: Acceptable VIF 5-10: Moderate multicollinearity (investigate) VIF > 10: Severe multicollinearity (take action) VIF = ∞: Perfect collinearity (model will fail) Fixes: remove one correlated feature, combine via PCA, or use Ridge regression (naturally handles collinearity).

3.4 Regularization — The Solution to Overfitting

Regularization adds a penalty to the loss function that discourages large coefficients. This introduces a small bias but reduces variance substantially — the classic bias-variance tradeoff. The result generalizes better to new data.

Ridge (L2): shrinks all coefficients toward zero Loss = ∑(y_i - ŷ_i)² + λ ∑ β_j² All coefficients shrink toward zero smoothly. None become exactly zero. Keeps all features. Best when all features may be relevant. Handles multicollinearity well (adds λI to X'X before inversion). Lasso (L1): drives some coefficients to exactly zero Loss = ∑(y_i - ŷ_i)² + λ ∑ |β_j| Performs automatic feature selection. Unstable when features are highly correlated (picks one arbitrarily). Best when you believe most features are irrelevant. Elastic Net: combines L1 and L2 Loss = ∑(y_i - ŷ_i)² + λ₁ ∑ |β_j| + λ₂ ∑ β_j² Sparse like Lasso, stable like Ridge. Best default for p >> n.

💡 Why L1 Produces Sparsity But L2 Does Not

The geometry explains it. The L2 penalty is a sphere — smooth, no corners. When the loss ellipse touches the sphere, it touches at a smooth point where no coefficient is exactly zero. The L1 penalty is a diamond with corners on the axes. The loss ellipse almost always touches those corners first, which drives that coefficient to exactly zero. This is why Lasso produces automatic feature selection and Ridge does not.

3.5 Heteroskedasticity — When Error Variance Is Not Constant

Heteroskedasticity means the variance of errors depends on X. In business data this is the rule, not the exception. Revenue prediction errors are larger for high-revenue customers. Ignoring this produces wrong standard errors, invalidating all t-tests and confidence intervals.

Pattern	Common Cause	Correct Fix
Variance ∝ E[Y]²	Multiplicative process (revenue, income)	Log-transform Y; use log-linear model or Gamma GLM
Variance ∝ E[Y]	Count data, proportions	WLS with weights 1/ŷ; Poisson or Gamma GLM
Different variance per group	Different subpopulations	Cluster-robust standard errors (HC3 by cluster)
Unknown structure	General business data	White HC3 robust standard errors (default choice)

3.6 Regression Family: When to Use Each

Polynomial Regression

Add polynomial terms when the relationship is nonlinear but smooth. Key risk: high-degree polynomials oscillate wildly at boundaries (Runge's phenomenon). Prefer splines or GAMs over degree >3 polynomials for smooth nonlinear relationships.

Quantile Regression

OLS models the mean of Y given X. Quantile regression models any percentile. Use it when you care about tail behavior (worst-case delivery time, 95th percentile latency, Value at Risk), when data is heavily skewed and log-transform doesn't help, or when different user segments have heterogeneous effects.

Generalized Linear Models (GLMs)

GLMs extend OLS to non-normal response variables through a link function that connects the linear predictor to E[Y|X].

Response Type	Distribution	Link	Business Example
Binary (0/1)	Binomial	Logit	Churn, click, default, fraud
Count (≥0)	Poisson	Log	Support tickets, purchases, events per day
Overdispersed count	Negative Binomial	Log	Purchases per customer (variance > mean)
Positive continuous	Gamma	Log	Revenue, LTV, claim amounts, duration
Rate (0 to 1)	Beta	Logit	Conversion rate, market share, CTR
Zero + continuous	Zero-Inflated (two-part)	Mixed	App purchases (most = 0)

Generalized Additive Models (GAMs)

GAMs replace linear terms β_jX_j with smooth functions f_j(X_j) fit using splines. This provides nonlinear flexibility while maintaining full interpretability — you can plot each f_j to see exactly how each feature affects the outcome. Best for >1000 observations per feature and relationships known to be nonlinear.

3.7 The Bias-Variance Tradeoff

Expected prediction error decomposition E[(Y - ŷ)²] = Bias² + Variance + Irreducible Noise Bias²: Systematic error from wrong model form (underfitting) Variance: Error from sensitivity to training data (overfitting) Noise: Cannot be reduced regardless of model complexity Goal: minimize total error, not just bias. Regularization trades small bias for large variance reduction.

Model	Bias	Variance	Train Error	Test Error
Too simple (underfit)	High	Low	High	High
Just right	Low	Low	Low	Low
Too complex (overfit)	Low	High	Very low	High

📌 When to Use Each Regression Type

OLS: Continuous Y, approximately linear relationship, need coefficient interpretation. Always use HC3 robust SEs. Ridge: Many features, possible multicollinearity, need all features retained. Lasso/Elastic Net: Believe most features are irrelevant, want automatic selection. GLM: Non-normal response (binary, count, positive continuous). Quantile: Tail prediction, heteroskedastic data, risk estimation. GAM: Known nonlinear relationships, interpretability required, enough data.

Regression Across Industries

3.8 E-Commerce — Customer Lifetime Value Prediction

Predicting 12-month revenue from each customer to allocate acquisition spend. One of the highest-value regression problems in business.

Why naive OLS fails:

LTV is right-skewed — most customers contribute very little, a few enormously. OLS is distorted by high-LTV outliers
LTV is bounded below at zero. OLS can and will predict negative values
Variance of errors is proportional to predicted value — heteroskedasticity is guaranteed

Production approach:

Two-part model: Part 1 classifies zero-LTV vs positive-LTV (logistic). Part 2 models log(LTV | LTV > 0) via Gamma GLM. Multiply: P(LTV > 0) × E[LTV | LTV > 0]
Gamma GLM with log link: Naturally handles positive right-skewed data, heteroskedasticity is in the likelihood, predictions always positive
Quantile regression at median: More robust than mean for skewed distributions, less distorted by whale customers

✅ Industry Standard at Subscription Companies

Spotify, Netflix, and DoorDash use Gamma GLM with log link for LTV. The Gamma distribution naturally handles positive continuous right-skewed data, variance proportional to mean² is built into the likelihood, and the log link ensures always-positive predictions. No need for manual log-transform and no smearing correction required.

3.9 Real Estate — Hedonic Property Valuation

Predicting property prices from features (size, location, age, amenities).

Spatial autocorrelation: Nearby properties have correlated errors (GM5 violation). Standard SEs are underestimated. Use Moran's I test; correct with spatial lag models or cluster-robust SEs by neighborhood
Log-linear specification (hedonic pricing): Price = exp(β₀ + β₁×log(size) + ...). Coefficients are elasticities: "1% increase in size increases price by β₁%"
Interaction terms essential: The value of an extra bedroom depends heavily on neighborhood. OLS without the interaction will average across neighborhoods and mislead
GAM for age: The age-price relationship is often U-shaped (new is premium, mid-age discounted, historic is premium). Linear terms cannot capture this; GAMs handle it naturally

3.10 Healthcare — Hospital Length of Stay

Predicting days in hospital for capacity planning.

Length of stay is a count variable (integer ≥1) → use Negative Binomial GLM, not OLS
Strong right skew: most stays 2–5 days, some 30+ days. A few complex cases dominate the mean — OLS predictions for typical patients are biased upward
Censoring problem: Some patients are still admitted when data is pulled — their true LOS is unknown. Standard OLS ignores this. Use survival models (Cox regression) or Tobit regression for censored data

3.11 Finance — Probability of Loan Default

Binary outcome (default yes/no) requires logistic regression, not OLS.

Severe class imbalance: 1–5% default rate. Naive logistic regression predicts "no default" for everyone, achieving 95%+ accuracy while being completely useless. Use class weights or threshold tuning
Regulatory requirement for monotone WoE features: each feature must have monotonic relationship with default probability for auditability
The scorecard transformation: Model coefficients are converted into integer points per feature so loan officers can compute scores manually. Still used because regulators can audit it completely

3.12 Ride-sharing — Dynamic Pricing

Predicting the surge multiplier to balance supply and demand in real-time.

Target: positive continuous multiplier → Gamma GLM or log-linear model
Surge 5 minutes ago predicts surge now better than almost any other feature — temporal autocorrelation is severe. Must include lagged features and use time-series-aware cross-validation
Quantile regression at 90th–95th percentile estimates worst-case surge for customer-facing communication ("expect up to 3.5x surcharge")

⚠️ When Regression Is the Wrong Tool Entirely

You need to establish causality: Regression on observational data estimates correlation, not causation. Use Ch 7 (Causal Inference)
Your outcome is a category: Use classification (Ch 4). Linear regression on binary outcomes produces values outside [0,1]
Structural breaks exist: If the X-Y relationship changes at a threshold or over time, a single model averages the regimes and misleads both
Complex interactions across many features: Gradient boosting (Ch 4) will outperform linear regression significantly. Use regression when interpretability matters more than accuracy

Production Regression Toolkit

3.13 Complete Regression Diagnostics

Python — Test all five Gauss-Markov assumptions automatically

import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, linear_reset
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

def full_regression_diagnostics(X, y, model_name="OLS"):
    """
    Run all five Gauss-Markov assumption tests.
    Returns: dict of pass/fail per test + recommended fixes.
    Requires: X as pandas DataFrame with column names.
    """
    X_c = sm.add_constant(X)
    model = sm.OLS(y, X_c).fit()
    residuals = model.resid
    n, p = X.shape
    report = {}

    # 1. Linearity: Ramsey RESET test
    # H0: model correctly specified. p > 0.05 means OK.
    reset = linear_reset(model, power=2)
    report["linearity_ok"]    = reset.pvalue > 0.05
    report["linearity_p"]     = reset.pvalue

    # 2. Homoskedasticity: Breusch-Pagan test
    # H0: constant variance. p > 0.05 means OK.
    _, bp_p, _, _ = het_breuschpagan(residuals, X_c)
    report["homosked_ok"]     = bp_p > 0.05
    report["homosked_p"]      = bp_p

    # 3. Normality of residuals (mainly for small n)
    # For n > 500, CLT makes this less critical.
    sample = residuals.sample(min(5000, n), random_state=42)
    _, norm_p = stats.shapiro(sample)
    report["normality_ok"]    = norm_p > 0.05 or n > 500
    report["normality_p"]     = norm_p

    # 4. No autocorrelation: Durbin-Watson
    # Value near 2 = no autocorrelation. <1.5 or >2.5 is a concern.
    dw = durbin_watson(residuals)
    report["autocorr_ok"]     = 1.5 < dw < 2.5
    report["durbin_watson"]   = dw

    # 5. Multicollinearity: VIF per feature
    vif = pd.DataFrame({
        "feature": X.columns,
        "VIF": [variance_inflation_factor(X_c.values, i+1)
                for i in range(p)]
    }).sort_values("VIF", ascending=False)
    report["max_vif"]         = vif["VIF"].max()
    report["multicollin_ok"]  = report["max_vif"] < 10
    report["vif_table"]       = vif

    # 6. Influential points: Cook's Distance > 4/n
    cooks_d = model.get_influence().cooks_distance[0]
    n_influe = (cooks_d > 4/n).sum()
    report["influential_ok"]  = n_influe < n * 0.05
    report["n_influential"]   = n_influe

    # Print readable summary
    checks = [
        ("Linearity (RESET)",       report["linearity_ok"],
         f"p={report['linearity_p']:.3f}",       "Add polynomial terms or use GAM"),
        ("Homoskedasticity (BP)",    report["homosked_ok"],
         f"p={report['homosked_p']:.3f}",        "Use HC3 robust SEs, WLS, or log(Y)"),
        ("Normality (Shapiro-Wilk)", report["normality_ok"],
         f"p={report['normality_p']:.3f}",       "Check outliers; OK if n>500"),
        ("No autocorrelation (DW)",  report["autocorr_ok"],
         f"DW={report['durbin_watson']:.2f}",    "Add lagged features; use HAC SEs"),
        ("No multicollinearity",     report["multicollin_ok"],
         f"maxVIF={report['max_vif']:.1f}",      "Remove correlated features or use Ridge"),
        ("No influential outliers",  report["influential_ok"],
         f"n={report['n_influential']}",         "Investigate Cook's D > 4/n points"),
    ]
    print("="*58); print(f" DIAGNOSTICS: {model_name}"); print("="*58)
    for name, ok, stat, fix in checks:
        icon = "PASS" if ok else "FAIL"
        print(f"  [{icon}] {name:<30} {stat}")
        if not ok: print(f"         Fix: {fix}")
    print(f"\n  R2={model.rsquared:.3f}  AdjR2={model.rsquared_adj:.3f}  AIC={model.aic:.1f}")
    return report, model

# Usage
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)
report, model = full_regression_diagnostics(housing.data, housing.target)
# Expected: FAIL linearity (relationships are nonlinear)
#           FAIL homoskedasticity (variance grows with price)
#           Both tell you: log-transform Y or use Gamma GLM

3.14 Regularized Regression with Cross-Validated Lambda

Python — Ridge, Lasso, Elastic Net with proper pipeline

import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42)

alphas = np.logspace(-4, 4, 100)   # search 0.0001 to 10000

models = {
    "Ridge":      RidgeCV(alphas=alphas, cv=5),
    "Lasso":      LassoCV(alphas=alphas, cv=5, max_iter=10000),
    "ElasticNet": ElasticNetCV(
                    alphas=alphas,
                    l1_ratio=[.1, .5, .7, .9, 1.0],  # CV picks best ratio too
                    cv=5, max_iter=10000)
}

# IMPORTANT: always scale before regularization.
# Without scaling, features with large magnitude get penalized more
# just because of their units, not their actual importance.
print(ff"{'Model':12} {'RMSE':>8} {'R2':>7} {'Best alpha':>12} {'Non-zero':>10}")
print("-"*55)

for name, reg in models.items():
    pipe = Pipeline([("scale", StandardScaler()), ("reg", reg)])
    pipe.fit(X_tr, y_tr)
    y_pred = pipe.predict(X_te)
    rmse = np.sqrt(np.mean((y_te - y_pred)**2))
    r2   = 1 - np.mean((y_te - y_pred)**2) / np.var(y_te)
    alpha_best = getattr(reg, 'alpha_', 'N/A')
    nz = (np.abs(reg.coef_) > 1e-6).sum()
    print(ff"{name:12} {rmse:8.4f} {r2:7.3f} {str(alpha_best):>12} {nz:>10}/{X_tr.shape[1]}")

# Lasso feature selection: which features survived?
lasso_coefs = pd.Series(models["Lasso"].coef_, index=X_tr.columns)
survived = lasso_coefs[lasso_coefs != 0].sort_values(key=abs, ascending=False)
print(f"\nLasso kept {len(survived)} features (sorted by |coefficient|):")
print(survived.round(4))

3.15 Robust Standard Errors — HC3 by Default

Python — When and how to use heteroskedasticity-robust SEs

import statsmodels.api as sm
import pandas as pd

def regression_with_robust_ses(X, y):
    """
    Fit OLS and compare naive vs HC3-robust standard errors.

    Recommendation: use HC3 by default for all cross-sectional data.
    HC3 is conservative (slightly wider CIs) but always valid.
    Naive OLS SEs are only valid under exact homoskedasticity,
    which almost never holds in practice.

    SE types available in statsmodels:
    'HC0' - White (1980): original robust SE
    'HC1' - small-sample correction
    'HC3' - default recommendation: better for small n, most conservative
    'HAC' - Newey-West: also corrects for autocorrelation
    """
    X_c = sm.add_constant(X)
    naive  = sm.OLS(y, X_c).fit()
    robust = sm.OLS(y, X_c).fit(cov_type='HC3')

    comparison = pd.DataFrame({
        "coef":       naive.params,
        "SE_naive":   naive.bse,
        "SE_HC3":     robust.bse,
        "SE_ratio":   robust.bse / naive.bse,  # >1 means naive SEs too small
        "sig_naive":  naive.pvalues < 0.05,
        "sig_robust": robust.pvalues < 0.05,
    })

    # Most important: which results change significance?
    changed = comparison[comparison["sig_naive"] != comparison["sig_robust"]]
    if len(changed) > 0:
        print("WARNING: significance changes with robust SEs:")
        print(changed[["coef","SE_naive","SE_HC3","sig_naive","sig_robust"]])
        print("These results were false positives with naive SEs.")
    else:
        print("No significance changes. Robust SEs confirm all results.")
    return robust, comparison

# Practical rule: if Breusch-Pagan p < 0.05, MUST use HC3.
# In practice: just always use HC3. The cost is near-zero.

3.16 Gamma GLM for Revenue and LTV

Python — The correct model for positive right-skewed outcomes

import statsmodels.api as sm
import numpy as np
import pandas as pd

def fit_gamma_glm(X, y):
    """
    Gamma GLM with log link for positive continuous right-skewed Y.

    Why this is better than log(Y) + OLS:
    1. Predictions always positive (enforced by log link)
    2. Heteroskedasticity (var proportional to mean^2) is built in
    3. No smearing correction needed when interpreting predictions
    4. Coefficients: exp(beta_j) = multiplicative effect on E[Y]
       e.g. exp(0.15) = 1.16 means feature increases Y by 16%
    """
    X_c = sm.add_constant(X)
    glm = sm.GLM(
        y, X_c,
        family=sm.families.Gamma(link=sm.families.links.Log())
    ).fit()

    coef_df = pd.DataFrame({
        "coef":       glm.params,
        "exp_coef":   np.exp(glm.params),   # multiplicative effect
        "pvalue":     glm.pvalues,
        "significant": glm.pvalues < 0.05
    }).sort_values("pvalue")

    print(f"Gamma GLM Results")
    print(f"Pseudo R2 (deviance): {1 - glm.deviance/glm.null_deviance:.4f}")
    print(f"AIC: {glm.aic:.2f}")
    print(f"Dispersion: {glm.scale:.4f}")
    print(f"\nCoefficients (exp(coef) = multiplicative effect on Y):")
    print(coef_df.round(4))
    return glm, coef_df

# Interpreting Gamma GLM coefficients:
# coef = 0.12 for feature X means:
# When X increases by 1 unit, E[Y] is multiplied by exp(0.12) = 1.127
# i.e., Y increases by 12.7%
# This is the log-linear interpretation — natural for business metrics.

3.17 Quantile Regression — Modeling the Tails

Python — When you need more than just the mean

import statsmodels.api as sm
import pandas as pd
import numpy as np

def quantile_regression_suite(X, y, quantiles=None):
    """
    Fit quantile regression at multiple quantiles.

    Use cases:
    - q=0.5 (median): robust alternative to OLS mean
    - q=0.9: worst-case prediction (delivery SLA, load estimates)
    - Comparing coefficients across quantiles reveals heterogeneous effects:
      if beta_j changes a lot from q=0.1 to q=0.9, feature j has
      different effects for low vs high Y values.
    """
    if quantiles is None:
        quantiles = [0.10, 0.25, 0.50, 0.75, 0.90]
    X_c = sm.add_constant(X)
    results = {}
    for q in quantiles:
        qr = sm.QuantReg(y, X_c).fit(q=q)
        results[q] = qr.params

    coef_table = pd.DataFrame(results).T
    coef_table.index.name = "quantile"
    print("Coefficients across quantiles:")
    print(coef_table.round(4))

    # Heterogeneous effect detection: large spread = effect varies by Y level
    spread = coef_table.max() - coef_table.min()
    het = spread[spread > coef_table.abs().mean() * 0.5]
    if len(het) > 0:
        print(f"\nHeterogeneous effects detected for: {list(het.index)}")
        print("These features affect low and high Y values differently.")
    return coef_table

# Practical note: quantile regression minimizes the pinball loss
# L(q, y, y_hat) = q*(y-y_hat) if y>=y_hat, else (1-q)*(y_hat-y)
# This asymmetric loss is why predictions converge to quantile q.

3.18 The Log-Transform and Smearing Correction

Python — The correct way to back-transform predictions

import numpy as np
from sklearn.linear_model import LinearRegression

def log_linear_with_smearing(X_train, y_train, X_test):
    """
    Log-linear model with Duan's smearing correction.

    THE MISTAKE everyone makes:
        model.fit(X, log(y))
        predictions = exp(model.predict(X_test))   # WRONG

    WHY IT'S WRONG:
        E[Y | X] != exp(E[log(Y) | X])
        exp(E[log Y]) = median(Y), not mean(Y)
        For right-skewed Y, mean >> median
        So naive back-transform systematically under-predicts the mean

    CORRECT (Duan 1983):
        E[Y | X] = exp(fitted_log_y) * E[exp(residuals)]
        The factor E[exp(residuals)] is the smearing correction.
    """
    log_y = np.log(y_train)
    model = LinearRegression().fit(X_train, log_y)

    log_pred_train = model.predict(X_train)
    residuals = log_y - log_pred_train

    # Duan smearing factor: nonparametric, works for any residual distribution
    smearing_factor = np.mean(np.exp(residuals))

    log_pred_test = model.predict(X_test)
    y_pred_correct = np.exp(log_pred_test) * smearing_factor
    y_pred_naive   = np.exp(log_pred_test)           # wrong for mean prediction
    y_pred_median  = np.exp(log_pred_test)           # actually correct for median

    print(f"Smearing factor: {smearing_factor:.4f}")
    print(f"Mean prediction (naive):   {y_pred_naive.mean():.2f}  <-- too low")
    print(f"Mean prediction (correct): {y_pred_correct.mean():.2f}  <-- use this")
    print(f"If smearing factor is 1.0: residuals are symmetric, naive is fine")
    return y_pred_correct, smearing_factor

3.19 GAM — Nonlinear and Interpretable

Python — Fit a GAM and identify which features are nonlinear

# pip install pygam
from pygam import LinearGAM, s
import numpy as np
import matplotlib.pyplot as plt

def fit_and_interpret_gam(X, y, feature_names, n_splines=20):
    """
    GAM: E[Y|X] = intercept + f1(X1) + f2(X2) + ... + fp(Xp)
    Each f_j is a smooth spline fit by penalized regression.

    Key diagnostic: Effective Degrees of Freedom (EDF) per term.
    EDF = 1:      linear relationship (f_j is just a line)
    EDF = 2-3:    mild nonlinearity
    EDF > 4:      strong nonlinearity (U-shape, thresholds, etc.)
    """
    X = np.array(X); p = X.shape[1]
    terms = sum(s(i, n_splines=n_splines) for i in range(p))
    gam = LinearGAM(terms).gridsearch(X, y)  # GCV finds optimal smoothing

    print("GAM Results:")
    print(f"Pseudo R2: {gam.statistics_['pseudo_r2']['explained_deviance']:.4f}")
    print(f"AIC:       {gam.statistics_['AIC']:.2f}")
    print(f"\nEffective degrees of freedom per feature:")
    edfs = gam.statistics_['edof_per_coef']

    for i, name in enumerate(feature_names):
        edf = edfs[i]  # approximate; exact slicing depends on pygam version
        status = "LINEAR" if edf < 2 else "NONLINEAR"
        print(f"  {name:25} EDF={edf:.2f}  ({status})")

    # Plot partial dependence for each feature
    fig, axes = plt.subplots(1, p, figsize=(4*p, 3), sharey=False)
    for i, (ax, name) in enumerate(zip(axes, feature_names)):
        XX = gam.generate_X_grid(term=i)
        pdep, confi = gam.partial_dependence(term=i, X=XX, width=0.95)
        ax.plot(XX[:, i], pdep, lw=2, color='#8B2C2C')
        ax.fill_between(XX[:, i], confi[:,0], confi[:,1], alpha=0.15, color='#8B2C2C')
        ax.axhline(0, color='gray', lw=0.8, linestyle='--')
        ax.set_xlabel(name); ax.set_ylabel("f(feature)")
    plt.tight_layout()
    return gam

🧠 Section 3 — Check Your Understanding

You run full_regression_diagnostics() and get FAIL on homoskedasticity (BP p=0.001), PASS on everything else. Your business requires valid confidence intervals. What are your three options in priority order?
Lasso eliminates 12 of 30 features (RMSE=0.42). Ridge keeps all 30 (RMSE=0.41). Which do you deploy and why? Under what circumstances would you change your answer?
You predict customer revenue using log(revenue) + OLS. Predictions seem systematically too low by about 20%. Your colleague says "just multiply predictions by 1.2." What is the actual problem, and what is the principled fix?
A GAM returns EDF=7.8 for the customer_age feature. What does this tell you about the age-revenue relationship? Could a linear regression or even a 2nd-degree polynomial capture this?

Hint for Q3: The issue is the smearing problem. exp(E[log Y]) = median(Y), not mean(Y). For right-skewed distributions mean > median, so naive back-transformation under-predicts the mean. The fix is Duan's smearing correction: multiply exp(fitted) by mean(exp(residuals_train)). Your colleague's "1.2" approximation might accidentally be close to the correct smearing factor, but it won't generalize to new data segments.

The Cutting Edge of Regression and Prediction

Double Machine Learning — Causal Coefficients at Scale

Classical regression coefficients in observational data are not causal — they are contaminated by confounders. Double Machine Learning uses ML to partial out confounders and produces coefficient estimates that are both causal and valid in high dimensions.

Double/Debiased Machine Learning for Treatment and Structural Parameters

Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey & Robins · Econometrics Journal, 2018

The DML algorithm: to estimate the causal effect of treatment D on outcome Y controlling for high-dimensional X, fit ML models to predict Y from X (get residuals e_Y) and D from X (get residuals e_D), then regress e_Y on e_D. The resulting coefficient is causal and debiased from regularization bias. Used at Amazon, Microsoft, and LinkedIn for pricing and policy evaluation. The DoubleML Python library implements the full framework including inference.

Conformal Prediction — Valid Intervals Without Assumptions

A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification

Angelopoulos & Bates · Berkeley, 2022

Classical prediction intervals assume normally distributed residuals. Conformal prediction produces intervals guaranteed to cover the true value with exactly 1-α probability, for any model, on any data, with no distributional assumptions. The only requirement is exchangeability (approximately i.i.d.). The MAPIE Python library wraps any sklearn model. Increasingly the standard for production forecasting uncertainty.

Neural Additive Models — Deep Learning with GAM Interpretability

Neural Additive Models: Interpretable Machine Learning with Neural Nets

Agarwal, Melnick, Frosst, Zhang, Lengerich, Caruana & Hinton · NeurIPS 2021 · Google Brain

NAMs replace each f_j in a GAM with a neural network, allowing highly complex nonlinear feature-level functions while maintaining full interpretability. Each feature's effect can be plotted exactly as in a classic GAM. On tabular datasets, NAMs match gradient boosting accuracy while providing GAM-level interpretability — partial dependence plots that are exact, not approximate. The neural-additive-models Python package implements this.

GAMLSS: Generalized Additive Models for Location, Scale and Shape

Rigby & Stasinopoulos · Journal of the Royal Statistical Society, 2005

GAMLSS models all parameters of the response distribution simultaneously — mean, variance, skewness, and kurtosis — each as a smooth function of covariates. This captures heteroskedasticity and distributional shape changes that standard regression ignores. For example: customer spend may have higher variance on weekends or a heavier tail for new customers. The state of practice for distributional forecasting in insurance and supply chain.

Why do tree-based models still outperform deep learning on tabular data?

Grinsztajn, Oyallon & Varoquaux · NeurIPS 2022

A rigorous benchmark across 45 tabular datasets finding gradient boosting (XGBoost, LightGBM) still outperforms deep learning on tabular regression in most settings. Tree models are robust to uninformative features and handle irregular target functions naturally. Practical implication: for tabular regression use regularized linear models for interpretability, gradient boosting for accuracy, and deep learning only for very large datasets (>100K rows) with sufficient tuning budget.

Post-Selection Inference for Lasso and Related Procedures

Lee, Sun, Sun & Taylor · Annals of Statistics, 2016

When Lasso is used for feature selection and OLS is then run on selected features, the standard errors are invalid — the selection process inflates apparent significance. This paper derives exact inference procedures that account for the selection step. Practically: if you use Lasso to select and then OLS to estimate, t-tests and CIs from the OLS output are misleading. Use the selectiveInference R package or statsmodels post-selection procedures for valid inference after Lasso.

Open Research Problems

Automatic model specification: Choosing transformations, interactions, and link functions is still largely manual expertise. Symbolic regression (via genetic programming, e.g. PySR) can discover mathematical relationships from data but is not yet reliable enough to replace domain judgment.
Regression under distribution shift: A model calibrated on Q1 2023 may be poorly calibrated on Q1 2024. Invariant Risk Minimization (IRM) and related methods try to find regression coefficients stable across environments but have not yet found wide practical adoption.
Causal coefficient estimation at scale: DML handles low-dimensional treatments well. Extending it to settings with many simultaneous treatments and interference between units is an active research area.
Regression for graph-structured data: When observations are nodes in a network and errors are correlated by adjacency, standard regression SEs are wrong in ways that are hard to correct without knowing the network. Spatial econometrics and network regression methods exist but are not yet mainstream in industry.

Recommended Library Stack for 2026

Task	Library	Key Function
OLS + diagnostics + robust SEs	statsmodels	`sm.OLS().fit(cov_type='HC3')`
Regularized regression	scikit-learn	`ElasticNetCV`, `LassoCV`, `RidgeCV`
GLMs (Gamma, Poisson, NegBin)	statsmodels	`sm.GLM(family=Gamma())`
Quantile regression	statsmodels	`sm.QuantReg().fit(q=0.9)`
GAMs	pygam	`LinearGAM(s(0)+s(1)).gridsearch()`
Conformal prediction intervals	MAPIE	`MapieRegressor`
Double Machine Learning	DoubleML	`DoubleMLPLR` (partially linear)
Neural Additive Models	neural-additive-models	`NAMClassifier`, `NAMRegressor`

🧠 Chapter 3 — Final Questions

You are building a model to predict insurance claim amounts for 50,000 policies. 70% have zero claims; the rest are log-normally distributed. Walk through your complete modeling strategy. What model(s) would you use, and why?
A colleague fits OLS on customer revenue data and gets R²=0.81 training, 0.79 test. They declare the model production-ready. What four specific things would you check before agreeing, and how would you check each one?
You need to estimate the causal effect of an email campaign on purchase probability, controlling for 150 behavioral features. Why can't you just run logistic regression with all 150 features as controls? When does Double Machine Learning become necessary?
Your GAM has EDF=8.3 for the days_since_signup feature. Plotting the partial dependence shows the effect is positive up to day 30, then flat, then positive again around day 90 and day 365. What business process might explain this pattern, and how would you communicate it to a non-technical stakeholder?

Hint for Q1: Two-part (hurdle) model. Part 1: logistic regression to predict P(claim > 0). Part 2: Gamma GLM with log link to predict E[claim | claim > 0]. Final prediction: P(claim > 0) × E[claim | claim > 0]. This is the industry standard in non-life insurance actuarial modeling (Tweedie distributions are a related single-model approach).