Regression & Predictive Modeling
Beyond the Basics
Regression is the workhorse of predictive analytics. Most practitioners know how to fit a line. Far fewer understand when that line is wrong, why it fails, and how to fix it. This chapter covers the complete theory, every assumption, all major violations, and production-grade implementation across five regression families.
What Regression Is Actually Doing
Regression finds the function f such that E[Y | X] = f(X) — the expected value of Y given X. That is all regression is. Every technique we add (regularization, polynomial terms, link functions) is in service of estimating this conditional expectation more accurately or under weaker assumptions.
Most practitioners treat regression as a black box. They call LinearRegression().fit(X, y), look at R², and move on. This produces models that appear to work on training data but fail silently in production — because the assumptions that make OLS the best possible estimator were never checked, and in real business data they are routinely violated.
Linear regression makes a specific claim: the relationship between X and Y is linear, additive, and errors are random noise with constant variance. When that claim is approximately true, OLS is the Best Linear Unbiased Estimator (Gauss-Markov theorem). When it is false, OLS gives systematically wrong answers with no warning. Your job is to verify the claim holds — or to use a method that makes weaker claims.
3.1 OLS — The Foundation
OLS minimizes the sum of squared residuals. Why squared? It penalizes large errors disproportionately, produces a unique closed-form solution, and equals the maximum likelihood estimator under Gaussian errors.
The Gauss-Markov Theorem — Five Assumptions
Under these five conditions, OLS is BLUE: the Best Linear Unbiased Estimator. Violating any condition degrades the estimator in a specific, predictable way.
| # | Assumption | What It Means | Consequence of Violation | How to Test |
|---|---|---|---|---|
| GM1 | Linearity | E[Y|X] is linear in X | Systematic bias; non-random residual patterns | Residual vs fitted plot; RESET test |
| GM2 | No perfect multicollinearity | No predictor is exact linear combo of others | X'X is singular; coefficients undefined | VIF per feature |
| GM3 | Strict exogeneity | E[ε|X] = 0 (no omitted variable bias) | Biased and inconsistent coefficients | Domain knowledge; instrumental variables |
| GM4 | Homoskedasticity | Var(ε|X) = σ² constant | Wrong standard errors; invalid CIs and t-tests | Breusch-Pagan test; scale-location plot |
| GM5 | No autocorrelation | Cov(εi, εj) = 0 | SEs underestimated; spurious significance | Durbin-Watson; ACF plot of residuals |
3.2 The Four Diagnostic Plots — Your Most Important Tool
Every regression analysis must include residual diagnostics. These four plots tell you exactly which assumption is failing and how to fix it. Never skip them.
3.3 Multicollinearity — The Hidden Problem
Multicollinearity occurs when predictors are highly correlated. It does not bias coefficient estimates but inflates their variance dramatically — making them unstable, sensitive to small data changes, and statistically insignificant even when the feature is genuinely important.
Suppose you have age and years_of_experience (r=0.95). The model cannot separate which one deserves the coefficient — it can put all weight on age and none on experience, or vice versa, and get roughly the same predictions. Both coefficients become unstable and may flip signs with one additional observation. The predictions may be fine. The coefficients are meaningless. When interpretability is required (credit scoring, healthcare, regulatory compliance) this is a serious problem.
3.4 Regularization — The Solution to Overfitting
Regularization adds a penalty to the loss function that discourages large coefficients. This introduces a small bias but reduces variance substantially — the classic bias-variance tradeoff. The result generalizes better to new data.
The geometry explains it. The L2 penalty is a sphere — smooth, no corners. When the loss ellipse touches the sphere, it touches at a smooth point where no coefficient is exactly zero. The L1 penalty is a diamond with corners on the axes. The loss ellipse almost always touches those corners first, which drives that coefficient to exactly zero. This is why Lasso produces automatic feature selection and Ridge does not.
3.5 Heteroskedasticity — When Error Variance Is Not Constant
Heteroskedasticity means the variance of errors depends on X. In business data this is the rule, not the exception. Revenue prediction errors are larger for high-revenue customers. Ignoring this produces wrong standard errors, invalidating all t-tests and confidence intervals.
| Pattern | Common Cause | Correct Fix |
|---|---|---|
| Variance ∝ E[Y]² | Multiplicative process (revenue, income) | Log-transform Y; use log-linear model or Gamma GLM |
| Variance ∝ E[Y] | Count data, proportions | WLS with weights 1/ŷ; Poisson or Gamma GLM |
| Different variance per group | Different subpopulations | Cluster-robust standard errors (HC3 by cluster) |
| Unknown structure | General business data | White HC3 robust standard errors (default choice) |
3.6 Regression Family: When to Use Each
Polynomial Regression
Add polynomial terms when the relationship is nonlinear but smooth. Key risk: high-degree polynomials oscillate wildly at boundaries (Runge's phenomenon). Prefer splines or GAMs over degree >3 polynomials for smooth nonlinear relationships.
Quantile Regression
OLS models the mean of Y given X. Quantile regression models any percentile. Use it when you care about tail behavior (worst-case delivery time, 95th percentile latency, Value at Risk), when data is heavily skewed and log-transform doesn't help, or when different user segments have heterogeneous effects.
Generalized Linear Models (GLMs)
GLMs extend OLS to non-normal response variables through a link function that connects the linear predictor to E[Y|X].
| Response Type | Distribution | Link | Business Example |
|---|---|---|---|
| Binary (0/1) | Binomial | Logit | Churn, click, default, fraud |
| Count (≥0) | Poisson | Log | Support tickets, purchases, events per day |
| Overdispersed count | Negative Binomial | Log | Purchases per customer (variance > mean) |
| Positive continuous | Gamma | Log | Revenue, LTV, claim amounts, duration |
| Rate (0 to 1) | Beta | Logit | Conversion rate, market share, CTR |
| Zero + continuous | Zero-Inflated (two-part) | Mixed | App purchases (most = 0) |
Generalized Additive Models (GAMs)
GAMs replace linear terms βjXj with smooth functions fj(Xj) fit using splines. This provides nonlinear flexibility while maintaining full interpretability — you can plot each fj to see exactly how each feature affects the outcome. Best for >1000 observations per feature and relationships known to be nonlinear.
3.7 The Bias-Variance Tradeoff
| Model | Bias | Variance | Train Error | Test Error |
|---|---|---|---|---|
| Too simple (underfit) | High | Low | High | High |
| Just right | Low | Low | Low | Low |
| Too complex (overfit) | Low | High | Very low | High |
OLS: Continuous Y, approximately linear relationship, need coefficient interpretation. Always use HC3 robust SEs. Ridge: Many features, possible multicollinearity, need all features retained. Lasso/Elastic Net: Believe most features are irrelevant, want automatic selection. GLM: Non-normal response (binary, count, positive continuous). Quantile: Tail prediction, heteroskedastic data, risk estimation. GAM: Known nonlinear relationships, interpretability required, enough data.
Regression Across Industries
3.8 E-Commerce — Customer Lifetime Value Prediction
Predicting 12-month revenue from each customer to allocate acquisition spend. One of the highest-value regression problems in business.
Why naive OLS fails:
- LTV is right-skewed — most customers contribute very little, a few enormously. OLS is distorted by high-LTV outliers
- LTV is bounded below at zero. OLS can and will predict negative values
- Variance of errors is proportional to predicted value — heteroskedasticity is guaranteed
Production approach:
- Two-part model: Part 1 classifies zero-LTV vs positive-LTV (logistic). Part 2 models log(LTV | LTV > 0) via Gamma GLM. Multiply: P(LTV > 0) × E[LTV | LTV > 0]
- Gamma GLM with log link: Naturally handles positive right-skewed data, heteroskedasticity is in the likelihood, predictions always positive
- Quantile regression at median: More robust than mean for skewed distributions, less distorted by whale customers
Spotify, Netflix, and DoorDash use Gamma GLM with log link for LTV. The Gamma distribution naturally handles positive continuous right-skewed data, variance proportional to mean² is built into the likelihood, and the log link ensures always-positive predictions. No need for manual log-transform and no smearing correction required.
3.9 Real Estate — Hedonic Property Valuation
Predicting property prices from features (size, location, age, amenities).
- Spatial autocorrelation: Nearby properties have correlated errors (GM5 violation). Standard SEs are underestimated. Use Moran's I test; correct with spatial lag models or cluster-robust SEs by neighborhood
- Log-linear specification (hedonic pricing): Price = exp(β0 + β1×log(size) + ...). Coefficients are elasticities: "1% increase in size increases price by β1%"
- Interaction terms essential: The value of an extra bedroom depends heavily on neighborhood. OLS without the interaction will average across neighborhoods and mislead
- GAM for age: The age-price relationship is often U-shaped (new is premium, mid-age discounted, historic is premium). Linear terms cannot capture this; GAMs handle it naturally
3.10 Healthcare — Hospital Length of Stay
Predicting days in hospital for capacity planning.
- Length of stay is a count variable (integer ≥1) → use Negative Binomial GLM, not OLS
- Strong right skew: most stays 2–5 days, some 30+ days. A few complex cases dominate the mean — OLS predictions for typical patients are biased upward
- Censoring problem: Some patients are still admitted when data is pulled — their true LOS is unknown. Standard OLS ignores this. Use survival models (Cox regression) or Tobit regression for censored data
3.11 Finance — Probability of Loan Default
Binary outcome (default yes/no) requires logistic regression, not OLS.
- Severe class imbalance: 1–5% default rate. Naive logistic regression predicts "no default" for everyone, achieving 95%+ accuracy while being completely useless. Use class weights or threshold tuning
- Regulatory requirement for monotone WoE features: each feature must have monotonic relationship with default probability for auditability
- The scorecard transformation: Model coefficients are converted into integer points per feature so loan officers can compute scores manually. Still used because regulators can audit it completely
3.12 Ride-sharing — Dynamic Pricing
Predicting the surge multiplier to balance supply and demand in real-time.
- Target: positive continuous multiplier → Gamma GLM or log-linear model
- Surge 5 minutes ago predicts surge now better than almost any other feature — temporal autocorrelation is severe. Must include lagged features and use time-series-aware cross-validation
- Quantile regression at 90th–95th percentile estimates worst-case surge for customer-facing communication ("expect up to 3.5x surcharge")
- You need to establish causality: Regression on observational data estimates correlation, not causation. Use Ch 7 (Causal Inference)
- Your outcome is a category: Use classification (Ch 4). Linear regression on binary outcomes produces values outside [0,1]
- Structural breaks exist: If the X-Y relationship changes at a threshold or over time, a single model averages the regimes and misleads both
- Complex interactions across many features: Gradient boosting (Ch 4) will outperform linear regression significantly. Use regression when interpretability matters more than accuracy
Production Regression Toolkit
3.13 Complete Regression Diagnostics
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, linear_reset
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor
def full_regression_diagnostics(X, y, model_name="OLS"):
"""
Run all five Gauss-Markov assumption tests.
Returns: dict of pass/fail per test + recommended fixes.
Requires: X as pandas DataFrame with column names.
"""
X_c = sm.add_constant(X)
model = sm.OLS(y, X_c).fit()
residuals = model.resid
n, p = X.shape
report = {}
# 1. Linearity: Ramsey RESET test
# H0: model correctly specified. p > 0.05 means OK.
reset = linear_reset(model, power=2)
report["linearity_ok"] = reset.pvalue > 0.05
report["linearity_p"] = reset.pvalue
# 2. Homoskedasticity: Breusch-Pagan test
# H0: constant variance. p > 0.05 means OK.
_, bp_p, _, _ = het_breuschpagan(residuals, X_c)
report["homosked_ok"] = bp_p > 0.05
report["homosked_p"] = bp_p
# 3. Normality of residuals (mainly for small n)
# For n > 500, CLT makes this less critical.
sample = residuals.sample(min(5000, n), random_state=42)
_, norm_p = stats.shapiro(sample)
report["normality_ok"] = norm_p > 0.05 or n > 500
report["normality_p"] = norm_p
# 4. No autocorrelation: Durbin-Watson
# Value near 2 = no autocorrelation. <1.5 or >2.5 is a concern.
dw = durbin_watson(residuals)
report["autocorr_ok"] = 1.5 < dw < 2.5
report["durbin_watson"] = dw
# 5. Multicollinearity: VIF per feature
vif = pd.DataFrame({
"feature": X.columns,
"VIF": [variance_inflation_factor(X_c.values, i+1)
for i in range(p)]
}).sort_values("VIF", ascending=False)
report["max_vif"] = vif["VIF"].max()
report["multicollin_ok"] = report["max_vif"] < 10
report["vif_table"] = vif
# 6. Influential points: Cook's Distance > 4/n
cooks_d = model.get_influence().cooks_distance[0]
n_influe = (cooks_d > 4/n).sum()
report["influential_ok"] = n_influe < n * 0.05
report["n_influential"] = n_influe
# Print readable summary
checks = [
("Linearity (RESET)", report["linearity_ok"],
f"p={report['linearity_p']:.3f}", "Add polynomial terms or use GAM"),
("Homoskedasticity (BP)", report["homosked_ok"],
f"p={report['homosked_p']:.3f}", "Use HC3 robust SEs, WLS, or log(Y)"),
("Normality (Shapiro-Wilk)", report["normality_ok"],
f"p={report['normality_p']:.3f}", "Check outliers; OK if n>500"),
("No autocorrelation (DW)", report["autocorr_ok"],
f"DW={report['durbin_watson']:.2f}", "Add lagged features; use HAC SEs"),
("No multicollinearity", report["multicollin_ok"],
f"maxVIF={report['max_vif']:.1f}", "Remove correlated features or use Ridge"),
("No influential outliers", report["influential_ok"],
f"n={report['n_influential']}", "Investigate Cook's D > 4/n points"),
]
print("="*58); print(f" DIAGNOSTICS: {model_name}"); print("="*58)
for name, ok, stat, fix in checks:
icon = "PASS" if ok else "FAIL"
print(f" [{icon}] {name:<30} {stat}")
if not ok: print(f" Fix: {fix}")
print(f"\n R2={model.rsquared:.3f} AdjR2={model.rsquared_adj:.3f} AIC={model.aic:.1f}")
return report, model
# Usage
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)
report, model = full_regression_diagnostics(housing.data, housing.target)
# Expected: FAIL linearity (relationships are nonlinear)
# FAIL homoskedasticity (variance grows with price)
# Both tell you: log-transform Y or use Gamma GLM
3.14 Regularized Regression with Cross-Validated Lambda
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42)
alphas = np.logspace(-4, 4, 100) # search 0.0001 to 10000
models = {
"Ridge": RidgeCV(alphas=alphas, cv=5),
"Lasso": LassoCV(alphas=alphas, cv=5, max_iter=10000),
"ElasticNet": ElasticNetCV(
alphas=alphas,
l1_ratio=[.1, .5, .7, .9, 1.0], # CV picks best ratio too
cv=5, max_iter=10000)
}
# IMPORTANT: always scale before regularization.
# Without scaling, features with large magnitude get penalized more
# just because of their units, not their actual importance.
print(ff"{'Model':12} {'RMSE':>8} {'R2':>7} {'Best alpha':>12} {'Non-zero':>10}")
print("-"*55)
for name, reg in models.items():
pipe = Pipeline([("scale", StandardScaler()), ("reg", reg)])
pipe.fit(X_tr, y_tr)
y_pred = pipe.predict(X_te)
rmse = np.sqrt(np.mean((y_te - y_pred)**2))
r2 = 1 - np.mean((y_te - y_pred)**2) / np.var(y_te)
alpha_best = getattr(reg, 'alpha_', 'N/A')
nz = (np.abs(reg.coef_) > 1e-6).sum()
print(ff"{name:12} {rmse:8.4f} {r2:7.3f} {str(alpha_best):>12} {nz:>10}/{X_tr.shape[1]}")
# Lasso feature selection: which features survived?
lasso_coefs = pd.Series(models["Lasso"].coef_, index=X_tr.columns)
survived = lasso_coefs[lasso_coefs != 0].sort_values(key=abs, ascending=False)
print(f"\nLasso kept {len(survived)} features (sorted by |coefficient|):")
print(survived.round(4))
3.15 Robust Standard Errors — HC3 by Default
import statsmodels.api as sm
import pandas as pd
def regression_with_robust_ses(X, y):
"""
Fit OLS and compare naive vs HC3-robust standard errors.
Recommendation: use HC3 by default for all cross-sectional data.
HC3 is conservative (slightly wider CIs) but always valid.
Naive OLS SEs are only valid under exact homoskedasticity,
which almost never holds in practice.
SE types available in statsmodels:
'HC0' - White (1980): original robust SE
'HC1' - small-sample correction
'HC3' - default recommendation: better for small n, most conservative
'HAC' - Newey-West: also corrects for autocorrelation
"""
X_c = sm.add_constant(X)
naive = sm.OLS(y, X_c).fit()
robust = sm.OLS(y, X_c).fit(cov_type='HC3')
comparison = pd.DataFrame({
"coef": naive.params,
"SE_naive": naive.bse,
"SE_HC3": robust.bse,
"SE_ratio": robust.bse / naive.bse, # >1 means naive SEs too small
"sig_naive": naive.pvalues < 0.05,
"sig_robust": robust.pvalues < 0.05,
})
# Most important: which results change significance?
changed = comparison[comparison["sig_naive"] != comparison["sig_robust"]]
if len(changed) > 0:
print("WARNING: significance changes with robust SEs:")
print(changed[["coef","SE_naive","SE_HC3","sig_naive","sig_robust"]])
print("These results were false positives with naive SEs.")
else:
print("No significance changes. Robust SEs confirm all results.")
return robust, comparison
# Practical rule: if Breusch-Pagan p < 0.05, MUST use HC3.
# In practice: just always use HC3. The cost is near-zero.
3.16 Gamma GLM for Revenue and LTV
import statsmodels.api as sm
import numpy as np
import pandas as pd
def fit_gamma_glm(X, y):
"""
Gamma GLM with log link for positive continuous right-skewed Y.
Why this is better than log(Y) + OLS:
1. Predictions always positive (enforced by log link)
2. Heteroskedasticity (var proportional to mean^2) is built in
3. No smearing correction needed when interpreting predictions
4. Coefficients: exp(beta_j) = multiplicative effect on E[Y]
e.g. exp(0.15) = 1.16 means feature increases Y by 16%
"""
X_c = sm.add_constant(X)
glm = sm.GLM(
y, X_c,
family=sm.families.Gamma(link=sm.families.links.Log())
).fit()
coef_df = pd.DataFrame({
"coef": glm.params,
"exp_coef": np.exp(glm.params), # multiplicative effect
"pvalue": glm.pvalues,
"significant": glm.pvalues < 0.05
}).sort_values("pvalue")
print(f"Gamma GLM Results")
print(f"Pseudo R2 (deviance): {1 - glm.deviance/glm.null_deviance:.4f}")
print(f"AIC: {glm.aic:.2f}")
print(f"Dispersion: {glm.scale:.4f}")
print(f"\nCoefficients (exp(coef) = multiplicative effect on Y):")
print(coef_df.round(4))
return glm, coef_df
# Interpreting Gamma GLM coefficients:
# coef = 0.12 for feature X means:
# When X increases by 1 unit, E[Y] is multiplied by exp(0.12) = 1.127
# i.e., Y increases by 12.7%
# This is the log-linear interpretation — natural for business metrics.
3.17 Quantile Regression — Modeling the Tails
import statsmodels.api as sm
import pandas as pd
import numpy as np
def quantile_regression_suite(X, y, quantiles=None):
"""
Fit quantile regression at multiple quantiles.
Use cases:
- q=0.5 (median): robust alternative to OLS mean
- q=0.9: worst-case prediction (delivery SLA, load estimates)
- Comparing coefficients across quantiles reveals heterogeneous effects:
if beta_j changes a lot from q=0.1 to q=0.9, feature j has
different effects for low vs high Y values.
"""
if quantiles is None:
quantiles = [0.10, 0.25, 0.50, 0.75, 0.90]
X_c = sm.add_constant(X)
results = {}
for q in quantiles:
qr = sm.QuantReg(y, X_c).fit(q=q)
results[q] = qr.params
coef_table = pd.DataFrame(results).T
coef_table.index.name = "quantile"
print("Coefficients across quantiles:")
print(coef_table.round(4))
# Heterogeneous effect detection: large spread = effect varies by Y level
spread = coef_table.max() - coef_table.min()
het = spread[spread > coef_table.abs().mean() * 0.5]
if len(het) > 0:
print(f"\nHeterogeneous effects detected for: {list(het.index)}")
print("These features affect low and high Y values differently.")
return coef_table
# Practical note: quantile regression minimizes the pinball loss
# L(q, y, y_hat) = q*(y-y_hat) if y>=y_hat, else (1-q)*(y_hat-y)
# This asymmetric loss is why predictions converge to quantile q.
3.18 The Log-Transform and Smearing Correction
import numpy as np
from sklearn.linear_model import LinearRegression
def log_linear_with_smearing(X_train, y_train, X_test):
"""
Log-linear model with Duan's smearing correction.
THE MISTAKE everyone makes:
model.fit(X, log(y))
predictions = exp(model.predict(X_test)) # WRONG
WHY IT'S WRONG:
E[Y | X] != exp(E[log(Y) | X])
exp(E[log Y]) = median(Y), not mean(Y)
For right-skewed Y, mean >> median
So naive back-transform systematically under-predicts the mean
CORRECT (Duan 1983):
E[Y | X] = exp(fitted_log_y) * E[exp(residuals)]
The factor E[exp(residuals)] is the smearing correction.
"""
log_y = np.log(y_train)
model = LinearRegression().fit(X_train, log_y)
log_pred_train = model.predict(X_train)
residuals = log_y - log_pred_train
# Duan smearing factor: nonparametric, works for any residual distribution
smearing_factor = np.mean(np.exp(residuals))
log_pred_test = model.predict(X_test)
y_pred_correct = np.exp(log_pred_test) * smearing_factor
y_pred_naive = np.exp(log_pred_test) # wrong for mean prediction
y_pred_median = np.exp(log_pred_test) # actually correct for median
print(f"Smearing factor: {smearing_factor:.4f}")
print(f"Mean prediction (naive): {y_pred_naive.mean():.2f} <-- too low")
print(f"Mean prediction (correct): {y_pred_correct.mean():.2f} <-- use this")
print(f"If smearing factor is 1.0: residuals are symmetric, naive is fine")
return y_pred_correct, smearing_factor
3.19 GAM — Nonlinear and Interpretable
# pip install pygam
from pygam import LinearGAM, s
import numpy as np
import matplotlib.pyplot as plt
def fit_and_interpret_gam(X, y, feature_names, n_splines=20):
"""
GAM: E[Y|X] = intercept + f1(X1) + f2(X2) + ... + fp(Xp)
Each f_j is a smooth spline fit by penalized regression.
Key diagnostic: Effective Degrees of Freedom (EDF) per term.
EDF = 1: linear relationship (f_j is just a line)
EDF = 2-3: mild nonlinearity
EDF > 4: strong nonlinearity (U-shape, thresholds, etc.)
"""
X = np.array(X); p = X.shape[1]
terms = sum(s(i, n_splines=n_splines) for i in range(p))
gam = LinearGAM(terms).gridsearch(X, y) # GCV finds optimal smoothing
print("GAM Results:")
print(f"Pseudo R2: {gam.statistics_['pseudo_r2']['explained_deviance']:.4f}")
print(f"AIC: {gam.statistics_['AIC']:.2f}")
print(f"\nEffective degrees of freedom per feature:")
edfs = gam.statistics_['edof_per_coef']
for i, name in enumerate(feature_names):
edf = edfs[i] # approximate; exact slicing depends on pygam version
status = "LINEAR" if edf < 2 else "NONLINEAR"
print(f" {name:25} EDF={edf:.2f} ({status})")
# Plot partial dependence for each feature
fig, axes = plt.subplots(1, p, figsize=(4*p, 3), sharey=False)
for i, (ax, name) in enumerate(zip(axes, feature_names)):
XX = gam.generate_X_grid(term=i)
pdep, confi = gam.partial_dependence(term=i, X=XX, width=0.95)
ax.plot(XX[:, i], pdep, lw=2, color='#8B2C2C')
ax.fill_between(XX[:, i], confi[:,0], confi[:,1], alpha=0.15, color='#8B2C2C')
ax.axhline(0, color='gray', lw=0.8, linestyle='--')
ax.set_xlabel(name); ax.set_ylabel("f(feature)")
plt.tight_layout()
return gam
- You run
full_regression_diagnostics()and get FAIL on homoskedasticity (BP p=0.001), PASS on everything else. Your business requires valid confidence intervals. What are your three options in priority order? - Lasso eliminates 12 of 30 features (RMSE=0.42). Ridge keeps all 30 (RMSE=0.41). Which do you deploy and why? Under what circumstances would you change your answer?
- You predict customer revenue using log(revenue) + OLS. Predictions seem systematically too low by about 20%. Your colleague says "just multiply predictions by 1.2." What is the actual problem, and what is the principled fix?
- A GAM returns EDF=7.8 for the
customer_agefeature. What does this tell you about the age-revenue relationship? Could a linear regression or even a 2nd-degree polynomial capture this?
The Cutting Edge of Regression and Prediction
Double Machine Learning — Causal Coefficients at Scale
Classical regression coefficients in observational data are not causal — they are contaminated by confounders. Double Machine Learning uses ML to partial out confounders and produces coefficient estimates that are both causal and valid in high dimensions.
Conformal Prediction — Valid Intervals Without Assumptions
Neural Additive Models — Deep Learning with GAM Interpretability
Open Research Problems
- Automatic model specification: Choosing transformations, interactions, and link functions is still largely manual expertise. Symbolic regression (via genetic programming, e.g. PySR) can discover mathematical relationships from data but is not yet reliable enough to replace domain judgment.
- Regression under distribution shift: A model calibrated on Q1 2023 may be poorly calibrated on Q1 2024. Invariant Risk Minimization (IRM) and related methods try to find regression coefficients stable across environments but have not yet found wide practical adoption.
- Causal coefficient estimation at scale: DML handles low-dimensional treatments well. Extending it to settings with many simultaneous treatments and interference between units is an active research area.
- Regression for graph-structured data: When observations are nodes in a network and errors are correlated by adjacency, standard regression SEs are wrong in ways that are hard to correct without knowing the network. Spatial econometrics and network regression methods exist but are not yet mainstream in industry.
Recommended Library Stack for 2026
| Task | Library | Key Function |
|---|---|---|
| OLS + diagnostics + robust SEs | statsmodels | sm.OLS().fit(cov_type='HC3') |
| Regularized regression | scikit-learn | ElasticNetCV, LassoCV, RidgeCV |
| GLMs (Gamma, Poisson, NegBin) | statsmodels | sm.GLM(family=Gamma()) |
| Quantile regression | statsmodels | sm.QuantReg().fit(q=0.9) |
| GAMs | pygam | LinearGAM(s(0)+s(1)).gridsearch() |
| Conformal prediction intervals | MAPIE | MapieRegressor |
| Double Machine Learning | DoubleML | DoubleMLPLR (partially linear) |
| Neural Additive Models | neural-additive-models | NAMClassifier, NAMRegressor |
- You are building a model to predict insurance claim amounts for 50,000 policies. 70% have zero claims; the rest are log-normally distributed. Walk through your complete modeling strategy. What model(s) would you use, and why?
- A colleague fits OLS on customer revenue data and gets R²=0.81 training, 0.79 test. They declare the model production-ready. What four specific things would you check before agreeing, and how would you check each one?
- You need to estimate the causal effect of an email campaign on purchase probability, controlling for 150 behavioral features. Why can't you just run logistic regression with all 150 features as controls? When does Double Machine Learning become necessary?
- Your GAM has EDF=8.3 for the
days_since_signupfeature. Plotting the partial dependence shows the effect is positive up to day 30, then flat, then positive again around day 90 and day 365. What business process might explain this pattern, and how would you communicate it to a non-technical stakeholder?