Chapter 02 Part I · Foundations

Exploratory Data Analysis
& Feature Engineering

Most models fail not because of the algorithm — they fail because of bad data understanding and poor feature design. This chapter covers the systematic craft of knowing your data deeply and transforming it into signals a model can actually use.

📖 ~65 min reading
💻 14 code examples
🔬 5 research papers
🧠 5 core concepts

EDA Is Not Plotting. Feature Engineering Is Not Intuition.

Most people think EDA means making histograms until something looks interesting. That's not EDA — that's wandering. Real EDA is a systematic investigation with specific questions: What is the shape of each variable? What relationships exist? What assumptions does my intended model make, and does my data violate them? Where is the data wrong?

Feature engineering is similarly misunderstood as "adding new columns." The real definition: feature engineering is the process of encoding domain knowledge and data structure into a representation that makes the learning problem easier. A well-engineered feature can make a linear model outperform a deep neural network.

🔑 The Single Most Important Insight

Every machine learning algorithm is a function approximator. Its job is to find a mapping from features to targets. Feature engineering reshapes the input space so that the mapping becomes simpler — ideally linear. A nonlinear relationship in raw space might be perfectly linear in log-space. A cyclical feature (hour of day) that confuses a tree might be trivially separable when encoded as sin/cos components. The model's job gets easier every time you make the underlying structure more explicit.

2.1 The EDA Framework — Systematic, Not Random

Effective EDA always answers questions in this order. Skipping steps produces missed problems that destroy models later.

1
Understand the data generation process

Before touching the data: how was it collected? What does each row represent? What could go wrong at collection time? A clickstream row is not the same as a transaction row. Sensor data from a factory has different failure modes than survey data. The answers determine everything else.

2
Univariate analysis — understand each variable alone

Distribution shape (normal? skewed? bimodal?), range, cardinality for categoricals, missing rate, outlier presence. The goal is not to look at pretty plots — it's to identify every variable that needs transformation before modeling.

3
Bivariate analysis — understand relationships

Feature-target relationships (which features actually predict your target?), feature-feature correlations (multicollinearity), conditional distributions (does the relationship between X and Y change based on a third variable Z?). This is where you find your most powerful features.

4
Data quality audit — find what's wrong

Missing values (MCAR, MAR, MNAR — they require different treatments), duplicates, impossible values (negative ages, future dates), inconsistent encoding (same category spelled multiple ways), unit inconsistencies, temporal leakage.

5
Multivariate analysis — understand the joint structure

Cluster structure in the feature space, class overlap for classification tasks, interaction effects between features, temporal trends if data has a time dimension. Dimensionality reduction (PCA, UMAP) helps visualize this.

2.2 Distributions in Depth — Reading the Shape

A histogram tells you the story of your data if you know how to read it. Each shape implies specific modeling decisions.

Shape PatternWhat It ImpliesWhat To Do
Right-skewed (long right tail) Multiplicative process, bounded below at 0. Revenue, income, counts, durations. Log transform or Box-Cox. Use median not mean for summary stats. Consider log-normal or gamma models.
Bimodal (two humps) Two subpopulations mixed. E.g. users who open emails within 1 min or not at all. Find and encode the grouping variable. Consider modeling the two groups separately. Mixture models.
Uniform distribution Often an ID or hash column disguised as a numeric. Or a truly uninformative feature. Check if it's actually an ID. If genuinely uniform and numeric, it may have weak predictive power.
Spike at zero + continuous tail Zero-inflated data. Many users spend $0; those who spend anything are log-normal. Two-part model: classify zero vs non-zero, then model the non-zero part separately.
Discrete with large gaps Manual data entry, rounding, or ordinal masquerading as numeric. Treat as ordinal or categorical. Check if gaps are meaningful or artifacts.
Heavy-tailed (extreme outliers) Power-law or Pareto process. Social media followers, city sizes, wealth. Winsorize or clip at a quantile for regression. Log transform. Understand if outliers are real.

2.3 Missing Data — Three Mechanisms, Three Treatments

Missing data is not a problem to be "fixed" by deleting rows or filling with the mean. It is a phenomenon that carries information about why data is missing. Getting this wrong causes biased models.

Missing Data Mechanisms: MCAR — Missing Completely At Random The probability of being missing is unrelated to any variable. Example: Lab equipment randomly fails regardless of patient condition. Test: Little's MCAR test (chi-square test on missing patterns) Treatment: Complete-case analysis valid. Mean/median imputation OK. Frequency: Rare in practice. Usually an optimistic assumption. MAR — Missing At Random Missingness depends on OBSERVED variables, not the missing value itself. Example: Income is more often missing for younger respondents (age observed). Test: Compare missing vs non-missing groups on observed variables. Treatment: Multiple imputation (MICE), model-based imputation. Frequency: Common. Most statistical software assumes MAR. MNAR — Missing Not At Random Missingness depends on the MISSING VALUE ITSELF. Example: High earners refuse to report income (income causes missingness). Example: Sick patients miss follow-up appointments. Test: Cannot be verified from data alone — requires domain knowledge. Treatment: Selection models, sensitivity analysis, domain knowledge. Frequency: Common in practice, often ignored. Causes systematic bias.
⚠️ The Mean Imputation Disaster

Mean/median imputation is almost always wrong and causes three specific problems: it shrinks variance (all imputed values are identical, reducing variability), it distorts correlations (replacing missing values with the mean breaks the relationship between that variable and others), and it creates impossible values (imputing mean age = 34.7 when age should be an integer, or mean income in a bimodal distribution falling between the two modes where nobody actually is).

Use instead: MICE (Multiple Imputation by Chained Equations) for MAR data. Add a binary missingness indicator as a feature regardless of imputation method — the fact that a value was missing is itself informative.

2.4 Outliers — Signal or Noise?

The most dangerous question in analytics is: "Should I remove this outlier?" The answer depends entirely on why the value is extreme, which requires understanding the data generation process.

Outliers That Are Errors — Remove/Fix
  • Age = 999 (data entry error)
  • Revenue = -50,000 (sign error)
  • Date = year 2099 (system default)
  • Duplicate rows from JOIN explosion
  • Test/bot traffic in production logs
  • Sensor reading during calibration
Outliers That Are Real — Keep/Model
  • A whale customer spending 100× average
  • A viral post with 1M views vs normal 100
  • A fraudulent transaction (the whole point)
  • A crash event in safety-critical systems
  • Record-breaking stock market day
  • A medical patient with rare condition
✅ The Right Process for Outliers

1. Detect using IQR fences (Q3 + 1.5×IQR), Z-score (>3σ), or Isolation Forest for multivariate outliers. 2. Investigate each one — do not automate removal. 3. Document your decision and the reasoning. 4. If keeping for modeling, use robust methods (Huber loss, quantile regression, tree-based models) that are insensitive to outliers rather than deleting real data. 5. Winsorize (cap at 99th percentile) as a middle ground when outliers are real but extreme enough to destabilize linear models.

2.5 Feature Engineering — The Five Families

Feature engineering techniques fall into five categories. A deep practitioner has all five in their toolkit and knows when each is appropriate.

Family 1: Numerical Transformations

Reshaping the distribution of continuous variables to satisfy model assumptions or make relationships linear.

TransformFormulaUse WhenInverse
Loglog(x) or log1p(x)Right-skewed positive data. Makes multiplicative relationships additive.exp(x)
Square Root√xCount data (Poisson). Milder than log. Works on zeros.
Box-Cox(xᵅ-1)/λOptimal normalization — finds the best λ from data.Parametric
Yeo-JohnsonExtended Box-CoxLike Box-Cox but handles negative values and zeros.Parametric
Rank / QuantileReplace with quantileWhen you need uniform or normal marginal distributions. Robust to outliers.None
WinsorizationClip at p/1-p quantilesReduce outlier influence without removing data.None
BinningCut into intervalsWhen relationship is step-wise, not smooth. Handles non-monotonic effects.None

Family 2: Categorical Encoding

Converting categorical variables into numerical representations. The choice of encoding significantly impacts model performance and is highly model-dependent.

EncodingOutputBest ForAvoid When
One-Hot (OHE)k-1 binary columnsLow-cardinality nominals, linear models, neural netsHigh cardinality (>50 categories) — curse of dimensionality
Label / OrdinalInteger 0 to k-1Ordinal categories (low/med/high), tree modelsNominal categories with no order — implies false ordering
Target EncodingMean target per categoryHigh-cardinality in tree models (cities, products, users)Without cross-validation — causes massive target leakage
Count / FrequencyCategory frequencyWhen frequency itself is informative (popular vs rare)When frequency correlates with target spuriously
Binary / Hashk binary columnsVery high cardinality, memory constraintsWhen collision rate is unacceptable for your cardinality
EmbeddingsDense vector (8-32 dim)Neural nets, very high cardinality (user IDs, product IDs)Small datasets — not enough data to learn embedding quality
WoE (Weight of Evidence)Log(P(event)/P(non-event))Binary classification, credit scoring, logistic regressionMulti-class problems — WoE is inherently binary
⚠️ Target Encoding — The Leakage Trap

Target encoding replaces a category with the mean target value for that category. On the surface, this seems ideal for high-cardinality features like user IDs or zip codes. The trap: you're using the target to create your features. If done naively on the full dataset, the model trains on features that directly encode the answer — this is target leakage, and it produces spectacular train metrics and terrible production performance.

Correct approach: Always use out-of-fold target encoding. For each fold in cross-validation, compute target means on the other folds only. Libraries like category_encoders handle this correctly with the TargetEncoder class.

Family 3: Datetime Features

Time is one of the richest sources of features in business data. Raw timestamps are useless — their meaning must be extracted and encoded correctly.

FeatureWhy It MattersEncoding Note
Hour of dayTraffic, purchases, support tickets all have hourly patternsUse sin/cos: sin(2π·hour/24), cos(2π·hour/24) — encodes cyclical continuity (23:59 is close to 00:01)
Day of weekB2B vs B2C behave differently weekday vs weekendsin/cos encoding: sin(2π·dow/7)
Month / SeasonSeasonality in sales, churn, engagementsin/cos for cyclical, or one-hot if pattern is not smooth
Is weekend / holidayStrong behavioral discontinuity at these boundariesBinary flag. Use a holiday calendar library (workalendar, holidays)
Time since eventRecency: days since last purchase, last login, last support ticketRaw numeric. Often log-transform for heavy-tailed distributions
Time to eventDays until contract renewal, subscription expiry, warranty endRaw numeric or bucketed (30/60/90 day windows)
Elapsed time in periodDay 3 of month vs day 28 — payroll timing effectsNormalize: day_of_month / days_in_month → [0,1]

Family 4: Interaction & Polynomial Features

Linear models can only capture additive effects. If the effect of feature A depends on the value of feature B, you need to encode that interaction explicitly. Tree models find interactions automatically — linear models, neural nets, and SVMs do not.

Polynomial Features — Expanding the Feature Space Original features: [x₁, x₂] Degree-2 polynomial expansion: [x₁, x₂, x₁², x₁x₂, x₂²] Interaction term x₁x₂ captures: "the effect of x₁ depends on x₂" Example: price_sensitivity × income_level (high income users are less sensitive to price changes) Warning: Degree-3 with 10 features → 286 features Degree-3 with 100 features → 176,851 features Always use regularization (Lasso/Ridge) with polynomial features.

Family 5: Aggregation & Window Features

The richest features for tabular ML are often aggregations across related rows — computing behavioral summaries per entity (user, product, store) over time windows.

Window Aggregation Example: E-Commerce User Features Raw event log: user_id | event_time | event_type | amount U001 | 2025-01-15 14:22:00 | purchase | 89.99 U001 | 2025-01-18 09:45:00 | page_view | 0 U001 | 2025-01-20 16:10:00 | add_to_cart | 0 U001 | 2025-01-21 11:30:00 | purchase | 124.50 Engineered features (as of 2025-01-22): Recency: days_since_last_purchase = 1 Frequency: purchases_last_30d = 2, purchases_last_90d = 4 Monetary: total_spend_last_30d = 214.49, avg_order_value = 107.25 Behavior: page_views_last_7d = 3, add_to_cart_rate = 0.33 Trend: spend_growth_rate = (214.49 - 89.99) / 89.99 = 1.38 Sequence: days_between_purchases_avg = 6.0 This is the RFM (Recency-Frequency-Monetary) framework, the foundation of customer analytics since the 1990s.

2.6 The Leakage Problem — The Silent Model Killer

Target leakage occurs when your features contain information that would not be available at prediction time. It is the single most common cause of models that perform brilliantly in evaluation and catastrophically in production.

⚠️ Real Leakage Examples That Destroyed Production Models
  • Hospital readmission model: Included "discharge_type" as a feature. Patients who died in hospital never got readmitted — discharge_type="deceased" perfectly predicted no readmission. Not available at prediction time.
  • Fraud detection: Included "fraud_review_flag" set by the fraud team — this was set after fraud was discovered, not before. The model learned to detect when the fraud team had already found fraud.
  • Churn prediction: Included "last_contact_reason = cancellation_inquiry" — this was set when the customer called to cancel, which is exactly when they were about to churn. Available at prediction time, but encodes the target almost perfectly.
  • Loan default: Used future account balance data — the aggregation included data from months after the loan decision point.
✅ How to Prevent Leakage Systematically

Temporal discipline: For any feature that changes over time, define a strict "feature cutoff date" before the prediction date. All features must be computed using only data available before that cutoff. Treat-time simulation: Simulate exactly what data would have been available at the moment the prediction would be made in production. Suspicious AUC check: If a new feature increases AUC from 0.72 to 0.98, assume leakage first. Investigate before celebrating.

2.7 Feature Selection — Less Is Often More

Adding more features does not monotonically improve model performance. Irrelevant and redundant features add noise, increase training time, make models less interpretable, and in small datasets cause overfitting.

MethodTypeHow It WorksStrength
Correlation / Mutual InfoFilterRank features by univariate relationship with targetFast, model-agnostic. Good first pass.
Variance ThresholdFilterRemove near-zero variance featuresRemoves constants and near-constants instantly
Lasso (L1)EmbeddedRegularization drives irrelevant feature coefficients to exactly 0Automatic, handles collinear features
Feature Importance (Trees)EmbeddedAverage impurity decrease per feature across all treesCaptures non-linear relationships
Permutation ImportanceModel-agnosticShuffle each feature; measure performance dropWorks for any model, respects interactions
SHAP ValuesModel-agnosticGame-theoretic attribution of each feature's contributionMost faithful importance measure, handles correlations
RFE (Recursive Feature Elimination)WrapperIteratively remove least important features, re-trainFinds optimal subset but computationally expensive
BorutaWrapperCompare real features against randomly shuffled shadow featuresStatistically rigorous, finds all relevant features

EDA & Feature Engineering Across Industries

2.8 E-Commerce — Customer Lifetime Value Prediction

Predicting which customers will be most valuable over the next 12 months. The quality of the feature set determines almost everything about model performance.

Raw data available: order history, browsing sessions, email interactions, support tickets, demographics.

EDA reveals:

  • CLV is log-normally distributed — 80% of customers contribute 20% of revenue. Mean CLV of $240 is meaningless; median $45 is more representative
  • 53% of customers made exactly 1 purchase and never returned — the "one-and-done" segment dominates
  • Strong bimodality in session duration: 30 seconds (bounce) vs 8 minutes (engaged). Two different user types mixed in one dataset
  • Missing email open rate for 34% of customers — MNAR (non-openers have no open data), needs a separate "no_email_data" flag

Feature engineering decisions:

  • Log-transform CLV target before regression — residuals are then approximately normal
  • Create is_repeat_buyer binary flag — the one-and-done segment needs a separate model or a strong indicator
  • RFM features over 30/60/90/180 day windows — recency decay is exponential, so log(days_since_last_purchase)
  • Session depth score: pages_viewed × avg_time_per_page — single interaction captures both breadth and depth
  • Category affinity vector: normalized purchase amounts per category — who is a "shoes person" vs "electronics person"
  • Email engagement ratio: opens / sends with Laplace smoothing for low-send customers

2.9 Healthcare — Readmission Risk Prediction

Predicting 30-day hospital readmission from electronic health records (EHR). Feature engineering here directly impacts patient outcomes.

Critical EDA findings:

  • Diagnosis codes (ICD-10) have 70,000+ unique values — direct one-hot encoding is impossible. Need hierarchical grouping (CCS categories) or embedding
  • Lab values have extreme missingness (40-80%) that is MNAR — sicker patients get more tests, so missing lab = probably less sick is often wrong
  • Length of stay is right-skewed with a spike at 0 (same-day discharge) — zero-inflated, needs separate treatment
  • Temporal leakage risk: discharge medications, discharge instructions, and follow-up appointment bookings all occur after the readmission risk window starts

Feature engineering patterns specific to healthcare:

  • Comorbidity indices: Charlson Comorbidity Index and Elixhauser scores aggregate diagnosis codes into validated risk scores — domain knowledge compressed into a single feature
  • Lab trajectory: Not just the last lab value but the trend — creatinine rising 0.3 mg/dL over 3 days is far more alarming than a static elevated value
  • Medication complexity: Count of medications, count of drug classes, polypharmacy flag (>10 medications)
  • Social determinants: Zip code poverty rate, distance to hospital — available from public databases, highly predictive

2.10 Financial Services — Credit Default Prediction

Building a credit scorecard requires both excellent EDA and regulatory-compliant feature engineering. In many jurisdictions, certain features are legally prohibited (race, gender, age in some contexts), requiring careful data governance.

EDA specifics for credit data:

  • Default rates are typically 1-5% — severe class imbalance requires specific handling (SMOTE, cost-sensitive learning)
  • Many applicants have thin credit files — missing features are MNAR (no credit history = different risk profile than those with missing data by chance)
  • Strong multicollinearity among credit bureau features — need to identify and handle redundant features (VIF analysis)

WoE (Weight of Evidence) encoding is the industry standard for credit because it:

  • Handles missing values naturally (missing = its own category)
  • Produces a monotonic transformation interpretable by regulators
  • Works well with logistic regression (the required interpretable model)
  • The IV (Information Value) provides a built-in feature selection metric

2.11 Manufacturing — Predictive Maintenance

Using sensor data (vibration, temperature, current draw) to predict machine failure before it happens. Feature engineering on time series sensor data is its own specialized domain.

Key feature engineering patterns:

  • Rolling statistics: Mean, std, min, max over 1/5/30 minute windows — captures both level and volatility changes that precede failure
  • FFT features: Fast Fourier Transform extracts frequency domain features from vibration signals — rotating machinery failure shows up as specific frequency harmonics before it's visible in raw signal
  • Change point features: Sudden shifts in sensor mean or variance are often the earliest failure signal — CUSUM statistics as features
  • Cross-sensor features: Temperature × vibration interaction, or deviation from expected temperature given current draw — encodes physics-based domain knowledge
  • Time-to-previous-maintenance: Features encoding when last service was performed and what was done

When EDA Reveals You Need a Different Model Entirely

📌 The Most Important Outcome of EDA

Sometimes thorough EDA reveals that the modeling approach you planned is wrong for the data structure you have. Examples: data with strong cluster structure suggests you need separate models per cluster rather than one global model. Zero-inflated data suggests a two-part model. Strong temporal autocorrelation means your train/test split must be temporal, not random. Data that changes in character over time (concept drift) means you need a model that retrains frequently. EDA should inform the entire modeling strategy — it's not just a preprocessing step.

A Production EDA & Feature Engineering Toolkit

3.1 Automated EDA Report

Python — Systematic data profiling from scratch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import shapiro, kstest, skew, kurtosis
import warnings
warnings.filterwarnings('ignore')

class DataProfiler:
    """
    Systematic EDA profiler. Produces an actionable report of every
    column's properties and recommended treatments.
    
    Covers: distributions, missing data, outliers, cardinality,
            skewness, correlations, and recommended transforms.
    """

    def __init__(self, df: pd.DataFrame, target: str = None):
        self.df     = df.copy()
        self.target = target
        self.report = {}

    def profile_numeric(self, col: str) -> dict:
        s = self.df[col].dropna()
        n_missing = self.df[col].isna().sum()
        pct_missing = n_missing / len(self.df)

        # Normality test (Shapiro for n<5000, KS for larger)
        if len(s) < 5000:
            _, p_normal = shapiro(s.sample(min(len(s), 5000), random_state=42))
        else:
            _, p_normal = kstest((s - s.mean()) / s.std(), 'norm')

        sk = skew(s)
        kurt = kurtosis(s)

        # Outlier detection: IQR method
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        n_outliers = ((s < q1 - 1.5*iqr) | (s > q3 + 1.5*iqr)).sum()

        # Recommend transformation
        transform_rec = 'none'
        if (s > 0).all() and sk > 1.0:
            transform_rec = 'log'
        elif (s >= 0).all() and sk > 0.5:
            transform_rec = 'sqrt'
        elif abs(sk) > 0.5:
            transform_rec = 'yeo-johnson'

        return {
            'type':          'numeric',
            'n':             len(s),
            'n_missing':     n_missing,
            'pct_missing':   pct_missing,
            'mean':          s.mean(),
            'median':        s.median(),
            'std':           s.std(),
            'min':           s.min(),
            'max':           s.max(),
            'p1':            s.quantile(0.01),
            'p99':           s.quantile(0.99),
            'skewness':      sk,
            'kurtosis':      kurt,
            'is_normal':     p_normal > 0.05,
            'n_outliers_iqr': n_outliers,
            'pct_outliers':  n_outliers / len(s),
            'transform_rec': transform_rec,
            'zero_pct':      (s == 0).mean(),
            'negative_pct':  (s < 0).mean(),
        }

    def profile_categorical(self, col: str) -> dict:
        s = self.df[col]
        n_missing   = s.isna().sum()
        n_unique    = s.nunique()
        top_5       = s.value_counts().head(5).to_dict()
        top1_pct    = s.value_counts(normalize=True).iloc[0] if n_unique > 0 else 0

        # Cardinality-based encoding recommendation
        if n_unique <= 2:    enc = 'binary'
        elif n_unique <= 10: enc = 'one-hot'
        elif n_unique <= 50: enc = 'target-encoding (with CV)'
        else:               enc = 'target-encoding or embedding'

        return {
            'type':          'categorical',
            'n_unique':      n_unique,
            'n_missing':     n_missing,
            'pct_missing':   n_missing / len(self.df),
            'top_5_values':  top_5,
            'top1_dominance': top1_pct,
            'encoding_rec':  enc,
        }

    def run(self) -> pd.DataFrame:
        rows = []
        for col in self.df.columns:
            if self.df[col].dtype in ['int64', 'float64']:
                info = self.profile_numeric(col)
            else:
                info = self.profile_categorical(col)
            info['column'] = col
            rows.append(info)

        summary = pd.DataFrame(rows).set_index('column')

        # Flag columns needing attention
        print("═"*60)
        print("DATA PROFILING REPORT — Action Items")
        print("═"*60)
        for col, row in summary.iterrows():
            issues = []
            if row.get('pct_missing', 0) > 0.05:
                issues.append(f"⚠ {row['pct_missing']:.0%} missing")
            if row.get('transform_rec', 'none') != 'none':
                issues.append(f"→ apply {row['transform_rec']}")
            if row.get('pct_outliers', 0) > 0.05:
                issues.append(f"⚠ {row['pct_outliers']:.0%} outliers")
            if issues:
                print(f"  {col:<30}" + "  |  ".join(issues))
        return summary

# ─── Usage ────────────────────────────────────────────────────
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)
df = housing.frame

profiler = DataProfiler(df, target='MedHouseVal')
report = profiler.run()
print("\nFull report:")
print(report[['type', 'pct_missing', 'skewness', 'transform_rec']].round(3))

3.2 Missing Data — MICE Imputation

Python — Multiple Imputation by Chained Equations
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge

def prepare_imputation(df: pd.DataFrame, target_col: str = None) -> pd.DataFrame:
    """
    Production-grade missing data handling.
    
    Strategy:
    1. Add missingness indicator columns (always informative)
    2. Separate columns by missing rate
    3. Apply MICE for MAR columns
    4. Flag MNAR columns for manual review
    """
    df = df.copy()
    missing_pcts = df.isnull().mean()
    missing_cols = missing_pcts[missing_pcts > 0].index.tolist()

    # Step 1: Always add binary missingness indicators FIRST
    # The fact that data is missing is itself a feature
    for col in missing_cols:
        df[f'missing_{col}'] = df[col].isna().astype(int)

    print("Missingness indicators added for:", missing_cols)

    # Step 2: MICE imputation using Bayesian Ridge as the estimator
    # BayesianRidge is fast and works well for most numeric data
    # For categorical columns: use RandomForestClassifier instead
    numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
    numeric_cols = [c for c in numeric_cols if not c.startswith('missing_')]

    mice_imputer = IterativeImputer(
        estimator=BayesianRidge(),  # model each column given all others
        max_iter=10,              # iterations of chained equations
        random_state=42,
        verbose=0
    )

    df[numeric_cols] = mice_imputer.fit_transform(df[numeric_cols])
    print(f"MICE imputation complete for {len(numeric_cols)} numeric columns")

    return df, mice_imputer

# ─── Test with synthetic data with different missing mechanisms ─
np.random.seed(42)
n = 1000
df_test = pd.DataFrame({
    'age':    np.random.normal(45, 15, n),
    'income': np.random.lognormal(10, 1, n),
    'score':  np.random.normal(700, 80, n),
})

# MCAR: 10% randomly missing
mcar_mask = np.random.rand(n) < 0.10
df_test.loc[mcar_mask, 'score'] = np.nan

# MAR: income missing more for younger people
mar_mask = (np.random.rand(n) < 0.3) & (df_test['age'] < 35)
df_test.loc[mar_mask, 'income'] = np.nan

print(f"Missing before: {df_test.isnull().sum().to_dict()}")
df_imputed, imputer = prepare_imputation(df_test)
print(f"Missing after:  {df_imputed[['age','income','score']].isnull().sum().to_dict()}")

3.3 Feature Engineering Pipeline — Production Grade

Python — sklearn Pipeline with custom transformers
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (PowerTransformer, StandardScaler,
                                      OneHotEncoder, TargetEncoder)
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor

# ─── Custom Transformer 1: Datetime Feature Extractor ─────────
class DatetimeFeatures(BaseEstimator, TransformerMixin):
    """Extract rich features from a datetime column using cyclical encoding."""

    def __init__(self, col: str):
        self.col = col

    def fit(self, X, y=None): return self

    def transform(self, X):
        X = X.copy()
        dt = pd.to_datetime(X[self.col])
        # Cyclical encoding: hour, day of week, month
        X['hour_sin']  = np.sin(2*np.pi*dt.dt.hour / 24)
        X['hour_cos']  = np.cos(2*np.pi*dt.dt.hour / 24)
        X['dow_sin']   = np.sin(2*np.pi*dt.dt.dayofweek / 7)
        X['dow_cos']   = np.cos(2*np.pi*dt.dt.dayofweek / 7)
        X['month_sin'] = np.sin(2*np.pi*dt.dt.month / 12)
        X['month_cos'] = np.cos(2*np.pi*dt.dt.month / 12)
        X['is_weekend'] = (dt.dt.dayofweek >= 5).astype(int)
        X['is_month_end'] = dt.dt.is_month_end.astype(int)
        return X.drop(columns=[self.col])

# ─── Custom Transformer 2: RFM Features ───────────────────────
class RFMFeatures(BaseEstimator, TransformerMixin):
    """
    Compute Recency-Frequency-Monetary features from transaction data.
    snapshot_date: the "as of" date for recency calculation.
    windows_days: time windows for rolling aggregations.
    """

    def __init__(self, customer_col, date_col, amount_col,
                 windows_days=[7, 30, 90], snapshot_date=None):
        self.customer_col  = customer_col
        self.date_col      = date_col
        self.amount_col    = amount_col
        self.windows_days  = windows_days
        self.snapshot_date = snapshot_date

    def fit(self, X, y=None): return self

    def transform(self, X):
        X = X.copy()
        X[self.date_col] = pd.to_datetime(X[self.date_col])
        snap = self.snapshot_date or X[self.date_col].max()

        features = []
        for cid, grp in X.groupby(self.customer_col):
            grp = grp.sort_values(self.date_col)
            row = {self.customer_col: cid}

            # Recency (lower = more recent = better)
            last_tx = grp[self.date_col].max()
            row['recency_days'] = (snap - last_tx).days
            row['log_recency']  = np.log1p(row['recency_days'])

            # Rolling frequency and monetary for each window
            for w in self.windows_days:
                cutoff = snap - pd.Timedelta(days=w)
                recent = grp[grp[self.date_col] >= cutoff]
                row[f'freq_{w}d']      = len(recent)
                row[f'total_spend_{w}d'] = recent[self.amount_col].sum()
                row[f'avg_spend_{w}d']   = recent[self.amount_col].mean() if len(recent) > 0 else 0

            # Trend: spend in last 30d vs 31-60d (momentum feature)
            w1 = snap - pd.Timedelta(days=30)
            w2 = snap - pd.Timedelta(days=60)
            recent_30  = grp[grp[self.date_col] >= w1][self.amount_col].sum()
            prev_30    = grp[(grp[self.date_col] >= w2) &
                             (grp[self.date_col] <  w1)][self.amount_col].sum()
            row['spend_momentum'] = np.log1p(recent_30) - np.log1p(prev_30)

            features.append(row)

        return pd.DataFrame(features).set_index(self.customer_col)


# ─── Custom Transformer 3: Outlier Winsorizer ─────────────────
class Winsorizer(BaseEstimator, TransformerMixin):
    """Clip numeric features at specified quantile bounds."""

    def __init__(self, lower=0.01, upper=0.99):
        self.lower = lower; self.upper = upper
        self.bounds_ = {}

    def fit(self, X, y=None):
        if isinstance(X, pd.DataFrame):
            for col in X.columns:
                self.bounds_[col] = (X[col].quantile(self.lower),
                                     X[col].quantile(self.upper))
        else:
            for i in range(X.shape[1]):
                self.bounds_[i] = (np.quantile(X[:,i], self.lower),
                                   np.quantile(X[:,i], self.upper))
        return self

    def transform(self, X):
        X = X.copy()
        if isinstance(X, pd.DataFrame):
            for col, (lo, hi) in self.bounds_.items():
                X[col] = X[col].clip(lo, hi)
        else:
            for i, (lo, hi) in self.bounds_.items():
                X[:,i] = np.clip(X[:,i], lo, hi)
        return X


# ─── Full Feature Pipeline ─────────────────────────────────────
def build_feature_pipeline(
    numeric_cols:    list,
    categorical_low: list,   # low cardinality → one-hot
    categorical_high: list,  # high cardinality → target encoding
) -> ColumnTransformer:
    """
    Builds a production sklearn ColumnTransformer that:
    - Winsorizes then Yeo-Johnson transforms numeric columns
    - One-hot encodes low-cardinality categoricals
    - Target-encodes high-cardinality categoricals
    
    Fit once on train set. Apply to train, val, test.
    No data leakage possible since all stats computed on train only.
    """
    numeric_pipeline = Pipeline([
        ('winsorize',  Winsorizer(lower=0.01, upper=0.99)),
        ('transform',  PowerTransformer(method='yeo-johnson')),
        ('scale',      StandardScaler()),
    ])

    categorical_low_pipeline = Pipeline([
        ('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
    ])

    # TargetEncoder in sklearn >= 1.3 does cross-fit automatically
    categorical_high_pipeline = Pipeline([
        ('te', TargetEncoder(smooth='auto')),
    ])

    preprocessor = ColumnTransformer([
        ('numeric',    numeric_pipeline,          numeric_cols),
        ('cat_low',    categorical_low_pipeline,  categorical_low),
        ('cat_high',   categorical_high_pipeline, categorical_high),
    ], remainder='drop')

    return preprocessor

3.4 Feature Selection — Boruta + SHAP

Python — Rigorous feature selection with Boruta and SHAP
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.datasets import fetch_california_housing

def shap_feature_importance(
    model, X_train: pd.DataFrame, n_display: int = 20
) -> pd.DataFrame:
    """
    Compute SHAP-based feature importance.
    
    SHAP is superior to built-in feature_importances_ because:
    1. Handles correlated features correctly (splits credit)
    2. Provides direction (positive/negative effect)
    3. Is consistent: a feature that helps more always gets higher SHAP
    4. Model-agnostic (works for any model)
    """
    explainer    = shap.TreeExplainer(model)
    shap_values  = explainer.shap_values(X_train)

    # Mean absolute SHAP value = importance
    mean_abs_shap = np.abs(shap_values).mean(axis=0)
    importance_df = pd.DataFrame({
        'feature':    X_train.columns,
        'shap_importance': mean_abs_shap,
        'mean_shap':  shap_values.mean(axis=0),  # direction of effect
    }).sort_values('shap_importance', ascending=False)

    print(f"Top {n_display} features by SHAP importance:")
    print(importance_df.head(n_display).to_string(index=False))

    # SHAP summary plot
    shap.summary_plot(shap_values, X_train, show=False)
    return importance_df

def permutation_importance_cv(
    model, X: pd.DataFrame, y: pd.Series,
    n_repeats: int = 10
) -> pd.DataFrame:
    """
    Permutation importance: shuffle one feature at a time,
    measure performance drop. Features that matter show large drops.
    
    More honest than impurity-based importance for correlated features.
    """
    from sklearn.inspection import permutation_importance
    result = permutation_importance(
        model, X, y, n_repeats=n_repeats, random_state=42
    )
    perm_df = pd.DataFrame({
        'feature':   X.columns,
        'importance_mean': result.importances_mean,
        'importance_std':  result.importances_std,
    }).sort_values('importance_mean', ascending=False)

    # Features with importance < 0 are actively hurting the model
    negative = perm_df[perm_df['importance_mean'] < 0]
    if len(negative) > 0:
        print(f"WARNING: {len(negative)} features have NEGATIVE importance:")
        print(negative['feature'].tolist())
        print("Consider removing these — they actively hurt your model.")

    return perm_df

# ─── Example: California housing ──────────────────────────────
housing = fetch_california_housing(as_frame=True)
X, y = housing.data, housing.target
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X, y)

importance_df = shap_feature_importance(rf, X)

3.5 Leakage Detection — Automated Checks

Python — Detecting target leakage before it destroys your model
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def leakage_detector(
    df: pd.DataFrame,
    target: str,
    threshold_corr: float = 0.85,
    threshold_auc:  float = 0.95
) -> pd.DataFrame:
    """
    Automatically flag features that may be leaking target information.
    
    Two checks:
    1. Correlation check: features with |corr| > threshold are suspicious
    2. Single-feature AUC check: a feature that alone predicts the target
       with AUC > threshold is almost certainly leaking
    """
    flags = []
    X = df.drop(columns=[target])
    y = df[target]

    for col in X.columns:
        row = {'feature': col, 'leakage_flag': False, 'reason': []}

        # Check 1: Pearson correlation with target
        if X[col].dtype in ['float64', 'int64']:
            corr = abs(X[col].corr(y))
            row['correlation'] = corr
            if corr > threshold_corr:
                row['leakage_flag'] = True
                row['reason'].append(f"HIGH CORR={corr:.3f}")

        # Check 2: Single-feature predictive power (for classification)
        if y.nunique() == 2 and X[col].dtype in ['float64', 'int64']:
            clf = RandomForestClassifier(n_estimators=10, random_state=42)
            x_col = X[[col]].fillna(X[col].median())
            auc = cross_val_score(clf, x_col, y, cv=3, scoring='roc_auc').mean()
            row['single_feature_auc'] = auc
            if auc > threshold_auc:
                row['leakage_flag'] = True
                row['reason'].append(f"SINGLE FEATURE AUC={auc:.3f}")

        row['reason'] = ' | '.join(row['reason']) if row['reason'] else 'clean'
        flags.append(row)

    result = pd.DataFrame(flags).sort_values('leakage_flag', ascending=False)
    leaked = result[result['leakage_flag']]
    if len(leaked) > 0:
        print("⚠️  POTENTIAL LEAKAGE DETECTED:")
        print(leaked[['feature', 'reason']].to_string())
    else:
        print("✅ No obvious leakage detected.")
    return result

  
🧠 Section 3 Check — Implementation
  1. The DataProfiler recommends a log transform for a column with skewness 2.3. But the column has 15% zeros. Why would log(x) fail here, and what should you use instead? How would you modify the profile_numeric method to catch this case?
  2. You're building a churn model. Your leakage_detector flags support_ticket_reason = "cancellation" with AUC=0.97. Does this mean you should remove the feature? Under what circumstances could it be a legitimate (non-leaking) feature?
  3. The RFMFeatures transformer uses np.log1p for recency. Why log1p specifically instead of log? What would happen to a customer with 0 days since last purchase if you used log?
For Q1: log(0) = -infinity. Use np.log1p(x) which computes log(1+x) and handles zeros gracefully. Update the transform_rec logic to check (s >= 0).all() instead of (s > 0).all() and use 'log1p' as the recommendation.

The Cutting Edge of EDA & Feature Engineering

4.1 Automated Feature Engineering (AutoFE)

Manual feature engineering is the most time-consuming part of the ML pipeline and requires deep domain expertise. Research into automating this process has produced several powerful approaches.

Deep Feature Synthesis: Towards Automating Data Science Endeavors
Kanter & Veeramachaneni, 2015 · MIT · Implemented as Featuretools
Introduces Deep Feature Synthesis (DFS), which automatically generates features from relational databases by stacking aggregation and transformation primitives across entity relationships. Given a database schema (customers → orders → products), DFS generates hundreds of meaningful features like "max purchase amount in last 30 days" automatically. The featuretools library implements this. In Kaggle competitions, DFS-generated features routinely outperform manually engineered ones. Key limitation: computational cost scales with data size and schema complexity.
AutoFELiX: Automated Feature Engineering for Relational Data
Lam et al., 2023 · Microsoft Research
Uses LLMs to suggest features based on the semantic meaning of column names and metadata — moving AutoFE beyond purely statistical pattern matching. Given columns named "last_login_date" and "signup_date", the LLM proposes "days_since_last_login" and "account_age_days" as meaningful derived features. Represents the convergence of LLMs and traditional ML pipelines.

4.2 Self-Supervised Representation Learning for Tabular Data

Rather than manually engineering features, self-supervised learning learns rich representations directly from raw tabular data without labels. This is the tabular equivalent of what BERT did for NLP.

SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption
Bahri et al., 2022 · Google Brain
Trains an encoder on tabular data by randomly corrupting subsets of features and learning to distinguish corrupted from original rows (contrastive learning). The learned representations outperform hand-crafted features on downstream tasks, especially in low-label regimes. Shows that tabular data has learnable structure that self-supervised methods can exploit. Particularly valuable when labeled data is scarce but unlabeled data is abundant — a common situation in enterprise analytics.
TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second
Hollmann et al., 2022 · AutoML Freiburg
A pre-trained transformer that performs in-context learning on tabular data — it processes the entire training set as context and predicts test labels without any fine-tuning. For small datasets (<1000 rows), it achieves state-of-art performance against XGBoost and AutoML systems, often in milliseconds. Represents a fundamental shift: instead of feature engineering + model training, the model itself learns the feature structure from context. Available as a Python library.

4.3 Missing Data — GAIN and Deep Imputation

GAIN: Missing Data Imputation using Generative Adversarial Nets
Yoon, Jordon & van der Schaar, 2018 · ICML
Uses a GAN to impute missing values by training a generator to fill in missing data and a discriminator to distinguish imputed from real values. Significantly outperforms MICE on high-missingness, complex-dependency scenarios. The key insight: MICE assumes linear relationships between variables for imputation; GAIN learns the full nonlinear joint distribution. Most practically significant for datasets with >30% missingness and complex inter-variable relationships.

4.4 Feature Interaction Discovery — Neural Network Approaches

Neural Interaction Detection
Tsang, Cheng & Liu, 2018 · NeurIPS
Uses the weight matrix structure of trained neural networks to automatically detect which feature interactions are most important. More powerful than manually trying all pairwise interactions. The detected interactions can be added as explicit features to gradient boosting models, combining the expressiveness of neural interaction detection with the performance of tree-based models. Available as the FAST (Feature Interaction and Structured Testing) framework.

4.5 Open Research Problems

  • EDA for high-dimensional data: When you have 10,000 features, systematic univariate and bivariate EDA becomes computationally intractable. Scalable EDA methods that identify the most important patterns without exhaustive pair-wise analysis are an active area.
  • Feature engineering for heterogeneous data: Real enterprise data combines tabular, text, images, time series, and graph data. Principled methods for creating unified feature representations from these diverse sources remain an open problem.
  • Temporal leakage detection: Automated detection of temporal leakage in complex pipelines — where the leakage path goes through multiple feature transformations — is still largely unsolved. Current tools require manual inspection.
  • Feature engineering under distribution shift: Features engineered on training data may become useless or harmful when the data distribution shifts. Robust feature engineering that generalizes across distribution shifts is an active research area.
  • Causal feature selection: Traditional feature selection finds correlates of the target. Causal feature selection finds causes. The practical difference: a causal feature remains predictive when the distribution changes; a correlated feature may not. Methods from causal discovery (PC algorithm, FCI, GES) are being integrated with ML pipelines.

4.6 Essential Libraries for 2026

LibraryPurposeStandout Feature
ydata-profilingAutomated EDA reportsOne-line HTML report with all distributions, correlations, missing data
featuretoolsAutomated feature engineeringDeep Feature Synthesis from relational data
category_encodersAll encoding strategiesWoE, Target, James-Stein, Binary, all with proper CV support
missingnoMissing data visualizationMatrix, heatmap, dendrogram of missing patterns — MCAR vs MAR detection
shapFeature importance & explanationTreeExplainer for XGBoost/LightGBM is near-instant even on large models
skrubDirty data & encodingFuzzy string matching, GapEncoder for messy categories
TabPFNSmall dataset classificationZero feature engineering needed for <1000 rows, near-instant training
dirty_catHigh-cardinality categoricalsSimilarityEncoder handles typos and variations in string categories
🧠 Chapter 2 — Final Questions
  1. You're given a dataset with 500 features to build a churn model. Walk through your complete EDA process step by step — what specific things are you looking for at each stage, and what decisions does each stage inform?
  2. A feature days_since_last_purchase has a correlation of 0.72 with your target (churn). Your leakage detector flags it. Is this leakage? Justify your answer by thinking through whether this information would be available at prediction time.
  3. You have a categorical feature city with 8,000 unique values. Half of the cities have fewer than 5 observations in your training set. Why does standard target encoding fail catastrophically for these rare cities, and how does smooth (shrinkage) target encoding fix it mathematically?
  4. Your model's training AUC is 0.91 but validation AUC is 0.64. You've already done train/test split correctly. Name three feature engineering errors that could cause this specific pattern.
For Q3: For rare cities, the mean target over 5 observations is extremely noisy. Smooth encoding blends the category estimate toward the global mean: θ_smooth = (n × θ_category + m × θ_global) / (n + m), where m is a smoothing hyperparameter. As n → 0, the estimate collapses to the global mean (safe); as n → ∞, it converges to the category mean (informative).