Exploratory Data Analysis
& Feature Engineering
Most models fail not because of the algorithm — they fail because of bad data understanding and poor feature design. This chapter covers the systematic craft of knowing your data deeply and transforming it into signals a model can actually use.
EDA Is Not Plotting. Feature Engineering Is Not Intuition.
Most people think EDA means making histograms until something looks interesting. That's not EDA — that's wandering. Real EDA is a systematic investigation with specific questions: What is the shape of each variable? What relationships exist? What assumptions does my intended model make, and does my data violate them? Where is the data wrong?
Feature engineering is similarly misunderstood as "adding new columns." The real definition: feature engineering is the process of encoding domain knowledge and data structure into a representation that makes the learning problem easier. A well-engineered feature can make a linear model outperform a deep neural network.
Every machine learning algorithm is a function approximator. Its job is to find a mapping from features to targets. Feature engineering reshapes the input space so that the mapping becomes simpler — ideally linear. A nonlinear relationship in raw space might be perfectly linear in log-space. A cyclical feature (hour of day) that confuses a tree might be trivially separable when encoded as sin/cos components. The model's job gets easier every time you make the underlying structure more explicit.
2.1 The EDA Framework — Systematic, Not Random
Effective EDA always answers questions in this order. Skipping steps produces missed problems that destroy models later.
Before touching the data: how was it collected? What does each row represent? What could go wrong at collection time? A clickstream row is not the same as a transaction row. Sensor data from a factory has different failure modes than survey data. The answers determine everything else.
Distribution shape (normal? skewed? bimodal?), range, cardinality for categoricals, missing rate, outlier presence. The goal is not to look at pretty plots — it's to identify every variable that needs transformation before modeling.
Feature-target relationships (which features actually predict your target?), feature-feature correlations (multicollinearity), conditional distributions (does the relationship between X and Y change based on a third variable Z?). This is where you find your most powerful features.
Missing values (MCAR, MAR, MNAR — they require different treatments), duplicates, impossible values (negative ages, future dates), inconsistent encoding (same category spelled multiple ways), unit inconsistencies, temporal leakage.
Cluster structure in the feature space, class overlap for classification tasks, interaction effects between features, temporal trends if data has a time dimension. Dimensionality reduction (PCA, UMAP) helps visualize this.
2.2 Distributions in Depth — Reading the Shape
A histogram tells you the story of your data if you know how to read it. Each shape implies specific modeling decisions.
| Shape Pattern | What It Implies | What To Do |
|---|---|---|
| Right-skewed (long right tail) | Multiplicative process, bounded below at 0. Revenue, income, counts, durations. | Log transform or Box-Cox. Use median not mean for summary stats. Consider log-normal or gamma models. |
| Bimodal (two humps) | Two subpopulations mixed. E.g. users who open emails within 1 min or not at all. | Find and encode the grouping variable. Consider modeling the two groups separately. Mixture models. |
| Uniform distribution | Often an ID or hash column disguised as a numeric. Or a truly uninformative feature. | Check if it's actually an ID. If genuinely uniform and numeric, it may have weak predictive power. |
| Spike at zero + continuous tail | Zero-inflated data. Many users spend $0; those who spend anything are log-normal. | Two-part model: classify zero vs non-zero, then model the non-zero part separately. |
| Discrete with large gaps | Manual data entry, rounding, or ordinal masquerading as numeric. | Treat as ordinal or categorical. Check if gaps are meaningful or artifacts. |
| Heavy-tailed (extreme outliers) | Power-law or Pareto process. Social media followers, city sizes, wealth. | Winsorize or clip at a quantile for regression. Log transform. Understand if outliers are real. |
2.3 Missing Data — Three Mechanisms, Three Treatments
Missing data is not a problem to be "fixed" by deleting rows or filling with the mean. It is a phenomenon that carries information about why data is missing. Getting this wrong causes biased models.
Mean/median imputation is almost always wrong and causes three specific problems: it shrinks variance (all imputed values are identical, reducing variability), it distorts correlations (replacing missing values with the mean breaks the relationship between that variable and others), and it creates impossible values (imputing mean age = 34.7 when age should be an integer, or mean income in a bimodal distribution falling between the two modes where nobody actually is).
Use instead: MICE (Multiple Imputation by Chained Equations) for MAR data. Add a binary missingness indicator as a feature regardless of imputation method — the fact that a value was missing is itself informative.
2.4 Outliers — Signal or Noise?
The most dangerous question in analytics is: "Should I remove this outlier?" The answer depends entirely on why the value is extreme, which requires understanding the data generation process.
Outliers That Are Errors — Remove/Fix
- Age = 999 (data entry error)
- Revenue = -50,000 (sign error)
- Date = year 2099 (system default)
- Duplicate rows from JOIN explosion
- Test/bot traffic in production logs
- Sensor reading during calibration
Outliers That Are Real — Keep/Model
- A whale customer spending 100× average
- A viral post with 1M views vs normal 100
- A fraudulent transaction (the whole point)
- A crash event in safety-critical systems
- Record-breaking stock market day
- A medical patient with rare condition
1. Detect using IQR fences (Q3 + 1.5×IQR), Z-score (>3σ), or Isolation Forest for multivariate outliers. 2. Investigate each one — do not automate removal. 3. Document your decision and the reasoning. 4. If keeping for modeling, use robust methods (Huber loss, quantile regression, tree-based models) that are insensitive to outliers rather than deleting real data. 5. Winsorize (cap at 99th percentile) as a middle ground when outliers are real but extreme enough to destabilize linear models.
2.5 Feature Engineering — The Five Families
Feature engineering techniques fall into five categories. A deep practitioner has all five in their toolkit and knows when each is appropriate.
Family 1: Numerical Transformations
Reshaping the distribution of continuous variables to satisfy model assumptions or make relationships linear.
| Transform | Formula | Use When | Inverse |
|---|---|---|---|
| Log | log(x) or log1p(x) | Right-skewed positive data. Makes multiplicative relationships additive. | exp(x) |
| Square Root | √x | Count data (Poisson). Milder than log. Works on zeros. | x² |
| Box-Cox | (xᵅ-1)/λ | Optimal normalization — finds the best λ from data. | Parametric |
| Yeo-Johnson | Extended Box-Cox | Like Box-Cox but handles negative values and zeros. | Parametric |
| Rank / Quantile | Replace with quantile | When you need uniform or normal marginal distributions. Robust to outliers. | None |
| Winsorization | Clip at p/1-p quantiles | Reduce outlier influence without removing data. | None |
| Binning | Cut into intervals | When relationship is step-wise, not smooth. Handles non-monotonic effects. | None |
Family 2: Categorical Encoding
Converting categorical variables into numerical representations. The choice of encoding significantly impacts model performance and is highly model-dependent.
| Encoding | Output | Best For | Avoid When |
|---|---|---|---|
| One-Hot (OHE) | k-1 binary columns | Low-cardinality nominals, linear models, neural nets | High cardinality (>50 categories) — curse of dimensionality |
| Label / Ordinal | Integer 0 to k-1 | Ordinal categories (low/med/high), tree models | Nominal categories with no order — implies false ordering |
| Target Encoding | Mean target per category | High-cardinality in tree models (cities, products, users) | Without cross-validation — causes massive target leakage |
| Count / Frequency | Category frequency | When frequency itself is informative (popular vs rare) | When frequency correlates with target spuriously |
| Binary / Hash | k binary columns | Very high cardinality, memory constraints | When collision rate is unacceptable for your cardinality |
| Embeddings | Dense vector (8-32 dim) | Neural nets, very high cardinality (user IDs, product IDs) | Small datasets — not enough data to learn embedding quality |
| WoE (Weight of Evidence) | Log(P(event)/P(non-event)) | Binary classification, credit scoring, logistic regression | Multi-class problems — WoE is inherently binary |
Target encoding replaces a category with the mean target value for that category. On the surface, this seems ideal for high-cardinality features like user IDs or zip codes. The trap: you're using the target to create your features. If done naively on the full dataset, the model trains on features that directly encode the answer — this is target leakage, and it produces spectacular train metrics and terrible production performance.
Correct approach: Always use out-of-fold target encoding. For each fold in cross-validation, compute target means on the other folds only. Libraries like category_encoders handle this correctly with the TargetEncoder class.
Family 3: Datetime Features
Time is one of the richest sources of features in business data. Raw timestamps are useless — their meaning must be extracted and encoded correctly.
| Feature | Why It Matters | Encoding Note |
|---|---|---|
| Hour of day | Traffic, purchases, support tickets all have hourly patterns | Use sin/cos: sin(2π·hour/24), cos(2π·hour/24) — encodes cyclical continuity (23:59 is close to 00:01) |
| Day of week | B2B vs B2C behave differently weekday vs weekend | sin/cos encoding: sin(2π·dow/7) |
| Month / Season | Seasonality in sales, churn, engagement | sin/cos for cyclical, or one-hot if pattern is not smooth |
| Is weekend / holiday | Strong behavioral discontinuity at these boundaries | Binary flag. Use a holiday calendar library (workalendar, holidays) |
| Time since event | Recency: days since last purchase, last login, last support ticket | Raw numeric. Often log-transform for heavy-tailed distributions |
| Time to event | Days until contract renewal, subscription expiry, warranty end | Raw numeric or bucketed (30/60/90 day windows) |
| Elapsed time in period | Day 3 of month vs day 28 — payroll timing effects | Normalize: day_of_month / days_in_month → [0,1] |
Family 4: Interaction & Polynomial Features
Linear models can only capture additive effects. If the effect of feature A depends on the value of feature B, you need to encode that interaction explicitly. Tree models find interactions automatically — linear models, neural nets, and SVMs do not.
Family 5: Aggregation & Window Features
The richest features for tabular ML are often aggregations across related rows — computing behavioral summaries per entity (user, product, store) over time windows.
2.6 The Leakage Problem — The Silent Model Killer
Target leakage occurs when your features contain information that would not be available at prediction time. It is the single most common cause of models that perform brilliantly in evaluation and catastrophically in production.
- Hospital readmission model: Included "discharge_type" as a feature. Patients who died in hospital never got readmitted — discharge_type="deceased" perfectly predicted no readmission. Not available at prediction time.
- Fraud detection: Included "fraud_review_flag" set by the fraud team — this was set after fraud was discovered, not before. The model learned to detect when the fraud team had already found fraud.
- Churn prediction: Included "last_contact_reason = cancellation_inquiry" — this was set when the customer called to cancel, which is exactly when they were about to churn. Available at prediction time, but encodes the target almost perfectly.
- Loan default: Used future account balance data — the aggregation included data from months after the loan decision point.
Temporal discipline: For any feature that changes over time, define a strict "feature cutoff date" before the prediction date. All features must be computed using only data available before that cutoff. Treat-time simulation: Simulate exactly what data would have been available at the moment the prediction would be made in production. Suspicious AUC check: If a new feature increases AUC from 0.72 to 0.98, assume leakage first. Investigate before celebrating.
2.7 Feature Selection — Less Is Often More
Adding more features does not monotonically improve model performance. Irrelevant and redundant features add noise, increase training time, make models less interpretable, and in small datasets cause overfitting.
| Method | Type | How It Works | Strength |
|---|---|---|---|
| Correlation / Mutual Info | Filter | Rank features by univariate relationship with target | Fast, model-agnostic. Good first pass. |
| Variance Threshold | Filter | Remove near-zero variance features | Removes constants and near-constants instantly |
| Lasso (L1) | Embedded | Regularization drives irrelevant feature coefficients to exactly 0 | Automatic, handles collinear features |
| Feature Importance (Trees) | Embedded | Average impurity decrease per feature across all trees | Captures non-linear relationships |
| Permutation Importance | Model-agnostic | Shuffle each feature; measure performance drop | Works for any model, respects interactions |
| SHAP Values | Model-agnostic | Game-theoretic attribution of each feature's contribution | Most faithful importance measure, handles correlations |
| RFE (Recursive Feature Elimination) | Wrapper | Iteratively remove least important features, re-train | Finds optimal subset but computationally expensive |
| Boruta | Wrapper | Compare real features against randomly shuffled shadow features | Statistically rigorous, finds all relevant features |
EDA & Feature Engineering Across Industries
2.8 E-Commerce — Customer Lifetime Value Prediction
Predicting which customers will be most valuable over the next 12 months. The quality of the feature set determines almost everything about model performance.
Raw data available: order history, browsing sessions, email interactions, support tickets, demographics.
EDA reveals:
- CLV is log-normally distributed — 80% of customers contribute 20% of revenue. Mean CLV of $240 is meaningless; median $45 is more representative
- 53% of customers made exactly 1 purchase and never returned — the "one-and-done" segment dominates
- Strong bimodality in session duration: 30 seconds (bounce) vs 8 minutes (engaged). Two different user types mixed in one dataset
- Missing email open rate for 34% of customers — MNAR (non-openers have no open data), needs a separate "no_email_data" flag
Feature engineering decisions:
- Log-transform CLV target before regression — residuals are then approximately normal
- Create
is_repeat_buyerbinary flag — the one-and-done segment needs a separate model or a strong indicator - RFM features over 30/60/90/180 day windows — recency decay is exponential, so
log(days_since_last_purchase) - Session depth score:
pages_viewed × avg_time_per_page— single interaction captures both breadth and depth - Category affinity vector: normalized purchase amounts per category — who is a "shoes person" vs "electronics person"
- Email engagement ratio:
opens / sendswith Laplace smoothing for low-send customers
2.9 Healthcare — Readmission Risk Prediction
Predicting 30-day hospital readmission from electronic health records (EHR). Feature engineering here directly impacts patient outcomes.
Critical EDA findings:
- Diagnosis codes (ICD-10) have 70,000+ unique values — direct one-hot encoding is impossible. Need hierarchical grouping (CCS categories) or embedding
- Lab values have extreme missingness (40-80%) that is MNAR — sicker patients get more tests, so missing lab = probably less sick is often wrong
- Length of stay is right-skewed with a spike at 0 (same-day discharge) — zero-inflated, needs separate treatment
- Temporal leakage risk: discharge medications, discharge instructions, and follow-up appointment bookings all occur after the readmission risk window starts
Feature engineering patterns specific to healthcare:
- Comorbidity indices: Charlson Comorbidity Index and Elixhauser scores aggregate diagnosis codes into validated risk scores — domain knowledge compressed into a single feature
- Lab trajectory: Not just the last lab value but the trend — creatinine rising 0.3 mg/dL over 3 days is far more alarming than a static elevated value
- Medication complexity: Count of medications, count of drug classes, polypharmacy flag (>10 medications)
- Social determinants: Zip code poverty rate, distance to hospital — available from public databases, highly predictive
2.10 Financial Services — Credit Default Prediction
Building a credit scorecard requires both excellent EDA and regulatory-compliant feature engineering. In many jurisdictions, certain features are legally prohibited (race, gender, age in some contexts), requiring careful data governance.
EDA specifics for credit data:
- Default rates are typically 1-5% — severe class imbalance requires specific handling (SMOTE, cost-sensitive learning)
- Many applicants have thin credit files — missing features are MNAR (no credit history = different risk profile than those with missing data by chance)
- Strong multicollinearity among credit bureau features — need to identify and handle redundant features (VIF analysis)
WoE (Weight of Evidence) encoding is the industry standard for credit because it:
- Handles missing values naturally (missing = its own category)
- Produces a monotonic transformation interpretable by regulators
- Works well with logistic regression (the required interpretable model)
- The IV (Information Value) provides a built-in feature selection metric
2.11 Manufacturing — Predictive Maintenance
Using sensor data (vibration, temperature, current draw) to predict machine failure before it happens. Feature engineering on time series sensor data is its own specialized domain.
Key feature engineering patterns:
- Rolling statistics: Mean, std, min, max over 1/5/30 minute windows — captures both level and volatility changes that precede failure
- FFT features: Fast Fourier Transform extracts frequency domain features from vibration signals — rotating machinery failure shows up as specific frequency harmonics before it's visible in raw signal
- Change point features: Sudden shifts in sensor mean or variance are often the earliest failure signal — CUSUM statistics as features
- Cross-sensor features: Temperature × vibration interaction, or deviation from expected temperature given current draw — encodes physics-based domain knowledge
- Time-to-previous-maintenance: Features encoding when last service was performed and what was done
When EDA Reveals You Need a Different Model Entirely
Sometimes thorough EDA reveals that the modeling approach you planned is wrong for the data structure you have. Examples: data with strong cluster structure suggests you need separate models per cluster rather than one global model. Zero-inflated data suggests a two-part model. Strong temporal autocorrelation means your train/test split must be temporal, not random. Data that changes in character over time (concept drift) means you need a model that retrains frequently. EDA should inform the entire modeling strategy — it's not just a preprocessing step.
A Production EDA & Feature Engineering Toolkit
3.1 Automated EDA Report
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import shapiro, kstest, skew, kurtosis
import warnings
warnings.filterwarnings('ignore')
class DataProfiler:
"""
Systematic EDA profiler. Produces an actionable report of every
column's properties and recommended treatments.
Covers: distributions, missing data, outliers, cardinality,
skewness, correlations, and recommended transforms.
"""
def __init__(self, df: pd.DataFrame, target: str = None):
self.df = df.copy()
self.target = target
self.report = {}
def profile_numeric(self, col: str) -> dict:
s = self.df[col].dropna()
n_missing = self.df[col].isna().sum()
pct_missing = n_missing / len(self.df)
# Normality test (Shapiro for n<5000, KS for larger)
if len(s) < 5000:
_, p_normal = shapiro(s.sample(min(len(s), 5000), random_state=42))
else:
_, p_normal = kstest((s - s.mean()) / s.std(), 'norm')
sk = skew(s)
kurt = kurtosis(s)
# Outlier detection: IQR method
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
n_outliers = ((s < q1 - 1.5*iqr) | (s > q3 + 1.5*iqr)).sum()
# Recommend transformation
transform_rec = 'none'
if (s > 0).all() and sk > 1.0:
transform_rec = 'log'
elif (s >= 0).all() and sk > 0.5:
transform_rec = 'sqrt'
elif abs(sk) > 0.5:
transform_rec = 'yeo-johnson'
return {
'type': 'numeric',
'n': len(s),
'n_missing': n_missing,
'pct_missing': pct_missing,
'mean': s.mean(),
'median': s.median(),
'std': s.std(),
'min': s.min(),
'max': s.max(),
'p1': s.quantile(0.01),
'p99': s.quantile(0.99),
'skewness': sk,
'kurtosis': kurt,
'is_normal': p_normal > 0.05,
'n_outliers_iqr': n_outliers,
'pct_outliers': n_outliers / len(s),
'transform_rec': transform_rec,
'zero_pct': (s == 0).mean(),
'negative_pct': (s < 0).mean(),
}
def profile_categorical(self, col: str) -> dict:
s = self.df[col]
n_missing = s.isna().sum()
n_unique = s.nunique()
top_5 = s.value_counts().head(5).to_dict()
top1_pct = s.value_counts(normalize=True).iloc[0] if n_unique > 0 else 0
# Cardinality-based encoding recommendation
if n_unique <= 2: enc = 'binary'
elif n_unique <= 10: enc = 'one-hot'
elif n_unique <= 50: enc = 'target-encoding (with CV)'
else: enc = 'target-encoding or embedding'
return {
'type': 'categorical',
'n_unique': n_unique,
'n_missing': n_missing,
'pct_missing': n_missing / len(self.df),
'top_5_values': top_5,
'top1_dominance': top1_pct,
'encoding_rec': enc,
}
def run(self) -> pd.DataFrame:
rows = []
for col in self.df.columns:
if self.df[col].dtype in ['int64', 'float64']:
info = self.profile_numeric(col)
else:
info = self.profile_categorical(col)
info['column'] = col
rows.append(info)
summary = pd.DataFrame(rows).set_index('column')
# Flag columns needing attention
print("═"*60)
print("DATA PROFILING REPORT — Action Items")
print("═"*60)
for col, row in summary.iterrows():
issues = []
if row.get('pct_missing', 0) > 0.05:
issues.append(f"⚠ {row['pct_missing']:.0%} missing")
if row.get('transform_rec', 'none') != 'none':
issues.append(f"→ apply {row['transform_rec']}")
if row.get('pct_outliers', 0) > 0.05:
issues.append(f"⚠ {row['pct_outliers']:.0%} outliers")
if issues:
print(f" {col:<30}" + " | ".join(issues))
return summary
# ─── Usage ────────────────────────────────────────────────────
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)
df = housing.frame
profiler = DataProfiler(df, target='MedHouseVal')
report = profiler.run()
print("\nFull report:")
print(report[['type', 'pct_missing', 'skewness', 'transform_rec']].round(3))
3.2 Missing Data — MICE Imputation
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge
def prepare_imputation(df: pd.DataFrame, target_col: str = None) -> pd.DataFrame:
"""
Production-grade missing data handling.
Strategy:
1. Add missingness indicator columns (always informative)
2. Separate columns by missing rate
3. Apply MICE for MAR columns
4. Flag MNAR columns for manual review
"""
df = df.copy()
missing_pcts = df.isnull().mean()
missing_cols = missing_pcts[missing_pcts > 0].index.tolist()
# Step 1: Always add binary missingness indicators FIRST
# The fact that data is missing is itself a feature
for col in missing_cols:
df[f'missing_{col}'] = df[col].isna().astype(int)
print("Missingness indicators added for:", missing_cols)
# Step 2: MICE imputation using Bayesian Ridge as the estimator
# BayesianRidge is fast and works well for most numeric data
# For categorical columns: use RandomForestClassifier instead
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
numeric_cols = [c for c in numeric_cols if not c.startswith('missing_')]
mice_imputer = IterativeImputer(
estimator=BayesianRidge(), # model each column given all others
max_iter=10, # iterations of chained equations
random_state=42,
verbose=0
)
df[numeric_cols] = mice_imputer.fit_transform(df[numeric_cols])
print(f"MICE imputation complete for {len(numeric_cols)} numeric columns")
return df, mice_imputer
# ─── Test with synthetic data with different missing mechanisms ─
np.random.seed(42)
n = 1000
df_test = pd.DataFrame({
'age': np.random.normal(45, 15, n),
'income': np.random.lognormal(10, 1, n),
'score': np.random.normal(700, 80, n),
})
# MCAR: 10% randomly missing
mcar_mask = np.random.rand(n) < 0.10
df_test.loc[mcar_mask, 'score'] = np.nan
# MAR: income missing more for younger people
mar_mask = (np.random.rand(n) < 0.3) & (df_test['age'] < 35)
df_test.loc[mar_mask, 'income'] = np.nan
print(f"Missing before: {df_test.isnull().sum().to_dict()}")
df_imputed, imputer = prepare_imputation(df_test)
print(f"Missing after: {df_imputed[['age','income','score']].isnull().sum().to_dict()}")
3.3 Feature Engineering Pipeline — Production Grade
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (PowerTransformer, StandardScaler,
OneHotEncoder, TargetEncoder)
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
# ─── Custom Transformer 1: Datetime Feature Extractor ─────────
class DatetimeFeatures(BaseEstimator, TransformerMixin):
"""Extract rich features from a datetime column using cyclical encoding."""
def __init__(self, col: str):
self.col = col
def fit(self, X, y=None): return self
def transform(self, X):
X = X.copy()
dt = pd.to_datetime(X[self.col])
# Cyclical encoding: hour, day of week, month
X['hour_sin'] = np.sin(2*np.pi*dt.dt.hour / 24)
X['hour_cos'] = np.cos(2*np.pi*dt.dt.hour / 24)
X['dow_sin'] = np.sin(2*np.pi*dt.dt.dayofweek / 7)
X['dow_cos'] = np.cos(2*np.pi*dt.dt.dayofweek / 7)
X['month_sin'] = np.sin(2*np.pi*dt.dt.month / 12)
X['month_cos'] = np.cos(2*np.pi*dt.dt.month / 12)
X['is_weekend'] = (dt.dt.dayofweek >= 5).astype(int)
X['is_month_end'] = dt.dt.is_month_end.astype(int)
return X.drop(columns=[self.col])
# ─── Custom Transformer 2: RFM Features ───────────────────────
class RFMFeatures(BaseEstimator, TransformerMixin):
"""
Compute Recency-Frequency-Monetary features from transaction data.
snapshot_date: the "as of" date for recency calculation.
windows_days: time windows for rolling aggregations.
"""
def __init__(self, customer_col, date_col, amount_col,
windows_days=[7, 30, 90], snapshot_date=None):
self.customer_col = customer_col
self.date_col = date_col
self.amount_col = amount_col
self.windows_days = windows_days
self.snapshot_date = snapshot_date
def fit(self, X, y=None): return self
def transform(self, X):
X = X.copy()
X[self.date_col] = pd.to_datetime(X[self.date_col])
snap = self.snapshot_date or X[self.date_col].max()
features = []
for cid, grp in X.groupby(self.customer_col):
grp = grp.sort_values(self.date_col)
row = {self.customer_col: cid}
# Recency (lower = more recent = better)
last_tx = grp[self.date_col].max()
row['recency_days'] = (snap - last_tx).days
row['log_recency'] = np.log1p(row['recency_days'])
# Rolling frequency and monetary for each window
for w in self.windows_days:
cutoff = snap - pd.Timedelta(days=w)
recent = grp[grp[self.date_col] >= cutoff]
row[f'freq_{w}d'] = len(recent)
row[f'total_spend_{w}d'] = recent[self.amount_col].sum()
row[f'avg_spend_{w}d'] = recent[self.amount_col].mean() if len(recent) > 0 else 0
# Trend: spend in last 30d vs 31-60d (momentum feature)
w1 = snap - pd.Timedelta(days=30)
w2 = snap - pd.Timedelta(days=60)
recent_30 = grp[grp[self.date_col] >= w1][self.amount_col].sum()
prev_30 = grp[(grp[self.date_col] >= w2) &
(grp[self.date_col] < w1)][self.amount_col].sum()
row['spend_momentum'] = np.log1p(recent_30) - np.log1p(prev_30)
features.append(row)
return pd.DataFrame(features).set_index(self.customer_col)
# ─── Custom Transformer 3: Outlier Winsorizer ─────────────────
class Winsorizer(BaseEstimator, TransformerMixin):
"""Clip numeric features at specified quantile bounds."""
def __init__(self, lower=0.01, upper=0.99):
self.lower = lower; self.upper = upper
self.bounds_ = {}
def fit(self, X, y=None):
if isinstance(X, pd.DataFrame):
for col in X.columns:
self.bounds_[col] = (X[col].quantile(self.lower),
X[col].quantile(self.upper))
else:
for i in range(X.shape[1]):
self.bounds_[i] = (np.quantile(X[:,i], self.lower),
np.quantile(X[:,i], self.upper))
return self
def transform(self, X):
X = X.copy()
if isinstance(X, pd.DataFrame):
for col, (lo, hi) in self.bounds_.items():
X[col] = X[col].clip(lo, hi)
else:
for i, (lo, hi) in self.bounds_.items():
X[:,i] = np.clip(X[:,i], lo, hi)
return X
# ─── Full Feature Pipeline ─────────────────────────────────────
def build_feature_pipeline(
numeric_cols: list,
categorical_low: list, # low cardinality → one-hot
categorical_high: list, # high cardinality → target encoding
) -> ColumnTransformer:
"""
Builds a production sklearn ColumnTransformer that:
- Winsorizes then Yeo-Johnson transforms numeric columns
- One-hot encodes low-cardinality categoricals
- Target-encodes high-cardinality categoricals
Fit once on train set. Apply to train, val, test.
No data leakage possible since all stats computed on train only.
"""
numeric_pipeline = Pipeline([
('winsorize', Winsorizer(lower=0.01, upper=0.99)),
('transform', PowerTransformer(method='yeo-johnson')),
('scale', StandardScaler()),
])
categorical_low_pipeline = Pipeline([
('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])
# TargetEncoder in sklearn >= 1.3 does cross-fit automatically
categorical_high_pipeline = Pipeline([
('te', TargetEncoder(smooth='auto')),
])
preprocessor = ColumnTransformer([
('numeric', numeric_pipeline, numeric_cols),
('cat_low', categorical_low_pipeline, categorical_low),
('cat_high', categorical_high_pipeline, categorical_high),
], remainder='drop')
return preprocessor
3.4 Feature Selection — Boruta + SHAP
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.datasets import fetch_california_housing
def shap_feature_importance(
model, X_train: pd.DataFrame, n_display: int = 20
) -> pd.DataFrame:
"""
Compute SHAP-based feature importance.
SHAP is superior to built-in feature_importances_ because:
1. Handles correlated features correctly (splits credit)
2. Provides direction (positive/negative effect)
3. Is consistent: a feature that helps more always gets higher SHAP
4. Model-agnostic (works for any model)
"""
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)
# Mean absolute SHAP value = importance
mean_abs_shap = np.abs(shap_values).mean(axis=0)
importance_df = pd.DataFrame({
'feature': X_train.columns,
'shap_importance': mean_abs_shap,
'mean_shap': shap_values.mean(axis=0), # direction of effect
}).sort_values('shap_importance', ascending=False)
print(f"Top {n_display} features by SHAP importance:")
print(importance_df.head(n_display).to_string(index=False))
# SHAP summary plot
shap.summary_plot(shap_values, X_train, show=False)
return importance_df
def permutation_importance_cv(
model, X: pd.DataFrame, y: pd.Series,
n_repeats: int = 10
) -> pd.DataFrame:
"""
Permutation importance: shuffle one feature at a time,
measure performance drop. Features that matter show large drops.
More honest than impurity-based importance for correlated features.
"""
from sklearn.inspection import permutation_importance
result = permutation_importance(
model, X, y, n_repeats=n_repeats, random_state=42
)
perm_df = pd.DataFrame({
'feature': X.columns,
'importance_mean': result.importances_mean,
'importance_std': result.importances_std,
}).sort_values('importance_mean', ascending=False)
# Features with importance < 0 are actively hurting the model
negative = perm_df[perm_df['importance_mean'] < 0]
if len(negative) > 0:
print(f"WARNING: {len(negative)} features have NEGATIVE importance:")
print(negative['feature'].tolist())
print("Consider removing these — they actively hurt your model.")
return perm_df
# ─── Example: California housing ──────────────────────────────
housing = fetch_california_housing(as_frame=True)
X, y = housing.data, housing.target
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X, y)
importance_df = shap_feature_importance(rf, X)
3.5 Leakage Detection — Automated Checks
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
def leakage_detector(
df: pd.DataFrame,
target: str,
threshold_corr: float = 0.85,
threshold_auc: float = 0.95
) -> pd.DataFrame:
"""
Automatically flag features that may be leaking target information.
Two checks:
1. Correlation check: features with |corr| > threshold are suspicious
2. Single-feature AUC check: a feature that alone predicts the target
with AUC > threshold is almost certainly leaking
"""
flags = []
X = df.drop(columns=[target])
y = df[target]
for col in X.columns:
row = {'feature': col, 'leakage_flag': False, 'reason': []}
# Check 1: Pearson correlation with target
if X[col].dtype in ['float64', 'int64']:
corr = abs(X[col].corr(y))
row['correlation'] = corr
if corr > threshold_corr:
row['leakage_flag'] = True
row['reason'].append(f"HIGH CORR={corr:.3f}")
# Check 2: Single-feature predictive power (for classification)
if y.nunique() == 2 and X[col].dtype in ['float64', 'int64']:
clf = RandomForestClassifier(n_estimators=10, random_state=42)
x_col = X[[col]].fillna(X[col].median())
auc = cross_val_score(clf, x_col, y, cv=3, scoring='roc_auc').mean()
row['single_feature_auc'] = auc
if auc > threshold_auc:
row['leakage_flag'] = True
row['reason'].append(f"SINGLE FEATURE AUC={auc:.3f}")
row['reason'] = ' | '.join(row['reason']) if row['reason'] else 'clean'
flags.append(row)
result = pd.DataFrame(flags).sort_values('leakage_flag', ascending=False)
leaked = result[result['leakage_flag']]
if len(leaked) > 0:
print("⚠️ POTENTIAL LEAKAGE DETECTED:")
print(leaked[['feature', 'reason']].to_string())
else:
print("✅ No obvious leakage detected.")
return result
🧠 Section 3 Check — Implementation
- The
DataProfiler recommends a log transform for a column with skewness 2.3. But the column has 15% zeros. Why would log(x) fail here, and what should you use instead? How would you modify the profile_numeric method to catch this case?
- You're building a churn model. Your
leakage_detector flags support_ticket_reason = "cancellation" with AUC=0.97. Does this mean you should remove the feature? Under what circumstances could it be a legitimate (non-leaking) feature?
- The
RFMFeatures transformer uses np.log1p for recency. Why log1p specifically instead of log? What would happen to a customer with 0 days since last purchase if you used log?
For Q1: log(0) = -infinity. Use np.log1p(x) which computes log(1+x) and handles zeros gracefully. Update the transform_rec logic to check (s >= 0).all() instead of (s > 0).all() and use 'log1p' as the recommendation.
The Cutting Edge of EDA & Feature Engineering
4.1 Automated Feature Engineering (AutoFE)
Manual feature engineering is the most time-consuming part of the ML pipeline and requires deep domain expertise. Research into automating this process has produced several powerful approaches.
featuretools library implements this. In Kaggle competitions, DFS-generated features routinely outperform manually engineered ones. Key limitation: computational cost scales with data size and schema complexity.4.2 Self-Supervised Representation Learning for Tabular Data
Rather than manually engineering features, self-supervised learning learns rich representations directly from raw tabular data without labels. This is the tabular equivalent of what BERT did for NLP.
4.3 Missing Data — GAIN and Deep Imputation
4.4 Feature Interaction Discovery — Neural Network Approaches
4.5 Open Research Problems
- EDA for high-dimensional data: When you have 10,000 features, systematic univariate and bivariate EDA becomes computationally intractable. Scalable EDA methods that identify the most important patterns without exhaustive pair-wise analysis are an active area.
- Feature engineering for heterogeneous data: Real enterprise data combines tabular, text, images, time series, and graph data. Principled methods for creating unified feature representations from these diverse sources remain an open problem.
- Temporal leakage detection: Automated detection of temporal leakage in complex pipelines — where the leakage path goes through multiple feature transformations — is still largely unsolved. Current tools require manual inspection.
- Feature engineering under distribution shift: Features engineered on training data may become useless or harmful when the data distribution shifts. Robust feature engineering that generalizes across distribution shifts is an active research area.
- Causal feature selection: Traditional feature selection finds correlates of the target. Causal feature selection finds causes. The practical difference: a causal feature remains predictive when the distribution changes; a correlated feature may not. Methods from causal discovery (PC algorithm, FCI, GES) are being integrated with ML pipelines.
4.6 Essential Libraries for 2026
| Library | Purpose | Standout Feature |
|---|---|---|
| ydata-profiling | Automated EDA reports | One-line HTML report with all distributions, correlations, missing data |
| featuretools | Automated feature engineering | Deep Feature Synthesis from relational data |
| category_encoders | All encoding strategies | WoE, Target, James-Stein, Binary, all with proper CV support |
| missingno | Missing data visualization | Matrix, heatmap, dendrogram of missing patterns — MCAR vs MAR detection |
| shap | Feature importance & explanation | TreeExplainer for XGBoost/LightGBM is near-instant even on large models |
| skrub | Dirty data & encoding | Fuzzy string matching, GapEncoder for messy categories |
| TabPFN | Small dataset classification | Zero feature engineering needed for <1000 rows, near-instant training |
| dirty_cat | High-cardinality categoricals | SimilarityEncoder handles typos and variations in string categories |
- You're given a dataset with 500 features to build a churn model. Walk through your complete EDA process step by step — what specific things are you looking for at each stage, and what decisions does each stage inform?
- A feature
days_since_last_purchasehas a correlation of 0.72 with your target (churn). Your leakage detector flags it. Is this leakage? Justify your answer by thinking through whether this information would be available at prediction time. - You have a categorical feature
citywith 8,000 unique values. Half of the cities have fewer than 5 observations in your training set. Why does standard target encoding fail catastrophically for these rare cities, and how does smooth (shrinkage) target encoding fix it mathematically? - Your model's training AUC is 0.91 but validation AUC is 0.64. You've already done train/test split correctly. Name three feature engineering errors that could cause this specific pattern.