Chapter 22: Preprocessing
This chapter covers essential preprocessing techniques for machine learning, including feature engineering, categorical encoding, feature and target transformations, handling missing data, and time series preprocessing.
-
Chapter 22.01: Introduction to Feature Engineering
We place feature engineering in the context of the ML workflow and pipelines, explain how it addresses common data science challenges such as skewed distributions, high-cardinality categorical features, and missingness, and discuss why it still matters in the age of deep learning.
-
Chapter 22.02: Categorical Encoding
We cover one-hot encoding and target/impact encoding for categorical features.
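A minimal sketch of both encodings, assuming pandas is available (the toy data and column names are illustrative only):

```python
import pandas as pd

# Toy data: a categorical feature and a numeric target.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "red"],
    "y":     [1.0,   0.0,    1.0,   0.0,     1.0,    0.0],
})

# One-hot encoding: one binary column per category level.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Target (impact) encoding: replace each level by the mean target of that
# level. In practice the means must be computed on training folds only,
# otherwise the encoding leaks the target into the features.
level_means = df.groupby("color")["y"].mean()
df["color_te"] = df["color"].map(level_means)
```

One-hot encoding grows linearly with the number of levels, while target encoding always produces a single numeric column, which is why it is attractive for high-cardinality features.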
-
Chapter 22.03: Feature Transformations
We discuss feature transformations, including standardization (scaling to zero mean and unit variance) and the Box-Cox transformation for stabilizing variance and reducing skew.
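Both transformations can be sketched in a few lines; this assumes SciPy, and the simulated skewed feature is purely illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # skewed, strictly positive

# Standardization: subtract the mean, divide by the standard deviation.
x_std = (x - x.mean()) / x.std()

# Box-Cox: power transform for strictly positive data; the exponent
# lambda is chosen by maximum likelihood.
x_bc, lam = stats.boxcox(x)
```

For lognormal data the fitted lambda is close to 0, i.e. Box-Cox essentially recovers the log transform.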
-
Chapter 22.04: Target Transformations
We explain why and when to transform the target variable using log, Box-Cox, and Yeo-Johnson transformations, and introduce the transform-fit-invert pipeline.
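The transform-fit-invert idea can be sketched with a log-transformed target; here the "model" is just the training mean, standing in for any regressor:

```python
import numpy as np

y_train = np.array([1.0, 10.0, 100.0, 1000.0])

z_train = np.log(y_train)   # 1) transform the target
z_hat = z_train.mean()      # 2) fit/predict on the transformed scale
y_hat = np.exp(z_hat)       # 3) invert back to the original scale

# Note: exp(mean(log y)) is the geometric mean, not the arithmetic mean --
# inverting a nonlinear transform changes what the prediction estimates.
```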
-
Chapter 22.05: Data Leakage
We explain the concept of train-test leakage, why it leads to overoptimistic performance estimates, and how to avoid it in preprocessing pipelines.
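A leakage-safe pattern is to wrap preprocessing and model in a single pipeline, so that scaling parameters are estimated on the training split only (a scikit-learn sketch with simulated data; the estimator choice is arbitrary):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# Leakage-safe: the scaler is fit only on the training split; the same
# fitted transform is then applied to the test split inside the pipeline.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
```

Fitting the scaler on the full data before splitting would let test-set statistics influence the training features, which is exactly the leakage the chapter warns about.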
-
Chapter 22.06: Missing Data - Introduction
We motivate the problem of missing data, show how to visualize missingness in a dataset, and outline the main options for handling it (dropping observations, dropping features, models that tolerate missing values, and imputation).
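A first diagnostic is simply the per-feature missingness rate, sketched here with pandas (the toy columns are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 40, np.nan, 31],
    "income": [50_000, 62_000, np.nan, 58_000, 60_000],
})

# Fraction of missing values per feature.
missing_rate = df.isna().mean()

# Number of fully observed rows (what would survive listwise deletion).
n_complete = df.dropna().shape[0]
```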
-
Chapter 22.07: Simple Imputation Methods
We cover simple univariate imputation strategies (mean, median, mode, out-of-range constants), the use of missingness indicator variables, the distribution shift caused by constant imputation, and imputation by sampling from the empirical feature distribution.
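The main univariate strategies fit in a few lines of pandas; the series is a toy example:

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, np.nan, 4.0, np.nan, 6.0])

# Missingness indicator: preserves the information that a value was absent.
indicator = s.isna().astype(int)

# Mean imputation: simple, but shrinks the feature's variance.
s_mean = s.fillna(s.mean())

# Out-of-range constant imputation, often paired with tree-based models
# that can split the imputed values off from the observed range.
s_const = s.fillna(-999)
```

The variance shrinkage is visible directly: the mean-imputed series has strictly smaller variance than the observed values alone.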
-
Chapter 22.08: Model-Based Imputation
We introduce model-based imputation, where a surrogate model is trained on the other features to predict the missing values, and discuss its drawbacks: sensitivity to the choice of surrogate model, the need for the surrogate to handle missing values itself, and per-feature hyperparameter tuning.
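A minimal sketch of the surrogate-model idea, using a linear regression trained on the rows where the feature is observed (simulated data; any regressor could stand in for the surrogate):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two correlated features; feature x2 has missing entries.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 3.0 * x1 + rng.normal(scale=0.1, size=200)
x2[::10] = np.nan  # knock out every 10th value

obs = ~np.isnan(x2)

# Surrogate model: predict the incomplete feature from the complete one,
# trained only on rows where x2 is observed.
surrogate = LinearRegression().fit(x1[obs].reshape(-1, 1), x2[obs])
x2_imputed = x2.copy()
x2_imputed[~obs] = surrogate.predict(x1[~obs].reshape(-1, 1))
```

With several incomplete features the surrogate itself faces missing inputs, which is one of the drawbacks noted above.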
-
Chapter 22.09: Time Series - Introduction
We introduce time series forecasting, explain how to apply tree-based models to time series, and cover detrending and multi-step forecasting.
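Detrending can be sketched with a linear fit: model the trend, work with the residuals (which a tree-based model can extrapolate safely), and add the trend back for forecasts. The synthetic series below is illustrative:

```python
import numpy as np

t = np.arange(100, dtype=float)
y = 0.5 * t + np.sin(2 * np.pi * t / 12)  # linear trend + seasonal signal

# Fit and remove the linear trend; a forecaster is then trained on the
# residuals, and the extrapolated trend is added back at prediction time.
slope, intercept = np.polyfit(t, y, deg=1)
residual = y - (slope * t + intercept)
```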
-
Chapter 22.10: Time Series Feature Engineering
We cover feature engineering for time series including calendar features, cyclic encoding, lagged and rolling features, and feature importance.
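The main feature types can be sketched with pandas (toy daily series; column names are illustrative):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=30, freq="D")
df = pd.DataFrame({"y": np.arange(30, dtype=float)}, index=idx)

# Lagged and rolling features; shifting before rolling ensures each
# feature only uses past values, not the current target.
df["lag_1"] = df["y"].shift(1)
df["roll_mean_7"] = df["y"].shift(1).rolling(7).mean()

# Cyclic encoding of a calendar feature: map day-of-week onto the unit
# circle so Sunday and Monday end up adjacent rather than 6 apart.
dow = df.index.dayofweek
df["dow_sin"] = np.sin(2 * np.pi * dow / 7)
df["dow_cos"] = np.cos(2 * np.pi * dow / 7)
```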
-
Chapter 22.11: Time Series Evaluation
We discuss time series cross-validation (expanding and sliding window), evaluation metrics, baselines, and practical guidance.
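The expanding-window scheme is what scikit-learn's `TimeSeriesSplit` implements: each training set grows, and the test set always lies strictly after it in time (no shuffling). A small sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=4)

# Record (last train index, first test index, last test index) per fold.
splits = [(train.max(), test.min(), test.max()) for train, test in tscv.split(X)]
```

A sliding window additionally drops the oldest training observations so the training set keeps a fixed size.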