Chapter 04: Performance Evaluation
This chapter treats the challenge of evaluating the performance of a model. We will introduce different performance measures for regression and classification tasks, explain the problem of overfitting as well as the difference between training and test error, and, lastly, present a variety of resampling techniques.
-
Chapter 04.00: Evaluation: In a Nutshell
This nutshell chunk gives a compact overview of evaluation: how we measure the predictive performance of machine learning models and how we obtain reliable estimates of that performance.
-
Chapter 04.01: Generalization Error
Evaluating the performance of a learner is a crucial part of machine learning. We will explain the concept of generalization error and the difference between inner and outer loss.
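As a point of reference, the generalization error of a fixed model $\hat{f}$ is commonly formalized as the expected outer loss over fresh draws from the data-generating distribution (a minimal sketch; the notation may differ slightly from the slides):

$$
\mathrm{GE}(\hat{f}) = \mathbb{E}_{(\mathbf{x}, y) \sim \mathbb{P}_{xy}}\left[ L\left(y, \hat{f}(\mathbf{x})\right) \right],
$$

where the outer loss $L$ is the criterion used for evaluation, while the inner loss is the objective actually optimized during training.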
-
Chapter 04.02: Measures Regression
In this section we familiarize ourselves with essential performance measures for regression. In particular, mean squared error (MSE), mean absolute error (MAE), and a straightforward generalization of $R^2$ are discussed.
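To make these measures concrete, here is a minimal NumPy sketch (function names and toy data are our own, not part of the course material):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: average squared deviation of predictions from targets."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean absolute error: average absolute deviation, less sensitive to outliers."""
    return np.mean(np.abs(y - y_hat))

def r_squared(y, y_hat):
    """1 - SSE(model) / SSE(constant mean predictor); evaluated on arbitrary
    (e.g. test) data, this reads as the straightforward generalization of R^2."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

y = np.array([3.0, 1.5, 4.2, 2.8])
y_hat = np.array([2.9, 1.8, 4.0, 3.1])
print(mse(y, y_hat), mae(y, y_hat), r_squared(y, y_hat))
```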
-
Chapter 04.03: Training Error
There are two types of errors: training errors and test errors. This section focuses on the training error and on why it is generally an overly optimistic and therefore unreliable estimate of future performance.
-
Chapter 04.04: Test Error
While we can infer some information about the learning process from training errors (e.g., the state of iterative optimization), we are truly interested in generalization ability, and thus in the test error on previously unseen data.
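A small simulated illustration of the difference, assuming a holdout split and a flexible polynomial fit of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated regression data; the last 30 observations play the role of unseen test data.
x = rng.uniform(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 100)
x_tr, y_tr, x_te, y_te = x[:70], y[:70], x[70:], y[70:]

# Fit a fairly flexible polynomial on the training part only.
coefs = np.polyfit(x_tr, y_tr, deg=9)

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

print("train MSE:", mse(y_tr, np.polyval(coefs, x_tr)))
print("test MSE: ", mse(y_te, np.polyval(coefs, x_te)))   # usually noticeably higher
```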
-
Chapter 04.05: Overfitting & Underfitting
In machine learning, we are interested in a model that captures the true underlying function and still generalizes well to new data. When the model fails on the first task, we speak of underfitting, and both train and test error will be high. On the other hand, learning the training data very well at the expense of generalization ability is referred to as overfitting and usually occurs when there is not enough data to tell our hypotheses apart. We will show you examples of this behavior and how to diagnose overfitting.
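A minimal simulation sketch of both failure modes, using polynomial degree as a stand-in for model complexity (data and degrees are illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Small training set, larger test set: a setting where flexible models overfit easily.
x_tr = rng.uniform(0, 1, 15); y_tr = np.sin(2 * np.pi * x_tr) + rng.normal(0, 0.3, 15)
x_te = rng.uniform(0, 1, 200); y_te = np.sin(2 * np.pi * x_te) + rng.normal(0, 0.3, 200)

# Degree 1 underfits (train and test error both high); degree 9 overfits
# (train error near zero, test error much larger); intermediate degrees do best.
for deg in (1, 3, 9):
    coefs = np.polyfit(x_tr, y_tr, deg=deg)
    tr, te = mse(y_tr, np.polyval(coefs, x_tr)), mse(y_te, np.polyval(coefs, x_te))
    print(f"degree {deg}: train MSE {tr:.3f}, test MSE {te:.3f}")
```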
-
Chapter 04.06: Resampling 1
Different resampling techniques help to assess the performance of a learner while avoiding potential quirks resulting from a single train-test split. We will introduce cross-validation (with and without stratification), the bootstrap, and subsampling.
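A bare-bones sketch of k-fold cross-validation on simulated data (the helper functions and the simple linear fit are our own illustrative choices):

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Shuffle observation indices and split them into k roughly equal folds."""
    idx = rng.permutation(n)
    return np.array_split(idx, k)

def cross_validate(x, y, k=5, seed=0):
    """Estimate the test MSE of a simple linear fit by k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = kfold_indices(len(x), k, rng)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = np.polyfit(x[train_idx], y[train_idx], deg=1)
        pred = np.polyval(coefs, x[test_idx])
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return np.mean(errors)  # average fold error as the performance estimate

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 2 * x + rng.normal(0, 0.2, 100)
print(cross_validate(x, y, k=5))
```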
-
Chapter 04.07: Resampling 2
We provide a deep dive into resampling, showing its superiority over holdout splitting and analyzing the bias-variance decomposition of its MSE. We further point out that the results of individual CV folds are not independent, which is why standard hypothesis tests cannot simply be applied to them, and give some practical tips for choosing a resampling strategy.
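As a rough illustration of the variance argument, the following sketch repeatedly estimates the test error of a simple linear fit via a single 50/50 holdout split and via 10-fold CV (the simulation setup is entirely our own, not taken from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def holdout_mse(x, y, frac=0.5):
    """Single train-test split: the estimate depends heavily on which half is held out."""
    idx = rng.permutation(len(x)); cut = int(frac * len(x))
    tr, te = idx[:cut], idx[cut:]
    coefs = np.polyfit(x[tr], y[tr], deg=1)
    return np.mean((y[te] - np.polyval(coefs, x[te])) ** 2)

def cv_mse(x, y, k=10):
    """k-fold CV: every observation is used for testing exactly once."""
    folds = np.array_split(rng.permutation(len(x)), k)
    errs = []
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = np.polyfit(x[tr], y[tr], deg=1)
        errs.append(np.mean((y[te] - np.polyval(coefs, x[te])) ** 2))
    return np.mean(errs)

# Repeat the estimation on fresh simulated datasets: the CV estimates typically
# scatter less than the single 50/50 holdout estimates.
hold, cv = [], []
for _ in range(200):
    x = rng.uniform(0, 1, 100)
    y = 2 * x + rng.normal(0, 0.5, 100)
    hold.append(holdout_mse(x, y))
    cv.append(cv_mse(x, y))
print("sd holdout:", np.std(hold), " sd 10-fold CV:", np.std(cv))
```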
-
Chapter 04.08: Measures Classification
Analogous to regression, we consider essential performance measures for classification. As a classifier predicts either class labels or scores/probabilities, its performance can be evaluated based on either of these two notions. We show some performance measures for classification, including the misclassification error rate (MCE), accuracy (ACC), and the Brier score (BS). In addition, we will look at confusion matrices and learn how unequal misclassification costs can be taken into account.
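A compact NumPy sketch of these label- and probability-based measures on toy data (labels, scores, and the 0.5 threshold are made up for illustration):

```python
import numpy as np

y      = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # true labels
scores = np.array([0.9, 0.3, 0.6, 0.8, 0.4, 0.7, 0.2, 0.1])   # predicted P(y = 1)
y_hat  = (scores >= 0.5).astype(int)                           # hard labels at threshold 0.5

mce   = np.mean(y_hat != y)          # misclassification error rate
acc   = np.mean(y_hat == y)          # accuracy = 1 - MCE
brier = np.mean((scores - y) ** 2)   # Brier score, evaluates the probabilities directly

# Confusion matrix: rows = true class (1, 0), columns = predicted class (1, 0).
conf = np.array([[np.sum((y == 1) & (y_hat == 1)), np.sum((y == 1) & (y_hat == 0))],
                 [np.sum((y == 0) & (y_hat == 1)), np.sum((y == 0) & (y_hat == 0))]])
print(mce, acc, brier)
print(conf)
```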
-
Chapter 04.09: ROC Basics
From the confusion matrix we can calculate a variety of ROC metrics. Among others, we will explain the true positive rate, the negative predictive value, and the $F_1$ measure.
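For instance, with hypothetical confusion-matrix counts these metrics reduce to a few ratios (the numbers below are invented):

```python
# Hypothetical confusion-matrix counts: true/false positives and negatives.
tp, fp, fn, tn = 40, 10, 5, 45

tpr = tp / (tp + fn)               # true positive rate (recall, sensitivity)
ppv = tp / (tp + fp)               # positive predictive value (precision)
npv = tn / (tn + fn)               # negative predictive value
f1  = 2 * ppv * tpr / (ppv + tpr)  # F1: harmonic mean of precision and recall
print(tpr, ppv, npv, f1)
```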
-
Chapter 04.10: ROC Curves
In this section, we explain the ROC curve and how to calculate it. In addition, we will present the AUC as a global performance measure that integrates over all possible thresholds.
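A minimal sketch of how an ROC curve and its AUC can be computed by sweeping the threshold over the observed scores (toy data; real implementations handle ties and large samples more carefully):

```python
import numpy as np

def roc_points(y, scores):
    """TPR/FPR pairs obtained by sweeping the decision threshold over all observed scores."""
    thresholds = np.r_[np.inf, np.sort(scores)[::-1]]
    n_pos, n_neg = np.sum(y == 1), np.sum(y == 0)
    tpr = np.array([np.sum((scores >= t) & (y == 1)) / n_pos for t in thresholds])
    fpr = np.array([np.sum((scores >= t) & (y == 0)) / n_neg for t in thresholds])
    return fpr, tpr

y = np.array([1, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1])
fpr, tpr = roc_points(y, scores)

# AUC via the trapezoidal rule over the piecewise-linear ROC curve.
auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)
print(auc)
```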
-
Chapter 04.11: Partial AUC & Multi-Class AUC
We discuss both the partial AUC, which restricts the AUC to the relevant area for a specific application, and possible extensions of the AUC to multi-class classification.
-
Chapter 04.12: Precision-Recall Curves
Besides plotting TPR against FPR to obtain the ROC curve, it sometimes makes sense to instead consider precision (= PPV) vs recall (= TPR), especially when data are imbalanced.
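A small sketch of the corresponding threshold sweep for precision and recall on an imbalanced toy example (labels and scores are invented):

```python
import numpy as np

y      = np.array([1, 0, 1, 1, 0, 0, 0, 0, 0, 1])                       # imbalanced toy labels
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])  # predicted P(y = 1)

# Sweep the threshold over the observed scores and record precision and recall.
for t in np.sort(scores)[::-1]:
    y_hat = (scores >= t).astype(int)
    tp = np.sum((y_hat == 1) & (y == 1))
    precision = tp / np.sum(y_hat == 1)   # PPV
    recall = tp / np.sum(y == 1)          # TPR
    print(f"threshold {t:.2f}: precision {precision:.2f}, recall {recall:.2f}")
```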
-
Chapter 04.13: AUC & Mann-Whitney U Test
We demonstrate that the AUC is equivalent to the normalized test statistic of the Mann-Whitney U test; both are effectively rank-based metrics.
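A quick numerical check of this equivalence on simulated scores (the Gaussian score model is just an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy scores: positives tend to receive higher scores than negatives.
pos = rng.normal(1.0, 1.0, 30)   # scores of positive observations
neg = rng.normal(0.0, 1.0, 40)   # scores of negative observations

# AUC as the probability that a random positive outranks a random negative,
# i.e. the fraction of concordant (positive, negative) pairs (ties count 1/2).
pairs = pos[:, None] - neg[None, :]
auc_pairs = np.mean((pairs > 0) + 0.5 * (pairs == 0))

# The Mann-Whitney U statistic, computed from the ranks of the pooled sample,
# yields the same value after normalization by n_pos * n_neg.
pooled = np.concatenate([pos, neg])
ranks = pooled.argsort().argsort() + 1        # ranks 1..n (no ties in this toy example)
u = np.sum(ranks[:len(pos)]) - len(pos) * (len(pos) + 1) / 2
print(auc_pairs, u / (len(pos) * len(neg)))   # both values coincide
```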