Exercise 1 – ML Basics

Introduction to Machine Learning

Exercise 1: HRO in coding frameworks

Throughout the lecture, we will frequently use the R package mlr3, or alternatively the Python package sklearn, together with their extension packages, which provide an integrated ecosystem for all common machine learning tasks. Let's recap the HRO principle (hypothesis space, risk, optimization) and see how it is reflected in either mlr3 or sklearn. An overview of the most important objects and their usage, illustrated with numerous examples, can be found in the mlr3 book and the scikit-learn documentation.

  1. How are the key concepts you learned about in the lecture videos (i.e., hypothesis space, risk, and optimization) implemented in these frameworks?
Solution
# Load the mlr3 ecosystem (base package plus learners)
library(mlr3)
library(mlr3learners)

# You initialize your `learner` with its properties defined by the parameters
model <- lrn("regr.lm")
print(model)
# Before training it on actual data, the learner just contains information on the
# functional form of f. Once a learner has been trained, we can examine the
# parameters of the resulting model.
x <- seq(0, 8, by = 0.01)
set.seed(42)
y <- -1 + 3 * x + rnorm(mean = 0, sd = 4, n = length(x))
dt <- data.frame(x = x, y = y)
task <- TaskRegr$new(id = "mytask", backend = dt, target = "y")
# Optimization happens rather implicitly, as mlr3 only acts as a wrapper for
# existing implementations (here, stats::lm) and calls package-specific
# optimization procedures within `$train()`:
model$train(task)
# `$score()` evaluates the default regression measure, the mean squared error (MSE)
sprintf("Model MSE: %.2f", model$predict_newdata(dt)$score())
<LearnerRegrLM:regr.lm>
* Model: -
* Parameters: list()
* Packages: mlr3, mlr3learners, stats
* Predict Types:  [response], se
* Feature Types: logical, integer, numeric, character, factor
* Properties: loglik, weights
'Model MSE: 15.10'
# Required imports for the sklearn solution
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# You initialize your "learner" or model with its properties defined by the
# parameters, e.g.,:
model = LinearRegression(fit_intercept=True)
# Before training them on actual data, they just contain information on the
# functional form of f. Once a learner has been trained we can examine the
# parameters of the resulting model.
print(model)
x = np.arange(0, 8, 0.01)
np.random.seed(42)
y = -1 + 3 * x + np.random.normal(loc=0.0, scale=4, size=len(x))
# Optimization happens rather implicitly as sklearn only acts as a wrapper for
# existing implementations and calls package-specific optimization procedures
# within the function `model.fit()`:
model.fit(x.reshape(-1, 1), y)  # reshape for a single-feature design matrix
print(
    'Model MSE: ', metrics.mean_squared_error(y, model.predict(x.reshape(-1, 1)))
)
LinearRegression()
Model MSE:  15.461825608784347
  2. Have a look at mlr3::tsk("iris") / sklearn.datasets.load_iris. What attributes does this object store?
Solution
tsk("iris")
<TaskClassif:iris> (150 x 5): Iris Flowers
* Target: Species
* Properties: multiclass
* Features (4):
  - dbl (4): Petal.Length, Petal.Width, Sepal.Length, Sepal.Width
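The mlr3 task stores analogous information; a brief sketch of some accessors (fields as documented for mlr3 Task objects):
task_iris <- tsk("iris")
# Dimensions, feature and target names stored in the task
task_iris$nrow
task_iris$ncol
task_iris$feature_names
task_iris$target_names
# Peek at the first rows of the underlying data backend
task_iris$head()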
# Load the iris dataset bundled with sklearn
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
print("Type of object iris:", type(iris))
print("Feature names:", feature_names)
print("Target names:", target_names)
print("\nShape of X and y\n", X.shape, y.shape)
print("\nType of X and y\n", type(X), type(y))
Type of object iris: <class 'sklearn.utils._bunch.Bunch'>
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']

Shape of X and y
 (150, 4) (150,)

Type of X and y
 <class 'numpy.ndarray'> <class 'numpy.ndarray'>
  3. Instantiate a regression tree learner. What are the different settings for this learner?
  • R Hint: use lrn("regr.rpart") (mlr3::mlr_learners$keys() shows all available learners).
  • Python Hint: use the DecisionTreeRegressor class from sklearn.tree and call get_params() to see all available settings.
Solution
# List available learners in base mlr3 package
mlr_learners$keys()

# Inspect regression tree learner
lrn("regr.rpart")

# List configurable hyperparameters
lrn("regr.rpart")$param_set
  1. 'classif.cv_glmnet'
  2. 'classif.debug'
  3. 'classif.featureless'
  4. 'classif.glmnet'
  5. 'classif.kknn'
  6. 'classif.lda'
  7. 'classif.log_reg'
  8. 'classif.multinom'
  9. 'classif.naive_bayes'
  10. 'classif.nnet'
  11. 'classif.qda'
  12. 'classif.ranger'
  13. 'classif.rpart'
  14. 'classif.svm'
  15. 'classif.xgboost'
  16. 'clust.agnes'
  17. 'clust.ap'
  18. 'clust.cmeans'
  19. 'clust.cobweb'
  20. 'clust.dbscan'
  21. 'clust.diana'
  22. 'clust.em'
  23. 'clust.fanny'
  24. 'clust.featureless'
  25. 'clust.ff'
  26. 'clust.hclust'
  27. 'clust.kkmeans'
  28. 'clust.kmeans'
  29. 'clust.MBatchKMeans'
  30. 'clust.mclust'
  31. 'clust.meanshift'
  32. 'clust.pam'
  33. 'clust.SimpleKMeans'
  34. 'clust.xmeans'
  35. 'regr.cv_glmnet'
  36. 'regr.debug'
  37. 'regr.featureless'
  38. 'regr.glmnet'
  39. 'regr.kknn'
  40. 'regr.km'
  41. 'regr.lm'
  42. 'regr.nnet'
  43. 'regr.ranger'
  44. 'regr.rpart'
  45. 'regr.svm'
  46. 'regr.xgboost'
<LearnerRegrRpart:regr.rpart>: Regression Tree
* Model: -
* Parameters: xval=0
* Packages: mlr3, rpart
* Predict Types:  [response]
* Feature Types: logical, integer, numeric, factor, ordered
* Properties: importance, missings, selected_features, weights
<ParamSet>
                id    class lower upper nlevels        default value
 1:             cp ParamDbl     0     1     Inf           0.01      
 2:     keep_model ParamLgl    NA    NA       2          FALSE      
 3:     maxcompete ParamInt     0   Inf     Inf              4      
 4:       maxdepth ParamInt     1    30      30             30      
 5:   maxsurrogate ParamInt     0   Inf     Inf              5      
 6:      minbucket ParamInt     1   Inf     Inf <NoDefault[3]>      
 7:       minsplit ParamInt     1   Inf     Inf             20      
 8: surrogatestyle ParamInt     0     1       2              0      
 9:   usesurrogate ParamInt     0     2       3              2      
10:           xval ParamInt     0   Inf     Inf             10     0
# Import the regression tree learner from sklearn
from sklearn.tree import DecisionTreeRegressor

# help(DecisionTreeRegressor) shows the full documentation
rtree = DecisionTreeRegressor()  # default settings
print(rtree.get_params())
{'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'random_state': None, 'splitter': 'best'}
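To see how these settings are configured in practice, a brief sketch for the mlr3 learner (the chosen values are arbitrary and only for illustration):
# Set hyperparameters at construction time ...
rtree_r <- lrn("regr.rpart", maxdepth = 5L, minsplit = 10L)
# ... or modify them afterwards via the parameter set
rtree_r$param_set$values$cp <- 0.05
rtree_r$param_set$values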

Exercise 2: Loss functions for regression tasks

In this exercise, we will examine loss functions for regression tasks in somewhat more depth.

# Simulate 20 points from a noisy linear relationship y = 0.2 + 3x + eps
set.seed(1L)
x <- runif(20L, min = 0L, max = 10L)
y <- 0.2 + 3 * x
y <- y + rnorm(length(x), sd = 0.8)

# Scatter plot with an additional outlier point at (10, 1), shown in orange
ggplot2::ggplot(data.frame(x = x, y = y), ggplot2::aes(x = x, y = y)) +
  ggplot2::geom_point() + 
  ggplot2::theme_bw() + 
  # ggplot2::geom_smooth(formula = y ~ x, method = "lm", se = FALSE) +
  ggplot2::annotate("point", x = 10L, y = 1L, color = "orange", size = 2)

  1. Consider the above linear regression task. How will the model parameters be affected by adding the new outlier point (orange) if you use
     1. \(L1\) loss,
     2. \(L2\) loss
     in the empirical risk? (You do not need to actually compute the parameter values; the sketch below merely shows how one could check numerically.)
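A minimal sketch for such a numerical check, comparing the \(L2\) fit (lm) with an \(L1\) fit (least absolute deviations, here via median regression with quantreg::rq, assuming that package is installed), each with and without the outlier:
dt_clean <- data.frame(x = x, y = y)
dt_out <- rbind(dt_clean, data.frame(x = 10, y = 1))  # append the orange outlier

# L2 loss: ordinary least squares
coef(lm(y ~ x, data = dt_clean))
coef(lm(y ~ x, data = dt_out))

# L1 loss: least absolute deviations (median regression)
coef(quantreg::rq(y ~ x, tau = 0.5, data = dt_clean))
coef(quantreg::rq(y ~ x, tau = 0.5, data = dt_out))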
# Huber loss of a single residual `res`, with threshold parameter `delta`
huber_loss <- function(res, delta = 0.5) {
  if (abs(res) <= delta) {
    0.5 * (res^2)
  } else {
    delta * abs(res) - 0.5 * (delta^2)
  }
}
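In formula terms, the piecewise function implemented by huber_loss() above reads \[ L_\delta(r) = \begin{cases} \frac{1}{2} r^2 & \text{if } |r| \leq \delta, \\ \delta \, |r| - \frac{1}{2} \delta^2 & \text{otherwise,} \end{cases} \] where \(r\) denotes the residual.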

# Evaluate the Huber loss on a grid of residuals, using delta = 5
x <- seq(-10L, 10L, length.out = 1000L)
y <- sapply(x, huber_loss, delta = 5L)

ggplot2::ggplot(data.frame(x = x, y = y), ggplot2::aes(x = x, y = y)) +
  ggplot2::geom_line() + 
  ggplot2::theme_bw()

  2. The second plot visualizes another loss function popular in regression tasks, the so-called Huber loss (depending on \(\epsilon > 0\); here: \(\epsilon = 5\), corresponding to delta in the code above). Describe how the Huber loss deals with residuals as compared to \(L1\) and \(L2\) loss. Can you guess its definition?

Exercise 3: Polynomial regression

Assume the following (noisy) data-generating process from which we have observed 50 realizations: \[y = -3 + 5 \cdot \sin(0.4 \pi x) + \epsilon\] with \(\epsilon \, \sim \mathcal{N}(0, 1)\).
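For intuition, a minimal sketch simulating such data (the range of \(x\) is not given in the exercise, so it is chosen arbitrarily here):
# Simulate 50 realizations of y = -3 + 5 * sin(0.4 * pi * x) + eps, eps ~ N(0, 1)
set.seed(123)
x <- runif(50L, min = 0L, max = 10L)
y <- -3 + 5 * sin(0.4 * pi * x) + rnorm(length(x), mean = 0, sd = 1)
plot(x, y)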

  1. We decide to model the data with a cubic polynomial (including intercept term). State the corresponding hypothesis space.

  2. State the empirical risk w.r.t. \(\boldsymbol{\theta}\) for a member of the hypothesis space. Use \(L2\) loss and be as explicit as possible.

  3. We can minimize this risk using gradient descent. Derive the gradient of the empirical risk w.r.t. \(\boldsymbol{\theta}\).

  4. Using the result for the gradient, state the calculation to update the current parameter \(\boldsymbol{\theta}^{[t]}\) (a numerical sketch of such an update loop follows after this list).

  5. You will not be able to fit the data perfectly with a cubic polynomial. Describe the advantages and disadvantages that a more flexible model class would have. Would you opt for a more flexible learner?
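As a hedged numerical illustration of items 3 and 4, using the simulated data from above (design matrix, starting values, learning rate, and iteration count are arbitrary choices, not part of the exercise):
# Gradient descent for a cubic polynomial under L2 loss
X <- cbind(1, x, x^2, x^3)                # design matrix including intercept column
theta <- rep(0, 4)                        # initial parameter vector theta^[0]
alpha <- 1e-8                             # small learning rate; convergence is slow since
                                          # the polynomial features are not standardized
for (iter in seq_len(10000L)) {
  res <- y - X %*% theta                  # residuals under the current parameters
  grad <- -2 * t(X) %*% res               # gradient of the empirical risk (sum of squared errors)
  theta <- theta - alpha * grad           # update: theta^[t+1] = theta^[t] - alpha * gradient
}
theta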

Exercise 4: Predicting abalone
