Chapter 4 - Training Models
Summary notes
Models for training, or training of models? The latter is the topic: training is done by tuning the model parameters.
General note: remember to scale your features; gradient descent converges much faster when the features are on similar scales.
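A minimal sketch of feature scaling with scikit-learn's StandardScaler (the data below is made up purely for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training data: two features on very different scales.
X_train = np.array([[1.0, 2000.0],
                    [2.0, 3000.0],
                    [3.0, 1000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)  # zero mean, unit variance per feature
# Reuse the same scaler (fitted on training data only) for validation/test data.
```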
Linear regression - "plain"
ŷ = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ, where the x's are the features and the θ's are the model parameters.
The task is to find the θ vector that minimizes the error on the training set (note 1), and to do so efficiently.
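As a small worked example (a sketch, not the book's exact code), the prediction and the mean squared error on the training set can be written directly in NumPy, with x₀ = 1 as the bias feature:

```python
import numpy as np

def predict(theta, X):
    """ŷ = θ0 + θ1·x1 + ... + θn·xn, with a leading column of 1s as x0."""
    X_b = np.c_[np.ones((X.shape[0], 1)), X]  # add x0 = 1 to each instance
    return X_b @ theta

def mse(theta, X, y):
    """Mean squared error on the training set -- the quantity we minimize."""
    errors = predict(theta, X) - y
    return np.mean(errors ** 2)
```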
To perform the task efficiently, gradient descent (GD) is proposed: we are to find the bottom of a bowl by repeatedly taking a step in the local downward direction. Three options are mentioned for computing the local downward direction (gradient), sketched in code after the list:
- Use all training data in each iteration to compute the local gradient => Batch gradient descent
- Use a single random sample in each iteration to compute the local gradient => Stochastic gradient descent
- Use a random subset of samples in each iteration to compute the local gradient => Mini-batch gradient descent
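A rough NumPy sketch of the three variants for linear regression (the learning rates and iteration counts are arbitrary illustrative values, not recommendations):

```python
import numpy as np

def gradient(theta, X_b, y):
    """Gradient of the MSE for linear regression: (2/m) * X_bᵀ (X_b θ - y)."""
    m = len(y)
    return 2 / m * X_b.T @ (X_b @ theta - y)

def batch_gd(X_b, y, eta=0.1, n_epochs=1000):
    theta = np.random.randn(X_b.shape[1])
    for _ in range(n_epochs):
        theta -= eta * gradient(theta, X_b, y)        # all samples each step
    return theta

def stochastic_gd(X_b, y, eta=0.01, n_epochs=50):
    theta = np.random.randn(X_b.shape[1])
    m = len(y)
    for _ in range(n_epochs):
        for _ in range(m):
            i = np.random.randint(m)                  # one random sample per step
            theta -= eta * gradient(theta, X_b[i:i+1], y[i:i+1])
    return theta

def minibatch_gd(X_b, y, eta=0.05, n_steps=1000, batch_size=16):
    theta = np.random.randn(X_b.shape[1])
    m = len(y)
    for _ in range(n_steps):
        idx = np.random.choice(m, batch_size)         # random subset per step
        theta -= eta * gradient(theta, X_b[idx], y[idx])
    return theta
```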
In case polynomial dependencies are expected, these can be encoded as extra features, i.e. x₂ can be allocated to represent x₁².
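A sketch using scikit-learn's PolynomialFeatures to add e.g. x₁² as an extra feature before fitting a plain linear regression (the data is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Made-up 1-D data with a roughly quadratic relationship.
X = np.random.rand(100, 1)
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + np.random.randn(100) * 0.1

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)        # columns: [x1, x1^2]
model = LinearRegression().fit(X_poly, y)
```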
Linear regression using Regularized Linear Models
Returning to note 1 above, optimizing with respect to the training data is actually not the goal; doing so may even make things worse due to the risk of overfitting. What we want is a small error on the validation data. To achieve this the model should be as simple as possible (but not simpler than that!). Less complexity is achieved by adding extra terms to the cost function during training that promote low values of the θ's. The weight of this extra cost is a hyperparameter that needs tuning too. The extra cost term can be the squared ℓ2 norm of θ (ridge regression), the ℓ1 norm of θ (lasso regression), or a weighted combination of the two (elastic net).
(The extra terms are only used during training, not in the prediction model. So here we actually have a training model that is not equal to the prediction model; cf. the title of the chapter.)
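A minimal sketch of the three regularized variants in scikit-learn; alpha is the weight on the extra cost term and needs tuning (e.g. by cross-validation), so the values below are just placeholders:

```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0)                       # penalizes the squared l2 norm of theta
lasso = Lasso(alpha=0.1)                       # penalizes the l1 norm of theta
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)  # weighted mix of l1 and l2

# Each is fitted and used like plain linear regression, e.g.:
# ridge.fit(X_train, y_train); ridge.predict(X_val)
```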
Early stopping
Another method for preventing overfitting is to stop training when prediction performance is at its peak. Monitor the validation error while training: it should decrease as the training takes effect. When training ceases to help and the validation error instead starts to increase, that is the time to stop.
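A sketch of early stopping using scikit-learn's SGDRegressor with warm_start, so each .fit() call runs one more epoch from the previous weights (the epoch count and learning rate are illustrative assumptions):

```python
import copy
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

def train_with_early_stopping(X_train, y_train, X_val, y_val, n_epochs=500):
    # max_iter=1 + warm_start=True => each fit() continues training one epoch.
    model = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                         learning_rate="constant", eta0=0.0005)
    best_val_error = float("inf")
    best_model = None
    for epoch in range(n_epochs):
        model.fit(X_train, y_train)                 # one more epoch of training
        val_error = mean_squared_error(y_val, model.predict(X_val))
        if val_error < best_val_error:              # validation error still falling
            best_val_error = val_error
            best_model = copy.deepcopy(model)       # remember the best model so far
    return best_model
```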
Logistic regression
Here training and classification are combined. We are training a model, but the cost function is based on how well it can make a binary classification. The linear model with parameters θ is combined with a sigmoid function to give a probability value between 0 and 1, which is then evaluated in the cost function. The problem can be solved using the gradient descent methods above.
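A sketch of the idea: the linear score t = θᵀx is squashed by the sigmoid σ(t) = 1 / (1 + e^(−t)) into a probability, which is then thresholded at 0.5 for the binary decision; scikit-learn's LogisticRegression does the training for us (the data here is made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(t):
    """Squash a linear score into a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-t))

# Made-up binary classification data.
X = np.random.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X[:5])   # estimated probability of each class
labels = clf.predict(X[:5])        # class labels, thresholded at 0.5
```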
Softmax Regression
Logistic regression gives a binary classifier. For a multiclass problem, softmax regression can be used instead. It is similar to logistic regression, but a separate set of θ's is trained for each class. Each class gets a score, and from the scores a probability of the instance belonging to each class can be calculated. The estimated class is the one with the highest probability.
The difference between the sigmoid and softmax functions is described in a bit more detail here: https://dataaspirant.com/2017/03/07/difference-between-softmax-function-and-sigmoid-function/
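A sketch of the softmax computation itself: one score per class (one θ vector per class), then exponentiate and normalize to get class probabilities. The numbers and shapes below are assumptions purely for illustration:

```python
import numpy as np

def softmax(scores):
    """Turn a vector of per-class scores into probabilities that sum to 1."""
    exp = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return exp / exp.sum()

# One theta vector per class, stacked as rows; x includes the bias term x0 = 1.
Theta = np.array([[0.1, 0.5, -0.2],      # class 0
                  [0.3, -0.1, 0.4],      # class 1
                  [-0.2, 0.2, 0.1]])     # class 2
x = np.array([1.0, 2.0, -1.0])

probs = softmax(Theta @ x)               # probability per class
predicted_class = int(np.argmax(probs))  # class with the highest probability
```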
Session agenda
Summary of the chapter.
Discussion of the exercises/questions given in the book.
Discussion of the approach to the last exercise (the computer exercise).
Extra task:
Ridge regression is also known as Tikhonov regularization. What causes it to be called ridge regression? I.e., think about in what way it could be related to ridges.
(Lasso regression is an abbreviation of Least Absolute Shrinkage and Selection Operator Regression and has nothing to do with lassos. However, considering that it is more prone to removing/ignoring the impact of features (than ridge regression), one can imagine the function wielding a lasso: it catches θ's from the θ herd and zeros them.)