Exercise 2
Exercise 2.1
In the lecture we considered a least squares model
$$y_i = x_i^T \theta + \text{noise}, \quad i = 1, \dots, N$$
where each observation $y_i$ is a real number and where $x_i$ and $\theta$ are column vectors of length $p$.
If each observation instead has $m$ real components, stored in a row vector $y_i$ of length $m$, we can use a model of the form
$$y_i = x_i^T \theta + \text{noise}, \quad i = 1, \dots, N$$
where $\theta$ is a matrix of size $p \times m$.
Describe why the formula
$$\hat\theta = (X^T X)^{-1} X^T Y$$
still gives the least squares optimal parameters if you stack the observations into a matrix
$$Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$
of size $N \times m$.
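Before writing the argument, a quick numerical sanity check can be helpful (it is not a proof). The sketch below uses synthetic data with arbitrarily chosen sizes and compares the matrix formula with a separate ordinary least squares fit for each column of $Y$. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, m = 50, 3, 2                      # sizes chosen only for illustration

X = rng.normal(size=(N, p))             # row i is x_i^T
theta_true = rng.normal(size=(p, m))    # p x m parameter matrix
Y = X @ theta_true + 0.1 * rng.normal(size=(N, m))  # row i is y_i

# Matrix formula: solve the normal equations (X^T X) theta = X^T Y
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Column by column: an ordinary least squares fit for each output
theta_cols = np.column_stack(
    [np.linalg.lstsq(X, Y[:, j], rcond=None)[0] for j in range(m)]
)

print(np.allclose(theta_hat, theta_cols))
```

If the two estimates agree, that suggests where to look for the explanation: how the total squared error splits across the $m$ outputs.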
Exercise 2.2 - Classification by Logistic Regression - small digit images
The code LogisticRegression_smalldigits.ipynb classifies images of digits (0-9).
a) Experiment with different parameters to the logistic_regression() function. You should get an accuracy of at least 95%.
b) Study the confusion matrix and choose a misclassified image (look for a square containing a 1). Find the corresponding image in the dataset and save it to a file on your computer. (This is a practical exercise in handling data.)
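One possible way to go from a cell in the confusion matrix back to a sample index is sketched below. The arrays `y_true` and `y_pred` are made-up stand-ins for the notebook's test labels and predictions, and the row/column convention (rows are true classes) is an assumption that may differ from the notebook's plot.

```python
import numpy as np

# Hypothetical labels and predictions standing in for the notebook's test set
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 1])
n_classes = 3

# Confusion matrix: rows are true classes, columns are predicted classes
cm = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)

# Off-diagonal cells that contain exactly one sample
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
cells = np.argwhere(off_diag == 1)

# Map each such cell back to the index of the sample behind it
for t, q in cells:
    idx = np.flatnonzero((y_true == t) & (y_pred == q))
    print(f"true {t}, predicted {q}: sample index {idx[0]}")
```

With the index in hand, the corresponding image array can be written to disk, for example with `matplotlib.pyplot.imsave` if matplotlib is available in your session.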
Exercise 2.3 - Recursive Least Squares
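Consider LS estimation of $\theta$ based on data from the model
$$y_i = x_i^T \theta + e_i, \quad i = 1, \dots, N$$
Following the lecture, we know that the estimate based on the first $N$ data points is given by
$$\hat\theta_N = (X_N^T X_N)^{-1} X_N^T y_N.$$
Here $X_N$ has rows $x_1^T, \dots, x_N^T$ and, in this formula, $y_N$ denotes the vector of the first $N$ observations. The matrix that gets inverted is of size $p \times p$, where $p$ is the number of features in $x$.
a) When a new data point $y_{N+1} = x_{N+1}^T \theta + e_{N+1}$ arrives, the updated estimate is given by
$$\hat\theta_{N+1} = (X_{N+1}^T X_{N+1})^{-1} X_{N+1}^T y_{N+1}.$$
Show that the following recursion generates the correct result, using the notation $P_N = (X_N^T X_N)^{-1}$ and $s_N = X_N^T y_N$:
$$\begin{aligned}
P_{N+1}^{-1} &= P_N^{-1} + x_{N+1} x_{N+1}^T \\
s_{N+1} &= s_N + x_{N+1} y_{N+1} \\
\hat\theta_{N+1} &= P_{N+1} s_{N+1} = \hat\theta_N + P_{N+1} x_{N+1} \left( y_{N+1} - x_{N+1}^T \hat\theta_N \right)
\end{aligned}$$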
b) The inversion of a $p \times p$ matrix in each iteration can actually be avoided by a clever trick. Show that the first equation in a) can be rewritten as
$$P_{N+1} = P_N - \frac{P_N x_{N+1} x_{N+1}^T P_N}{1 + x_{N+1}^T P_N x_{N+1}}$$
which avoids the matrix inverse. (The trick is a special case of the so-called matrix inversion lemma.)
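The recursion in a) and b) can be checked numerically against the batch solution. The following is a minimal sketch on synthetic data. Starting from a batch solution on the first `n0` points is an assumption made here so that $P$ exists, and the sizes and noise level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 30, 3
X = rng.normal(size=(N, p))                     # row i is x_i^T
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)

# Initialise from a batch solution on the first n0 points
n0 = 5
P = np.linalg.inv(X[:n0].T @ X[:n0])            # P_N = (X_N^T X_N)^{-1}
theta = P @ X[:n0].T @ y[:n0]

for k in range(n0, N):
    x = X[k]                                    # new regressor x_{N+1}
    Px = P @ x
    P = P - np.outer(Px, Px) / (1.0 + x @ Px)   # update of P from part b)
    theta = theta + P @ x * (y[k] - x @ theta)  # update of theta from part a)

theta_batch = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(theta, theta_batch))
```

Note that the loop only uses matrix-vector products and one scalar division per step. The single explicit inverse is in the initialisation.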
Exercise 2.4 Handling Categorical Variables
Study the code Categorical_variables.ipynb and how the categorical variable 'zone' is transformed to numerical form using either OneHotEncoding or numerical LabelEncoding. What could be the advantages and disadvantages of the two methods?
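To make the difference between the two encodings concrete, here is a small hand-rolled sketch. It does not reproduce the notebook's code, and the category values are invented for illustration.

```python
import numpy as np

# Hypothetical values of a categorical column such as 'zone'
zone = ["north", "south", "east", "south", "north"]
categories = sorted(set(zone))                  # ['east', 'north', 'south']

# Label encoding: one integer per category
index = {c: i for i, c in enumerate(categories)}
labels = np.array([index[z] for z in zone])

# One-hot encoding: one 0/1 column per category
one_hot = np.eye(len(categories), dtype=int)[labels]

print(labels)
print(one_hot)
```

Comparing the shapes of `labels` and `one_hot`, and thinking about what a linear model would do with the integers in `labels`, is a reasonable starting point for the discussion.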
Exercise 2.5 Weighted Least Squares
If different data points have different reliability, it is natural to introduce weights $w_i$ in the loss function, so that $\theta$ should minimize
$$J_{WLS}(\theta) = \sum_{i=1}^{N} (y_i - \theta^T x_i)^2 w_i^2$$
a) Would you choose $w_i$ small or large if the data $(y_i, x_i)$ is unreliable?
b) Prove that $J_{WLS}(\theta)$ is minimized by
$$\theta = (X^T W X)^{-1} X^T W Y$$
where $W$ is the diagonal matrix $W = \mathrm{diag}(w_1^2, \dots, w_N^2)$. [Hint: Consider an equivalent problem with data $(w_i y_i, w_i x_i)$.]
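The hint can be tried out numerically before proving it. The sketch below, on synthetic data with arbitrarily drawn weights, compares the closed-form expression with ordinary least squares applied to the rescaled data $(w_i y_i, w_i x_i)$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 40, 3
X = rng.normal(size=(N, p))                     # row i is x_i^T
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=N)
w = rng.uniform(0.1, 2.0, size=N)               # hypothetical weights w_i

# Closed form with W = diag(w_1^2, ..., w_N^2)
W = np.diag(w**2)
theta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Ordinary least squares on the rescaled data (w_i y_i, w_i x_i)
theta_scaled = np.linalg.lstsq(w[:, None] * X, w * y, rcond=None)[0]

print(np.allclose(theta_wls, theta_scaled))
```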
Exercise 2.6 Gradient of logistic regression function
Find a formula for the derivatives $\frac{\partial J}{\partial \theta_j}$, $j = 1, \dots, p$, of the logistic regression loss function
$$J(\theta) = \sum_{i=1}^{n_{data}} \ln\left(1 + e^{-y_i \theta^T x_i}\right).$$
Here $x_i$ denotes a column vector containing the $p$ inputs for data point $i$. (This calculation is useful if you want to write your own optimization code using gradient descent.)
Exercise 2.7 Investigating the Titanic Survival Dataset
Investigate this data using the code titanic_analysis.ipynb. You need to download the files titanic_train.csv and titanic_test.csv and upload these to your Google Colab session. The data has a lot of missing entries and other drawbacks, which are handled during some preprocessing steps.
Three different methods, Logistic Regression, KNN and Random Forests (which we study in Lec3), are then used to predict the survival of the different passengers, depending on age, sex, passenger class, etc.
a) Describe what factors increased survival probability.
b) The prediction performance of the three methods is evaluated on the training data. We know this is not a reliable method. Change the code to use 5-fold cross-validation instead. Comment on the results.
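The mechanics of 5-fold cross-validation in b) can be sketched as follows. This is not the notebook's code: the data is synthetic, and the nearest-centroid classifier is only a stand-in so that the example runs with NumPy alone. All function names are illustrative.

```python
import numpy as np

def k_fold_accuracy(X, y, fit, predict, k=5, seed=0):
    """Mean and per-fold accuracy of a classifier under k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])         # train on the other k-1 folds
        scores.append(np.mean(predict(model, X[test]) == y[test]))
    return np.mean(scores), scores

# Stand-in classifier (nearest class centroid), used only for illustration
def fit_centroids(X, y):
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict_centroids(model, X):
    classes, centroids = model
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

# Synthetic two-class data in place of the Titanic features
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

mean_acc, fold_acc = k_fold_accuracy(X, y, fit_centroids, predict_centroids)
print(mean_acc, fold_acc)
```

If the notebook's models are scikit-learn estimators, `cross_val_score(model, X, y, cv=5)` from `sklearn.model_selection` performs the same procedure in a single call.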
Note: To help you understand the Pandas toolbox further, you might want to watch this
10 minute guide to pandas (it's more like 30 min)
or scan some of the examples on pythonexamples.org/pandas-examples
Solutions: sol2.pdf