Exercise 4
Exercise 4.1 Boosting on sonar data
Run the notebook ex04_boosting.ipynb, where you will tweak some code that implements boosting with a weight vector and a simple loop. Then compare it to sklearn's AdaBoost implementation and see how well you got it to work; you should be able to reach similar performance with rather simple means.
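For orientation, here is a minimal AdaBoost-style loop with an explicit weight vector, in the spirit of the exercise (a sketch only; the variable names and the use of decision stumps are our assumptions, not taken from the notebook):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Minimal AdaBoost. Labels y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                # sample weights, start uniform
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()           # weighted training error
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        w *= np.exp(-alpha * y * pred)     # up-weight misclassified points
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # Weighted vote of the weak learners
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)
```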
Exercise 4.2 Singular Value Decomposition
a) Prove that if $U$ is a unitary matrix ($U^TU = UU^T = I$) and $y = Ux$, then $\|y\|_2^2 = \|x\|_2^2$, i.e. the sum of squares of the elements is the same for $y$ and $x$ ("unitary matrices don't change lengths").
b) Given a matrix $A$, one way of finding the $U$, $S$ and $V$ in the SVD decomposition $A = USV^T$ is to compute the eigenvectors and eigenvalues of the two matrices $AA^T$ and $A^TA$. Explain why, and how $U$, $S$ and $V$ can be obtained from this information. (It is not the best method, though.)
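As a numerical illustration of the connection (not the derivation you are asked for): the eigenvalues of $A^TA$ are the squared singular values of $A$. A small check, with arbitrary sizes and seed:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

# np.linalg.eigh sorts eigenvalues in ascending order, so reverse them;
# clip guards against tiny negative values from rounding.
evals, V = np.linalg.eigh(A.T @ A)
s_from_eig = np.sqrt(np.clip(evals[::-1], 0, None))

s = np.linalg.svd(A, compute_uv=False)
print(np.allclose(s_from_eig, s))  # True: eigenvalues of A^T A are s**2
```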
c) Use an SVD $X = USV^T$ to rewrite the normal equations $X^TX\theta = X^Ty$, and solve for $\theta$. Show that the solution (if $X^TX$ is invertible) is given by $\hat{\theta} = V_1 S_1^{-1} U_1^T y$.
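The formula is easy to verify numerically; a sketch (random data, thin SVD via numpy, all names ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)

# Thin (economy-size) SVD: X = U1 S1 V1^T
U1, s1, V1t = np.linalg.svd(X, full_matrices=False)
theta_svd = V1t.T @ ((U1.T @ y) / s1)   # V1 S1^{-1} U1^T y

theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_svd, theta_lstsq))  # True
```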
d) Same problem, but with Tikhonov regularisation, for which $(X^TX + \gamma I_p)\theta = X^Ty$.
e) In the lecture we introduced the projections $P_1 = U_1U_1^T$ and $Q_1 = V_1V_1^T$. Prove the claims stated on slides 37-38 (i.e. that $P_1^2 = P_1$, etc.).
f) The SVD can be used to approximate a matrix $A = USV^T$ with a matrix of lower rank. Write MATLAB or Python code that takes an image, represented as a matrix of pixel intensities, and calculates an optimal rank-$k$ approximation $A_k$. Try different values of $k = 1, 2, 10, \ldots$ and plot the images $A_k$. (A sketch in Python is given after the hint.)
Hint: $A_k = U_1 S_1 V_1^T$, where $U_1$, $S_1$ and $V_1$ are the parts corresponding to the $k$ largest singular values.
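A minimal Python sketch of the rank-$k$ approximation; the test image from scipy.datasets is just one convenient choice (it requires SciPy >= 1.10), and any grayscale image matrix works:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import datasets

A = datasets.face(gray=True).astype(float)   # image as a matrix of intensities
U, s, Vt = np.linalg.svd(A, full_matrices=False)

for k in (1, 2, 10, 50):
    # Keep the k largest singular values/vectors: A_k = U1 S1 V1^T
    Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    plt.figure()
    plt.imshow(Ak, cmap="gray")
    plt.title(f"rank-{k} approximation")
plt.show()
```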
Exercise 4.3 K-means and precoding
Investigate the code ex04_kmeans.ipynb, which uses the following idea: first it clusters the data vectors of length 64, corresponding to flattened 8x8 images of the digits 0-9, into 10 clusters. Then it represents each image as a vector of 10 coordinates corresponding to the distances to the 10 cluster centers, so each image is now represented by a vector of length 10. Finally, it compares the performance of a logistic regression classifier using these 10 numbers with one using the original 64.
(Extra: you can plot the 10 cluster centers to get a visual intuition for what the k-means algorithm is doing.)
Unfortunately, performance gets worse. But the idea can actually be made to work: try to tune the algorithm so that an improvement is obtained. You should be able to get about 1% average improvement, possibly even 2%. A sketch of the pipeline is given below.
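A sketch of the precoding pipeline on sklearn's digits data (the parameter choices are ours to tune; n_clusters=100 follows the hint under Solutions below):

```python
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)  # 64 features = flattened 8x8 images
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X_tr)
# KMeans.transform maps each image to its distances to the cluster centers
Z_tr, Z_te = km.transform(X_tr), km.transform(X_te)

clf = LogisticRegression(max_iter=5000).fit(Z_tr, y_tr)
print("precoded accuracy:", clf.score(Z_te, y_te))

base = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
print("raw-pixel accuracy:", base.score(X_te, y_te))
```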
Exercise 4.4 K-means
Implement your own k-means algorithm (choose the language yourself). Try it on a data set of your choice. One possible skeleton is sketched below.
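If you want something to compare against, here is a plain NumPy version of Lloyd's algorithm (one of many possible designs; initialisation by sampling k data points is a simplification):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignments and mean updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # Assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, labels
```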
Exercise 4.5 LDA
Prove that the decision regions in LDA are bounded by linear functions.
Hint: The decision boundary between two classes is most easily found by using the fact that the LDA classifier chooses the class $i$ with the highest posterior probability, which according to Bayes' rule is proportional to the likelihood scaled by the prior (the probability of data belonging to class $i$):
$$\log p(y=i \mid x_\star) \sim \log\big(p(x_\star \mid y=i)\,p(y=i)\big) = \mathrm{const}_i - \tfrac{1}{2}(x_\star - \mu_i)^T \Sigma^{-1} (x_\star - \mu_i).$$
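A sketch of the step the hint points at: on the boundary between classes $i$ and $j$ the two log-posteriors are equal, and since $\Sigma$ is shared the quadratic term $x_\star^T \Sigma^{-1} x_\star$ cancels, leaving
$$(\mu_i - \mu_j)^T \Sigma^{-1} x_\star = \tfrac{1}{2}\mu_i^T \Sigma^{-1} \mu_i - \tfrac{1}{2}\mu_j^T \Sigma^{-1} \mu_j + \mathrm{const}_j - \mathrm{const}_i,$$
which is linear in $x_\star$.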
Exercise 4.6 PCA and MNIST
Investigate the code ex04_pcaMNIST.ipynb
a) How far can you compress the images before classifier performance deteriorates?
b) In terms of representing as much variance as possible with as few components as possible, was it a good idea to apply the standard scaler to each pixel? Try without. (A sketch of this kind of experiment is given below.)
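A sketch of the kind of experiment involved (our parameter choices; the notebook's actual pipeline may differ, and the subsample sizes just keep the runtime short):

```python
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=10_000, test_size=2_000, random_state=0)

for n in (10, 30, 50, 100):
    # For (b), drop StandardScaler() from the pipeline and compare
    pipe = make_pipeline(StandardScaler(), PCA(n_components=n),
                         LogisticRegression(max_iter=1000))
    pipe.fit(X_tr, y_tr)
    print(n, "components:", pipe.score(X_te, y_te))
```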
Solutions: sol4new.pdf
Exercise 4.3: try n_clusters=100.