Exam Jan 2023

Due No Due Date
Points 50
Submitting a file upload
Available Jan 11, 2023 at 7:58am - Jan 11, 2023 at 3:30pm 7 hours and 32 minutes

This assignment was locked Jan 11, 2023 at 3:30pm.

Solution to this exam: exam2023solutions-4.pdf

All solutions must be well motivated.
Code for problems 2,3,5 (marked with *) should be handed in.
Code should be understandable; comments will help.
Max points on the exam is 50.
(Preliminary) limits for grades are 3: 25, 4: 33, 5: 42

Problem 1 [6 points]

For very small waves (small ripples) in deep water it seems reasonable to assume that only surface tension is responsible for the wave motion and that gravity can be neglected. Assume the involved variables are

$\omega$ = wave frequency [1/s]
$\tau$ = surface tension [Newton/meter = kg/s^2]
$L$ = wavelength [meter]
$\rho$ = water density [kg/meter^3]

Use dimensional analysis to show that this indicates a relation of the form

$\tau = \textrm{Const}\cdot \omega^{n_1} L^{n_2} \rho^{n_3}$ (where "Const" is a dimensionless constant.)

and determine the integers $n_1,n_2, n_3$ .

(Hint: You might find this matrix useful $\begin{bmatrix} -1 & -2 & 0 & 0 \\ 0 & 1 & 0 & 1\\ 0 & 0 & 1 & -3 \end{bmatrix}$ )
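The hint matrix can be read as a dimension matrix: its columns follow the order in which the variables are listed ($\omega$, $\tau$, $L$, $\rho$) and its rows hold the exponents of seconds, kilograms and meters in each unit. A product $\omega^{a}\tau^{b}L^{c}\rho^{d}$ is dimensionless exactly when $(a,b,c,d)$ lies in the null space of that matrix. The following is a minimal sketch of that computation with exact rational arithmetic; the helper name `null_space_vector` is hypothetical and the routine assumes the null space is one-dimensional.

```python
from fractions import Fraction


def null_space_vector(matrix):
    """Return one null-space vector of `matrix`, assuming the null space has dimension 1."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    pivot_cols = []
    r = 0
    for c in range(n_cols):
        # Look for a non-zero pivot in column c, at or below row r.
        pivot = next((i for i in range(r, n_rows) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        pv = rows[r][c]
        rows[r] = [v / pv for v in rows[r]]
        # Eliminate column c from every other row (reduced row echelon form).
        for i in range(n_rows):
            if i != r and rows[i][c] != 0:
                f = rows[i][c]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        pivot_cols.append(c)
        r += 1
        if r == n_rows:
            break
    free_cols = [c for c in range(n_cols) if c not in pivot_cols]
    if len(free_cols) != 1:
        raise ValueError("this sketch only handles a one-dimensional null space")
    free = free_cols[0]
    x = [Fraction(0)] * n_cols
    x[free] = Fraction(1)
    for row_idx, c in enumerate(pivot_cols):
        x[c] = -rows[row_idx][free]
    return x


# Dimension matrix from the hint; columns: omega, tau, L, rho.
A = [
    [-1, -2, 0, 0],   # exponent of seconds
    [0, 1, 0, 1],     # exponent of kilograms
    [0, 0, 1, -3],    # exponent of meters
]
exponents = null_space_vector(A)
print(", ".join(str(e) for e in exponents))
```

If the printed vector is $(a,b,c,d)$, dividing by $-b$ and moving $\tau$ to the other side gives $\tau = \textrm{Const}\cdot\omega^{-a/b}L^{-c/b}\rho^{-d/b}$, from which $n_1, n_2, n_3$ can be read off.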

Problem 2 [12 points] System Identification Hands-on *

The files sysiddata.mat and sysidcode.m contain some input/output data for a SISO system with sample rate h=0.01 and an initial system identification using ARX models of varying order.

The high-order model arx10 at first seems to reproduce the data successfully.

It has been decided to use data from a step response (ystep) as test data. Unfortunately, as you can see, none of the ARX designs manages to match the system's step response correctly.

Identify a better discrete-time linear model of the system. Aim to use few model parameters. Be sure to describe your methodology, including outlier analysis and the choice of a suitable model structure and model order, and include a residual analysis as well as results for the step response test. Hand in your MATLAB code.

(Hint: Useful commands might include: help ident, systemIdentification, arx,oe,armax,bj, present, compare, resid, pwelch, bodeplot, pzmap,...)
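For orientation, an ARX model $y_k = -a_1 y_{k-1} - \dots - a_{n_a} y_{k-n_a} + b_1 u_{k-1} + \dots + b_{n_b} u_{k-n_b} + e_k$ is linear in its parameters, so it can be fitted by ordinary least squares. The sketch below illustrates this in Python with NumPy on synthetic data. It is not the toolbox implementation and does not use the exam data; the function `fit_arx` and the simulated first-order system are assumptions made for the example.

```python
import numpy as np


def fit_arx(u, y, na, nb):
    """Least-squares ARX fit; returns (a, b) in the sign convention of the formula above."""
    start = max(na, nb)
    rows = []
    for k in range(start, len(y)):
        past_y = [-y[k - i] for i in range(1, na + 1)]
        past_u = [u[k - i] for i in range(1, nb + 1)]
        rows.append(past_y + past_u)
    Phi = np.array(rows)                      # regressor matrix
    theta, *_ = np.linalg.lstsq(Phi, y[start:], rcond=None)
    return theta[:na], theta[na:]


# Synthetic first-order system (made-up parameters): y[k] = 0.8 y[k-1] + 0.5 u[k-1] + e[k]
rng = np.random.default_rng(0)
N = 2000
u = rng.normal(size=N)
y = np.zeros(N)
for k in range(1, N):
    y[k] = 0.8 * y[k - 1] + 0.5 * u[k - 1] + 0.05 * rng.normal()

a_hat, b_hat = fit_arx(u, y, na=1, nb=1)
print("a estimate:", a_hat, "b estimate:", b_hat)
```

In this synthetic case the noise enters exactly as the ARX structure assumes, so the fit is well behaved. When the disturbance enters the real system differently, other structures from the hint (oe, armax, bj) model the noise path in other ways.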

Problem 3 [12 points] Supervised Learning/Classification *

This Google Colab file describes binary classification of certain objects, based on measurements of 7 different features. You want to use at most 3 of these features (since they are cumbersome to obtain in the future).

a) The code describes an initial attempt at training a KNN classifier, achieving about 70% accuracy. Study the file and improve the accuracy of the KNN classifier. Use at most 3 of the features. (Hint: Somewhere near 85% should be possible.) Hand in your code. Don't forget to motivate your design.

b) Describe how training a Random Forest can help you to guess which features are useful.

c) Assume that misclassification errors are asymmetrically important, so that a high false positive rate (FPR) is much worse than a high false negative rate (FNR). Describe how you can, without any retraining, reduce the FPR (which will, however, also increase the FNR) by using the information in the calculated classification probabilities (probs).

Hint: Do not spend time on trying alternative classifier methods, use the KNN method already described in the code.
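To make the FPR/FNR trade-off in part c) concrete, the sketch below computes both rates for two decision thresholds applied to the same predicted probabilities, with no retraining. The labels and probabilities are made-up numbers, and `rates` is a hypothetical helper; in the Colab file the probabilities would come from the trained classifier instead.

```python
import numpy as np


def rates(y_true, probs, threshold):
    """Return (FPR, FNR) when predicting positive iff probs >= threshold."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(probs) >= threshold
    fp = np.sum(y_pred & ~y_true)             # negatives predicted positive
    fn = np.sum(~y_pred & y_true)             # positives predicted negative
    fpr = fp / max(np.sum(~y_true), 1)
    fnr = fn / max(np.sum(y_true), 1)
    return fpr, fnr


# Hypothetical labels and positive-class probabilities (illustration only).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
probs = np.array([0.1, 0.4, 0.6, 0.7, 0.5, 0.6, 0.8, 0.9])

for t in (0.5, 0.75):
    fpr, fnr = rates(y_true, probs, t)
    print(f"threshold={t}: FPR={fpr:.2f}, FNR={fnr:.2f}")
```

Raising the threshold demands more evidence before predicting the positive class, which moves errors from false positives to false negatives.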

Problem 4 [10 points] Linear Regression

A friend of yours wants to design a predictor for y = shoe size based on x = (height, leg-length) data from N=200 people, using linear regression. The upper plot shows such a predictor based on x1 = height and the lower plot a predictor based on x2 = leg-length (average of left and right leg).

The accuracy is not very impressive, but that is not the reason your friend comes to you for advice. An issue has turned up when designing a linear regressor based on both inputs simultaneously, i.e. x1 = height and x2 = leg-length: the parameters then became (theta1;theta2;bias) = (30.2 ; -13.8; -4.9), and your friend thinks it seems strange that theta2 < 0, i.e. that leg-length would be negatively connected to shoe size.

A separate cross-validation (leave-one-out, see the file) also indicated large variations in the estimated values of theta1 and theta2, increasing the skepticism for the obtained estimator.

Download and study the data and code for the previous investigation: shoedata.mat and shoecode.m

a) The file shoecode.m contains some unfinished code intended to do leave-one-out cross validation. Complete this code (one line) and report the estimated prediction error, i.e. rmse = sqrt(mean((Ypred-Y).^2)). (You do not need to hand in all the code; it is enough to write down the line you introduced.)

b) Explain to your friend why using both features (height, leglength) could give this kind of parameter estimates.

c) Describe what the parameter gamma in the code can be used for. You do not need to try to tune gamma.

d) When it comes to estimating shoe size, is the estimator with two features working better, worse, or roughly equally well, compared with the two estimators shown in the figures above?

Hints: You do not have to try to improve on the results. Study the singular value decomposition of the regressor matrix X and the vector thetav of size (2,200) which contains all the different parameter estimates performed during the leave-one-out cross validation.
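The quantities named in the hints can be illustrated on synthetic data. The following sketch, in Python with NumPy rather than the MATLAB of shoecode.m, runs leave-one-out cross validation for a two-feature least-squares fit, stores every parameter estimate in a (2, N) array, and inspects the singular values of the regressor matrix. The data are generated here with two deliberately near-identical features. They are not the shoe data, and no regularization (such as the gamma parameter) is included.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Synthetic, standardized stand-ins for two strongly related features (NOT the exam data).
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)       # nearly a copy of x1
X = np.column_stack([x1, x2])
y = x1 + 0.5 * rng.normal(size=n)

Xb = np.column_stack([X, np.ones(n)])     # regressors plus bias column
thetav = np.empty((2, n))                 # (theta1, theta2) for each left-out sample
y_pred = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i              # train on all samples except i
    theta, *_ = np.linalg.lstsq(Xb[keep], y[keep], rcond=None)
    thetav[:, i] = theta[:2]
    y_pred[i] = Xb[i] @ theta             # predict the held-out sample

rmse = np.sqrt(np.mean((y_pred - y) ** 2))
singular_values = np.linalg.svd(X, compute_uv=False)
cond = singular_values[0] / singular_values[-1]

print("LOO RMSE:", rmse)
print("singular values of X:", singular_values, "ratio:", cond)
print("std of theta1, theta2:", thetav.std(axis=1))
print("std of theta1 + theta2:", thetav.sum(axis=0).std())
```

In this synthetic setting one singular value is much smaller than the other, and the individual estimates of theta1 and theta2 vary far more across the folds than their sum does, while the held-out predictions remain reasonable.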

Problem 5 [10 points] Causal inference *

Your manager has given you a dataset with N=10000 measurements of pairs of real-valued variables (X,Y) coming from a somewhat complicated system (involving also some other variables).

She explains: "We have decided to do an intervention and change how the variable X is generated: in the future a fixed value of X will be used. We want to make Y as close as possible to 15 in the future. Analysis of historical data shows that the relation between X and Y is as seen in the figure: the relation is linear but a little noisy. The best estimate of Y for a certain X is given by Y=3.0*X (red line), as the diagram of existing data shows (before the intervention on X). The problem, however, is that our initial tests after setting X=15/3.0=5 did not give Y=15; it was more like Y=10. We will not allow any more experiments on the system until we have understood this issue. Can you please explain and solve this problem?"

After talking with an expert that understands the system better, you come up with a linear structural causal model of how the variables Y and X are related to some other variables A, B, C (hard to measure). The model is described by this directed acyclic graph (DAG)

and this file on Google Colab generates hypothetical data according to this model. Study the file to understand more details.

a) Assume the structure of this model is correct. How should one intervene and change X to a fixed value to achieve $E(Y)=15$, if the parameters c1,...,c7 were known? (Give your answer in terms of c1,...,c7. Variables A, B, C are unknown and cannot be used.)

b) Explain, in words, why the manager's analysis failed to predict the correct value for X.

The parameters c1,...,c7 are unfortunately unknown, so you cannot trust the values for them given in the code (but you can assume the rest of the model is correct).

You tell your manager that it would really help knowing A,B and C, but she says that they are costly to obtain and will only allow the cost for obtaining one of them from historical data (and it is not feasible to use any of them in the future).

c) Describe how you can use ordinary least squares on the historical data (i.e. no interventions) for X, Y and one extra variable (choose either A, B or C) to estimate a value of X that will give E(Y) = 15 in the future.

(Hint: You can assume the graph is correct, but you do not know the true values of c1,...,c7. You can experiment with the code to confirm your idea. The values of c1,c3,c4,c5,c6,c7 will not change in the future.)
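The gap between an observational regression slope and the effect of an intervention can be seen in a much simpler setting than the exam's graph. The sketch below uses a hypothetical linear model with a single confounder Z that influences both X and Y. This is not the DAG of the problem, and all coefficients are made up, chosen only so that the observational slope lands near 3 while the causal effect of X is 2. It compares the naive regression of Y on X with an ordinary least squares fit that also includes Z.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000

# Hypothetical structural model (NOT the exam's graph): Z -> X, Z -> Y, X -> Y.
Z = rng.normal(size=N)
X = Z + rng.normal(size=N)
Y = 2.0 * X + 2.0 * Z + rng.normal(size=N)

ones = np.ones(N)

# Naive fit: Y on X only. The slope absorbs part of the influence of Z.
naive, *_ = np.linalg.lstsq(np.column_stack([X, ones]), Y, rcond=None)

# Adjusted fit: Y on X and Z. The coefficient of X estimates the causal effect.
adj, *_ = np.linalg.lstsq(np.column_stack([X, Z, ones]), Y, rcond=None)
b_x, b_z, intercept = adj

# Under do(X = x), Z keeps its own distribution, so E[Y] = b_x * x + b_z * E[Z] + intercept.
target = 15.0
x_naive = (target - naive[1]) / naive[0]
x_adjusted = (target - b_z * Z.mean() - intercept) / b_x

print("naive slope:", naive[0], "-> suggested X:", x_naive)
print("adjusted slope:", b_x, "-> suggested X:", x_adjusted)
```

In this toy model the naive slope describes how Y varies with X while Z also varies, whereas fixing X by intervention leaves Z untouched, so only the adjusted coefficient is relevant for choosing X. Which variable plays the role of Z in the exam's model has to be decided from its graph.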
