Exam Aug 2024

Due: no due date
Points: 50
Submission: file upload

Allowed aid: All material is allowed, including old exams, internet access and tools such as ChatGPT.

If ChatGPT or a similar tool is used, we ask you to briefly describe how it was used (on which problems, what kind of prompts, etc.). Note: this information is only used as feedback for future course development; it will not impact your score.

Instructions: Name the files you hand in on Canvas using your anonymization code, such as NR.zip or NR-problem1.pdf. We prefer that all solutions are handed in via Canvas (photos of handwritten solutions are fine). If you really need to hand in handwritten solutions on paper at the exam, that is OK, but these must then be marked with both your anonymization code and your personal identifier.

All solutions must be well motivated.

Code that is relevant for your solutions should be submitted.

Preliminary limits for grades (out of 50p): 3: 25p, 4: 33p, 5: 42p.

Good luck!

1 Dimensional Analysis [4p]

To study the force $F$ generated by a propeller on a small drone, let us assume that the relevant variables are

$\rho$, the density of air (kg/m^3)
$\omega$, the angular rate of the propeller (1/s)
$L$, the length of the propeller (m)

Use dimensional analysis to determine a physically motivated relation of the form $F = \textrm{const} \cdot L^a \rho^b \omega^c$ with integers $a$, $b$, $c$. (The constant will depend on the shape of the propeller.)
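
One way to organise the exponent matching is as a small linear system: each base unit (kg, m, s) gives one balance equation in a, b and c. The sketch below, in plain Python with exact fractions, only illustrates that bookkeeping. The helper name `solve_exponents` and the tuple layout for the dimensions are assumptions made for this example, not part of the exam.

```python
from fractions import Fraction

def solve_exponents(columns, target):
    """Solve sum_j x_j * columns[j] = target by Gauss-Jordan elimination.

    columns[j] holds the (kg, m, s) exponents of variable j and target
    holds those of the quantity to be matched. Assumes a unique solution.
    """
    n = len(target)
    # Augmented matrix: row i is the balance equation for base unit i.
    rows = [[Fraction(col[i]) for col in columns] + [Fraction(target[i])]
            for i in range(n)]
    for k in range(n):
        # Pick a row with a nonzero pivot in column k and swap it up.
        pivot = next(r for r in range(k, n) if rows[r][k] != 0)
        rows[k], rows[pivot] = rows[pivot], rows[k]
        rows[k] = [v / rows[k][k] for v in rows[k]]
        for r in range(n):
            if r != k:
                factor = rows[r][k]
                rows[r] = [vr - factor * vk for vr, vk in zip(rows[r], rows[k])]
    return [row[-1] for row in rows]

# Exponents of (kg, m, s) for each quantity.
L_dim = (0, 1, 0)        # length: m
rho_dim = (1, -3, 0)     # density: kg / m^3
omega_dim = (0, 0, -1)   # angular rate: 1 / s
F_dim = (1, 1, -2)       # force: N = kg m / s^2

a, b, c = solve_exponents([L_dim, rho_dim, omega_dim], F_dim)
print(a, b, c)
```

The same helper works for any three variables whose dimension vectors are linearly independent; the numerical constant in front cannot be found this way.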

2 Evaluating classification performance [8p]

In binary classification we often use classifiers that compare a score $x$ with a threshold $t$:

IF $x < t$ THEN "negative case" ELSE "positive case"

In some situations it can be unclear what should be defined as "positive cases" and "negative cases". This problem concerns the consequences of this.

The figure below illustrates the score $x$ output by two slightly different implementations of the same classifier. The only difference between implementations 1 and 2 is a switched sign of the score:

In implementation 1 (left), "cancer" is defined as the "positive case" and the classifier generates a higher score for the blue cases (cancer) than for the red cases (healthy).
In implementation 2 (right), "healthy" is instead defined as the "positive case" and the sign of the score is flipped (the score is $-x$ instead).

There are 1000 healthy patients and 100 cancer patients (the same patients appear in both figures).

a) The following figure illustrates the ROC curve for implementation 1:

[Figure: ROC curve for implementation 1]

What will be true for the ROC curve for implementation 2? (Motivate!)

The ROC curve will be exactly the same
The ROC curve will be a mirrored version of the curve above
The two curves can be completely different

b) AUC = 0.918 for implementation 1. Will the AUC for implementation 2 be the same? Motivate.

c) The following figure illustrates the precision vs recall curve for implementation 1:

[Figure: Precision vs recall for implementation 1]

For implementation 1, use this figure to determine (approximately) the best achievable F1-score (F1 = 2/(1/precision + 1/recall)) if the threshold $t$ is chosen optimally.

d) What will be true for the precision-recall curve for implementation 2? (Motivate!)

The curve will be exactly the same
The curve will be a mirrored version of the curve above
The two curves can be completely different

(Reminder: TPR = TP/(TP+FN) , FPR = FP/(FP+TN), Precision = TP/(TP+FP), Recall = TPR)
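
The reminder formulas and the threshold rule can be checked numerically for any threshold. The following sketch is a minimal illustration in plain Python. The function names and the toy scores are made up for this example and are unrelated to the data in the figures; the code assumes that none of the denominators is zero.

```python
def confusion_counts(scores, labels, t):
    """Apply the rule 'x < t -> negative, else positive' and count outcomes.

    labels[i] is True for an actual positive case, False otherwise.
    """
    tp = fp = tn = fn = 0
    for x, is_positive in zip(scores, labels):
        predicted_positive = not (x < t)
        if predicted_positive and is_positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif is_positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Return TPR (= recall), FPR, precision and F1 from the four counts.

    Assumes every denominator below is nonzero.
    """
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    f1 = 2 / (1 / precision + 1 / tpr)
    return {"TPR": tpr, "FPR": fpr, "precision": precision, "F1": f1}

# Hypothetical toy data: 3 positives and 5 negatives.
scores = [0.9, 0.8, 0.35, 0.7, 0.4, 0.3, 0.2, 0.1]
labels = [True, True, True, False, False, False, False, False]

counts = confusion_counts(scores, labels, t=0.5)
result = metrics(*counts)
print(counts, result)
```

Sweeping `t` over the sorted scores and recording (FPR, TPR) or (recall, precision) for each value traces out the ROC and precision-recall curves, respectively.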

3 Supervised Learning [12p]

This Google Colab code studies a classification problem: predicting the quality of 1599 different red wines based on 11 measured input variables. Quality is given by an integer in the range 1-10, and the input variables are numerical values describing e.g. acidity, sugar and alcohol levels.

a) Describe some drawbacks with the existing code and how to improve it. Do not spend much time optimizing performance in this subproblem.

b) Rewrite the code so the prediction is treated as a regression problem, where quality is predicted as a real value, and optimize mean square error (MSE) instead.
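
For reference, the mean square error of predictions against targets is the average of the squared differences between them. A minimal sketch in plain Python follows. The constant-mean baseline and the sample quality values are assumptions for illustration only; they are not taken from the wine data or from the Colab code.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)

# Hypothetical quality scores, for illustration only.
quality = [5, 6, 5, 7, 6, 5]

# Baseline: always predict the mean, a real value rather than a class.
mean_quality = sum(quality) / len(quality)
baseline_mse = mean_squared_error(quality, [mean_quality] * len(quality))
print(mean_quality, baseline_mse)
```

A baseline of this kind gives a reference MSE that any regression model should beat on held-out data.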

Hand in your code.

Your solutions need to be well motivated and explained. Only handing in the code will not suffice. (Motivations and explanations can be written as comments in your code if you want.)

4 Causal Inference [7p]

The following directed acyclic graph illustrates a linear structural causal model. You might find this Google Colab code helpful to generate data from the model.

We are interested in calculating the causal impact of three different variables ($A$, $B$, and $C$) on the output $Y$. Remember that in the course the causal effect of $X$ on $Y$ was defined as $\frac{\partial}{\partial x} E[Y \mid \mathbf{do}(X:=x)]$. All variables are real-valued, and each node has an associated linear equation indicated by the graph, such as $A = c_2 B + c_5 C + \textrm{noise}$, etc. The model parameters $c_1, \ldots, c_6$ are considered unknown.

Which of the statements below are correct concerning the coefficients in the ordinary least squares (OLS) regression Y ~ A + B + C - 1?

a) True or False: The coefficient of A measures the causal effect of A on Y. (Hint: This causal effect equals $c_1$.)

b) True or False: The coefficient of B measures the causal effect of B on Y. Also, what is the correct value of this causal effect (give an expression using the coefficients $c_1, \ldots, c_6$)?

c) True or False: The coefficient of C measures the causal effect of C on Y. Also, what is the correct value of this causal effect (give an expression using the coefficients $c_1, \ldots, c_6$)?

d) If any of the statements in a), b) or c) is false, suggest an alternative linear regression that would give the correct result instead.
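
In R-style formula notation, Y ~ A + B + C - 1 denotes a regression of Y on A, B and C where the trailing "- 1" removes the intercept. The sketch below shows what such a fit computes, using only the Python standard library. The helper `ols_no_intercept` and the toy data are assumptions for this example: the data come from a made-up model with independent regressors, NOT from the exam's causal graph, so the printed coefficients say nothing about the answers to a)-d).

```python
import random

def ols_no_intercept(columns, y):
    """Least squares fit of y on the given regressor columns, no intercept.

    Solves the normal equations (X^T X) beta = X^T y by Gaussian
    elimination with partial pivoting; fine for a handful of regressors.
    """
    p = len(columns)
    # Augmented normal equations [X^T X | X^T y].
    m = [[sum(xi * xj for xi, xj in zip(columns[i], columns[j]))
          for j in range(p)]
         + [sum(xi * yi for xi, yi in zip(columns[i], y))]
         for i in range(p)]
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(m[r][k]))
        m[k], m[pivot] = m[pivot], m[k]
        m[k] = [v / m[k][k] for v in m[k]]
        for r in range(p):
            if r != k:
                f = m[r][k]
                m[r] = [vr - f * vk for vr, vk in zip(m[r], m[k])]
    return [row[-1] for row in m]

# Hypothetical toy data, NOT the exam's causal graph: three independent
# regressors and a linear response with known coefficients.
rng = random.Random(0)
n = 5000
A = [rng.gauss(0, 1) for _ in range(n)]
B = [rng.gauss(0, 1) for _ in range(n)]
C = [rng.gauss(0, 1) for _ in range(n)]
Y = [2.0 * a - 1.0 * b + 0.5 * c + rng.gauss(0, 0.1)
     for a, b, c in zip(A, B, C)]

coef_A, coef_B, coef_C = ols_no_intercept([A, B, C], Y)
print(coef_A, coef_B, coef_C)
```

Dropping a column from the list passed to `ols_no_intercept` corresponds to removing that term from the formula, which is one way to experiment with alternative regressions on simulated data.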

5 System Identification - Hands-on [12p]

The file sysid.mat contains some data from a linear system with one input u and one output y sampled at the rate h=0.5.

The code sysidproblem.m contains an initial very quick investigation of the data.

Identify a discrete time model of the system. Aim for using few model parameters. Be sure to describe your methodology, including outlier analysis and suitable preprocessing, choice of suitable model structure and model order, and include model validation with residual analysis etc.

(Hint: Useful commands might include help ident, systemIdentification, arx, oe, armax, bj, present, compare, resid, bodeplot, pzmap, pwelch, detrend, ...)

Also hand in your code.

Your solution needs to be well motivated. Only handing in uncommented code will give a low score.

6 System Identification Theory [7p]

Assume we want to estimate the parameters $b_0$ and $b_1$ in the model

$y(t) = b_0 u(t) + b_1 u(t-1) + e(t), \quad t=2,\ldots,N$

Here $u$ (known) and $e$ (unknown) are random signals with zero mean and with $E(u^2(t))=\sigma_u^2$ and $E(e^2(t))=\sigma_e^2$, for all $t$.

a) Write the estimation, where data for $t=2$ to $N$ is used, as a least squares problem of the form $Y = \Phi \theta + E$.
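
To make the stacked form concrete, the sketch below builds one regressor row (u(t), u(t-1)) per time step and solves the resulting 2x2 normal equations in plain Python. It is only an illustration under stated assumptions: the function name `fit_fir2`, the true parameter values and the noise levels are made up for the simulation and do not come from the exam.

```python
import random

def fit_fir2(u, y):
    """Least squares estimate of (b0, b1) in y(t) = b0*u(t) + b1*u(t-1) + e(t).

    Lists are 0-indexed here, so index k corresponds to time t = k + 1;
    rows are formed for k = 1, ..., N-1, i.e. t = 2, ..., N.
    """
    rows = [(u[k], u[k - 1]) for k in range(1, len(u))]  # rows of Phi
    targets = y[1:]                                       # entries of Y
    # Normal equations (Phi^T Phi) theta = Phi^T Y for the 2x2 case.
    s00 = sum(r[0] * r[0] for r in rows)
    s01 = sum(r[0] * r[1] for r in rows)
    s11 = sum(r[1] * r[1] for r in rows)
    g0 = sum(r[0] * yt for r, yt in zip(rows, targets))
    g1 = sum(r[1] * yt for r, yt in zip(rows, targets))
    det = s00 * s11 - s01 * s01
    return (s11 * g0 - s01 * g1) / det, (s00 * g1 - s01 * g0) / det

# Hypothetical simulation with made-up true parameters, for illustration.
rng = random.Random(1)
N = 4000
b0_true, b1_true = 1.5, -0.7
u = [rng.gauss(0, 1) for _ in range(N)]
e = [rng.gauss(0, 0.2) for _ in range(N)]
# y[0] has no u(t-1) available and is excluded from the fit anyway.
y = [0.0] + [b0_true * u[k] + b1_true * u[k - 1] + e[k] for k in range(1, N)]

b0_hat, b1_hat = fit_fir2(u, y)
print(b0_hat, b1_hat)
```

Repeating the simulation for many seeds and looking at the spread of the estimates is a simple numerical check on a theoretical covariance expression.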

b) If the signals $u$ and $e$ are white noise, with $u$ and $e$ independent, we know from the course that the parameter estimates are asymptotically correct. For this case, determine the matrix $P$ in the expression for the asymptotic estimation error

$\sqrt{N} (\widehat \theta_N - \theta_0) \to N(0, P), \quad \textrm{ when } N\to \infty$.

c) Will the parameter estimates converge to the true values if $e$ and $u$ are independent white noise (same situation as in b), but the model instead is

$y(t) = b_0 u(t) + b_1 u(t-1) + e(t) + 0.5 e(t-1), \quad t=2,\ldots,N$ ?

All solutions must be well motivated.
