Exam Apr 2024

Due No Due Date
Points 50
Submitting a file upload

Solutions: exam2024Apr_solutions.pdf

Allowed aid: All material is allowed, including old exams, internet access, and tools such as ChatGPT.

If ChatGPT or a similar tool is used, we ask you to briefly describe how it was used (on which problems, what kind of prompts, etc.). Note: this information only serves as feedback for future course development; it will not affect your score.

Instructions: Name the files you hand in on Canvas using your anonymization code, such as NR.zip or NR-problem1.pdf. We prefer that all solutions are handed in via Canvas (photos of handwritten solutions are fine). If you really need to hand in handwritten solutions on paper at the exam, that is OK, but these must be marked with both your anonymization code and your personal identifier.

All solutions must be well motivated.

Code that is relevant for your solutions should be submitted.

Preliminary limits for grades (out of 50p): 3: 25p, 4: 33p, 5: 42p.

Good luck!

1 Dimensional Analysis [6p] (figure is just for illustration)

When studying the interface between different fluids, such as water, oil or air, the relative importance of gravitational forces compared to surface tension forces is important. Assume the relevant variables are

$\Delta \rho$, the difference in density between the two fluids (kg/m^3)
$g$, the constant of gravity (m/s^2)
$L$, a characteristic length, for example the size of droplets (m)
$\sigma$, the surface tension between the two fluid phases (N/m = kg/s^2)

a) Determine a dimensionless quantity $\Pi_1 = (\Delta \rho)^a g^b L^c \sigma^d$ with integers $a,b,c,d$. Choose $a=1$.

b) (Make sure you used $a=1$ in the previous problem.) The value of $\Pi_1$ can be used to predict the shape of bubbles or droplets. In some situations surface tension dominates over gravity, leading to nearly spherical shapes due to minimization of surface area. In other situations gravitational forces dominate, causing deformation from the spherical shape, such as flattening of bubbles or droplets. Do you think very low or very high values of $\Pi_1$ correspond to nearly spherical bubble shapes? Motivate.
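A candidate set of exponents for problem 1 a) can be checked mechanically by adding up the base-unit exponents of each variable. The following is only a minimal sketch of such a check; the names `UNITS`, `dimension_of` and `is_dimensionless` are made up for this illustration, and the example call deliberately uses a combination that is not dimensionless.

```python
# Base-unit exponents (kg, m, s) for each variable, taken from the list in problem 1.
UNITS = {
    "delta_rho": (1, -3, 0),   # kg/m^3
    "g":         (0, 1, -2),   # m/s^2
    "L":         (0, 1, 0),    # m
    "sigma":     (1, 0, -2),   # N/m = kg/s^2
}


def dimension_of(exponents):
    """Return the (kg, m, s) exponents of a product of variables raised to the given powers."""
    total = [0, 0, 0]
    for name, power in exponents.items():
        for i, unit_exp in enumerate(UNITS[name]):
            total[i] += power * unit_exp
    return tuple(total)


def is_dimensionless(exponents):
    """True if all base-unit exponents cancel."""
    return dimension_of(exponents) == (0, 0, 0)


# Example: delta_rho * g alone has units kg m^-2 s^-2, so it is not dimensionless.
print(dimension_of({"delta_rho": 1, "g": 1}))      # (1, -2, -2)
print(is_dimensionless({"delta_rho": 1, "g": 1}))  # False
```

Passing your own exponents a, b, c, d for the four variables shows whether the proposed $\Pi_1$ is dimensionless.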

2 Supervised Learning [12p]

This Google Colab code studies data from different countries on citizens' reported happiness and some other variables. The happiness score is a value between 0 (bad) and 10 (good). (The data is described here: "World Happiness Report". You do not have to study that link to solve the problem.)

We want to make a predictor of average "Happiness" (column 2) given the other variables (columns 3-8).

a) For several countries some data seems to be missing, being replaced with zero values. Discuss some different ways to handle the problem with the missing data. For full points, you should also use one of these methods.

b) The code implements a simple KNN regressor. If one runs the training several times with different random train-test splits one finds that it gives an RMS error around 0.6-0.9, but the values vary significantly due to the randomness in the split. Suggest, and implement, a more accurate way to evaluate the RMS error performance which is less influenced by such randomness. (Note: Don't change the method or hyperparameters in this subproblem. Only improve the way to evaluate its performance).

c) Improve the RMS error performance of the KNN regressor.

d) Also implement another method of your choice (such as decision tree, random forests, linear regression, SVM, ...).

Do NOT spend too much time trying to optimize performance; correct methodology is the most important thing.

Hand in your code.
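The split-to-split variation mentioned in 2 b) can be reproduced on a small scale. The sketch below is not the course's Colab code: it uses a hand-written one-dimensional KNN regressor and synthetic stand-in data (both assumptions made only for this illustration) and prints the test RMS error for a few different random splits.

```python
import math
import random


def knn_predict(train, x, k=3):
    """Average the targets of the k training points closest to x (1-D feature)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def rmse_for_split(data, seed, test_fraction=0.3, k=3):
    """Shuffle with the given seed, split into train/test, and return the test RMS error."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    sq_err = [(knn_predict(train, x, k) - y) ** 2 for x, y in test]
    return math.sqrt(sum(sq_err) / len(sq_err))


# Synthetic stand-in data (NOT the World Happiness data): y = 5 + 2*sin(x) + noise.
data_rng = random.Random(0)
data = []
for _ in range(60):
    x = data_rng.uniform(0, 6)
    data.append((x, 5 + 2 * math.sin(x) + data_rng.gauss(0, 0.7)))

# Same model and hyperparameters, only the random split changes.
rmses = [rmse_for_split(data, seed) for seed in range(5)]
print([round(r, 2) for r in rmses])
```

The printed values differ from seed to seed even though the method is unchanged, which is the effect a better evaluation procedure should reduce.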

3 Evaluating classification performance [6p]

Assume a certain medical test delivers a score $x>0$ which can be used to predict whether you do not have a disease (= "negative" case) or have a disease (= "positive" case). Assume the value of $x$ is random and distributed according to an exponential distribution

$p(x)=\frac{1}{\theta}e^{-x/\theta}, \quad x>0$,

where the parameter $\theta$ equals a small value $\theta_N$ in the negative case and a large value $\theta_P$ in the positive case. Assume we use the following classifier (with a certain threshold $t$):

IF $x < t$ THEN "negative case" ELSE "positive case"

The figure below illustrates the situation when $\theta_N=1$, $\theta_P=4$ and $t=2$. We would classify a case with $x<2$ as "negative" and a case with $x>2$ as "positive".
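The illustrated setting can also be explored numerically. The following is a rough simulation sketch, not part of the exam material: it draws scores for both classes, applies the threshold rule, and estimates the false and true positive rates. The function name `simulate_rates`, the sample size and the seed are arbitrary choices made for this illustration.

```python
import random


def simulate_rates(theta_n, theta_p, t, n=100_000, seed=0):
    """Estimate (FPR, TPR) of the rule 'negative if x < t, else positive' by simulation."""
    rng = random.Random(seed)
    # random.expovariate takes the rate 1/theta, so the mean of the samples is theta.
    neg = [rng.expovariate(1 / theta_n) for _ in range(n)]
    pos = [rng.expovariate(1 / theta_p) for _ in range(n)]
    fpr = sum(x >= t for x in neg) / n   # negatives wrongly classified as positive
    tpr = sum(x >= t for x in pos) / n   # positives correctly classified as positive
    return fpr, tpr


fpr, tpr = simulate_rates(theta_n=1, theta_p=4, t=2)
print(f"FPR ~ {fpr:.3f}, TPR ~ {tpr:.3f}")
```

Repeating the call for a range of thresholds `t` traces out an empirical ROC curve, which can be compared against any closed-form expression you derive.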

a) Which of the ROC curves A, B and C in the diagram below corresponds to each of the situations i) $\theta_P = \theta_N$, ii) $\theta_P = 3\theta_N$, iii) $\theta_P=10\,\theta_N$?

b) Do the following formulas correctly describe the ROC curve and the area under the curve (AUC) for this classification problem? Prove or disprove!

$\mathrm{TPR} = (\mathrm{FPR})^{\theta_N/\theta_P}$ and $\mathrm{AUC} = \left(1+\frac{\theta_N}{\theta_P}\right)^{-1}$.

(Remember: TPR = "True Positive Rate" = TP/(TP+FN) and FPR = "False Positive Rate" = FP/(FP+TN).)

4 Causal Inference [8p]

The following directed acyclic graph illustrates a linear structural causal model. You might find this Google Colab code helpful to generate data and to evaluate your ideas.

We are interested in calculating the causal impact of X on Y, in the course denoted $\frac{\partial }{\partial x}E[Y \mid \mathbf{do}(X:=x)]$. All variables are real-valued numbers, and each node has an associated linear equation indicated by the graph, such as X = c5*B + c6*Z + noise. The model parameters c1, ..., c10 are considered unknown.

Obtaining $X$ and $Y$ is unproblematic: we assume we can obtain a large number $N$ of such data pairs $(X_i, Y_i), i=1,\ldots,N$.

The other variables $A,B,C,D,Z,W$ are, however, difficult to get access to. We therefore want to figure out a smart strategy that requires few of these variables.

a) Describe how the backdoor criterion can be used to find the causal effect of $X$ on $Y$ if we measure 2 of the variables $A,B,C,D,Z,W$. Find all such adjustment sets, with 2 of these variables, that work.

b) A friend of yours suggests the following strategy instead, which only requires the measurement of one variable ($W$):

i) "First do ordinary least squares (OLS) estimation of the form W ~ X. The coefficient in front of X will be a correct estimate of $c_9$ (when you use much data, i.e. $N\to \infty$)."

ii) "Then do OLS estimation Y ~ W. The coefficient in front of W will similarly be a correct estimate of $c_{10}$."

iii) "Multiply these numbers together, since the expression you are looking for is $c_9c_{10}$."

Unfortunately, this procedure does not give the correct causal effect of X on Y. Explain why.

c) Suggest a small change to the procedure in b) which solves the problem, i.e. only data for the variables $(X_i, Y_i, W_i), i=1, \ldots, N$ needs to be obtained to give the correct result (asymptotically for large $N$) for estimating $\frac{\partial }{\partial x}E[Y \mid \mathbf{do}(X:=x)]$.

Hint: Improve step ii).

You do not have to hand in any code on Problem 4.

5 System Identification - Hands-on [12p]

The file sysid05.mat contains some data from a linear system with one input u and one output y sampled at the rate Ts=1.

The code sysid05.m contains an initial investigation of the data and some not so successful identification.

Identify a discrete time model of the system. Aim for using few model parameters. Be sure to describe your methodology, including outlier analysis and suitable preprocessing, choice of suitable model structure and model order, and include model validation with residual analysis etc.

(Hint: Useful commands might include help ident, systemIdentification, arx, oe, armax, bj, present, compare, resid, bodeplot, pzmap, pwelch, detrend, ...)

Hand in code.

6 Parameter Estimation Theory [6p]

The following distribution is sometimes used in economics as a model for a density function with a slowly decaying tail: $p(x) = \theta x^{-\theta-1}, \quad x> 1$

The figure illustrates the pdf $p(x)$ for some different values of the shape parameter $\theta>0$ ($p(x)$ is zero for $x\leq 1$).

We are given N data points $x_1, \ldots, x_N$ independently drawn from this distribution. We want to estimate the parameter $\theta$ using this data.
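To experiment with estimators numerically, data from this distribution can be generated by inverse-CDF sampling. This is only a sketch: the function name `sample_pareto`, the value $\theta=2$ and the sample size are assumptions chosen for illustration, and the check at the end only compares an empirical tail fraction with the value implied by the density.

```python
import random


def sample_pareto(theta, n, seed=0):
    """Draw n samples from p(x) = theta * x**(-theta - 1), x > 1, by inverse-CDF sampling."""
    rng = random.Random(seed)
    # The CDF is F(x) = 1 - x**(-theta); solving F(x) = u gives x = (1 - u)**(-1/theta).
    return [(1.0 - rng.random()) ** (-1.0 / theta) for _ in range(n)]


xs = sample_pareto(theta=2.0, n=50_000)

# Sanity check against the CDF: P(x > 2) should be close to 2**(-theta) = 0.25.
frac_above_2 = sum(x > 2 for x in xs) / len(xs)
print(min(xs) >= 1.0, frac_above_2)
```

An estimate computed from `xs` can then be compared with the known `theta` for increasing sample sizes.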

a) Find a formula for the maximum likelihood estimate $\widehat \theta_{N} := \mathrm{argmax}_\theta \; p(x_1,\ldots,x_N \mid \theta)$.

b) Determine the asymptotic distribution of the MLE $\widehat \theta_{N}$ as $N \to \infty$ (including information about the asymptotic bias and variance).
