FRTN65
Exam Apr 2024
2023 HT/Autumn

  • Due No Due Date
  • Points 50
  • Submitting a file upload

Solutions: exam2024Apr_solutions.pdf

 

Allowed aid: All material is allowed, including old exams, internet access and tools such as ChatGPT.

If ChatGPT or a similar tool is used, we ask you to briefly describe how it was used (on which problems, what kind of prompts, etc.). Note: this information is only used as feedback for future course development; it will not impact your score.

Instructions: Name the files you hand in on Canvas using your anonymization code, such as NR.zip or NR-problem1.pdf. We prefer that all solutions are handed in via Canvas (photos of handwritten solutions are fine). If you really need to hand in handwritten solutions on paper at the exam, that is ok, but these must then be marked with both your anonymization code and your personal identifier.

All solutions must be well motivated. 

Code that is relevant for your solutions should be submitted.

Preliminary limits for grades (out of 50p): 3: 25p, 4: 33p, 5: 42p.

Good luck!


1 Dimensional Analysis [6p]

Multiphase_flow.png (figure is just for illustration)

When studying the interface between different fluids, such as water, oil or air, the relative importance of gravitational forces compared to surface tension forces is important. Assume the relevant variables are

  • LaTeX: \Delta \rho, the difference in density between the two fluids (kg/m^3)
  • LaTeX: g, constant of gravity (m/s^2)
  • LaTeX: L, a characteristic length, for example size of  droplets (m)
  • LaTeX: \sigma, surface tension between the two fluid phases (Newton/m = kg/s^2)

a) Determine a dimensionless quantity LaTeX: \Pi_1 = (\Delta \rho)^a g^b L^c \sigma^d with integers a, b, c, d. Choose LaTeX: a=1.

b) (Make sure you used LaTeX: a=1 in the previous subproblem.) The value of LaTeX: \Pi_1 can be used to predict the shape of bubbles or droplets. In some situations surface tension dominates over gravity, leading to nearly spherical shapes, due to minimization of surface area. In other situations gravitational forces dominate, causing deformation from the spherical shape, such as flattening of bubbles or droplets. Do you think it is very low or very high values of LaTeX: \Pi_1 that correspond to nearly spherical bubble shapes? Motivate your answer.
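One way to sanity-check a hand calculation in a) is to track the base-unit exponents mechanically. The sketch below is only an illustration: the names UNITS, unit_exponents and dimensionless_candidates are hypothetical helpers, and the search range for the integer exponents is an arbitrary assumption. It uses only the units listed in the problem statement.

```python
from itertools import product

# Exponents of the base units (kg, m, s) for each variable, read off from the
# units given in the problem statement.
UNITS = {
    "delta_rho": (1, -3, 0),   # kg/m^3
    "g":         (0, 1, -2),   # m/s^2
    "L":         (0, 1, 0),    # m
    "sigma":     (1, 0, -2),   # kg/s^2
}


def unit_exponents(powers):
    """Return the (kg, m, s) exponents of the product of variables raised to `powers`."""
    totals = [0, 0, 0]
    for name, power in powers.items():
        for i, exponent in enumerate(UNITS[name]):
            totals[i] += power * exponent
    return tuple(totals)


def dimensionless_candidates(search_range=range(-3, 4)):
    """Brute-force all integer (b, c, d) with a = 1 that cancel every base unit."""
    found = []
    for b, c, d in product(search_range, repeat=3):
        powers = {"delta_rho": 1, "g": b, "L": c, "sigma": d}
        if unit_exponents(powers) == (0, 0, 0):
            found.append((1, b, c, d))
    return found


if __name__ == "__main__":
    # Each tuple is a candidate (a, b, c, d).
    print(dimensionless_candidates())
```

The brute-force search is equivalent to solving the three linear equations (one per base unit) for b, c and d, which is the calculation expected by hand.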

 


2 Supervised Learning [12p]

This Google Colab code studies data from different countries concerning citizens' reported happiness and some other variables. The Happiness score is a value between 0 (bad) and 10 (good). (The data is described in the "World Happiness Report". You do not have to study that link to solve the problem.)

We want to make a predictor of average "Happiness" (column 2) given the other variables (columns 3-8).

a) For several countries some data seems to be missing, being replaced with zero values. Discuss some different ways to handle the problem with the missing data. For full points, you should also use one of these methods.

b) The code implements a simple KNN regressor. Running the training several times with different random train-test splits gives an RMS error of around 0.6-0.9, but the values vary significantly due to the randomness in the split. Suggest, and implement, a more accurate way to evaluate the RMS error performance that is less influenced by such randomness. (Note: Don't change the method or hyperparameters in this subproblem. Only improve the way its performance is evaluated.)

c) Improve the RMS error performance of the KNN regressor.

d) Also implement another method of your choice (such as decision tree, random forests, linear regression, SVM, ...).

Do NOT spend too much time trying to optimize performance; correct methodology is the most important thing.

Hand in your code.
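As an illustration of one possible approach to b), the sketch below averages the test RMSE over repeated K-fold splits instead of relying on a single random split. It is not the course's Colab code: the data is synthetic, and make_synthetic_data, knn_predict and repeated_kfold_rmse are hypothetical helpers written with the standard library only, so that the splitting logic is visible.

```python
import math
import random


def make_synthetic_data(n=150, n_features=6, seed=1):
    """Synthetic stand-in for the real data set (an assumption, not the exam data)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]
    y = [5 + 0.5 * sum(row) + rng.gauss(0, 0.3) for row in X]
    return X, y


def knn_predict(train_X, train_y, x, k):
    """Predict by averaging the targets of the k nearest training points."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    return sum(train_y[i] for i in nearest) / k


def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def repeated_kfold_rmse(X, y, k_neighbors=5, n_folds=5, n_repeats=10, seed=0):
    """Return mean and standard deviation of the fold RMSEs over repeated shuffles."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        # Every sample lands in exactly one test fold per repeat.
        folds = [idx[f::n_folds] for f in range(n_folds)]
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in idx if i not in held_out]
            train_X = [X[i] for i in train_idx]
            train_y = [y[i] for i in train_idx]
            preds = [knn_predict(train_X, train_y, X[i], k_neighbors) for i in test_idx]
            scores.append(rmse([y[i] for i in test_idx], preds))
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / (len(scores) - 1))
    return mean, std


if __name__ == "__main__":
    X, y = make_synthetic_data()
    mean, std = repeated_kfold_rmse(X, y)
    print(f"RMSE over repeated K-fold: {mean:.3f} +/- {std:.3f}")
```

With scikit-learn, the same idea is usually expressed with cross_val_score together with RepeatedKFold. Any preprocessing, such as feature scaling or imputation of missing values, should then be fitted on the training folds only.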

 


3 Evaluating classification performance [6p]

Assume a certain medical test delivers a score LaTeX: x>0 which can be used to predict whether you do not have a disease (the "negative" case) or have a disease (the "positive" case). Assume the value of LaTeX: x is random and distributed according to an exponential distribution

LaTeX: p\left(x \right)=\frac{1}{\theta}e^{-x/\theta}, \quad x>0,

where the parameter LaTeX: \theta equals a small value LaTeX: \theta_N in the negative case and a large value LaTeX: \theta_P in the positive case.  Assume we use the following classifier (with a certain threshold LaTeX: t) :

        IF LaTeX: x < t THEN "negative case" ELSE "positive case"

The figure below illustrates the situation  when LaTeX: \theta_N=1,  LaTeX: \theta_P=4 and LaTeX: t=2. We would classify a case with LaTeX: x <2 as "negative" and a value LaTeX: x>2 as "positive".

exponential.png

a) Which of the ROC curves A,B and C in the diagram below correspond to the situations i) LaTeX: \theta_P = \theta_N , ii) LaTeX: \theta_P = 3\theta_N, iii) LaTeX: \theta_P=10 \,\theta_N?

ROC.png

b) Do the following formulas correctly describe the ROC curve and the area under the curve (AUC) for this classification problem? Prove or disprove!

LaTeX: \mathrm{TPR} = (\mathrm{FPR})^{\theta_N/\theta_P}  and  LaTeX: \mathrm{AUC} = \left(1+\frac{\theta_N}{\theta_P}\right)^{-1}.

(Remember: TPR = "True Positive Rate" = TP/(TP+FN) and FPR = "False Positive Rate" = FP/(FP+TN).)
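A numerical experiment does not replace the proof asked for in b), but it can help to catch algebra mistakes. The sketch below draws scores from the two exponential distributions and estimates FPR, TPR and AUC empirically, using the illustrated values LaTeX: \theta_N=1, LaTeX: \theta_P=4 and LaTeX: t=2. The function names, sample size and seed are assumptions made for this illustration; the AUC estimate uses the standard interpretation of AUC as the probability that a random positive case scores higher than a random negative case.

```python
import bisect
import random


def sample_scores(theta, n, rng):
    """Draw n scores from the exponential density with mean theta."""
    return [rng.expovariate(1.0 / theta) for _ in range(n)]


def empirical_rates(neg, pos, t):
    """Return (FPR, TPR) for the rule: x < t -> negative, otherwise positive."""
    fpr = sum(x >= t for x in neg) / len(neg)
    tpr = sum(x >= t for x in pos) / len(pos)
    return fpr, tpr


def empirical_auc(neg, pos):
    """Fraction of (negative, positive) pairs in which the positive scores higher."""
    sorted_neg = sorted(neg)
    # bisect_left counts how many negative scores lie strictly below p.
    wins = sum(bisect.bisect_left(sorted_neg, p) for p in pos)
    return wins / (len(neg) * len(pos))


if __name__ == "__main__":
    rng = random.Random(0)
    neg = sample_scores(1.0, 20000, rng)   # theta_N = 1
    pos = sample_scores(4.0, 20000, rng)   # theta_P = 4
    fpr, tpr = empirical_rates(neg, pos, t=2.0)
    print(f"FPR = {fpr:.3f}, TPR = {tpr:.3f}, AUC = {empirical_auc(neg, pos):.3f}")
```

Sweeping t over a grid and plotting the resulting (FPR, TPR) pairs gives an empirical ROC curve that can be compared with any candidate formula.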

 


4 Causal Inference [8p]

The following directed acyclic graph illustrates a linear structural causal model. You might find this Google Colab code helpful to generate data and to evaluate your ideas.

We are interested in calculating the causal impact of X on Y, in the course denoted LaTeX: \frac{\partial }{\partial x}E[Y \mid \mathbf{do}(X:=x)] . All variables are real-valued numbers, and each node has an associated linear equation indicated by the graph, such as X = c5*B + c6*Z + noise. The model parameters c1, ..., c10 are considered unknown.

Obtaining LaTeX: X and LaTeX: Y is possible without problems: we will assume we can obtain a large number LaTeX: N of such data pairs LaTeX: (X_i, Y_i), i=1,\ldots,N.

The other variables LaTeX: A,B,C,D,Z,W are, however, problematic to get access to. We therefore want to figure out a smart strategy that requires few of these variables.

image.png

 

a) Describe how the backdoor criterion can be used to find the causal effect of LaTeX: X on LaTeX: Y if we measure 2 of the variables LaTeX: A,B,C,D,Z,W. Find all adjustment sets consisting of 2 of these variables that work.

b) A friend of yours suggests the following strategy instead, which only requires the measurement of one variable (LaTeX: W):

  i) "First do ordinary least squares (OLS) estimation of the form W ~ X.  The coefficient in front of X will be a correct estimate of LaTeX: c_9 (when you use much data i.e. LaTeX: N\to \infty)"

  ii) "Then do OLS estimation Y ~ W. The coefficient in front of W will similarly be a correct estimate of LaTeX: c_{10}."

  iii) "Multiply these numbers together, since the expression you are looking for is LaTeX: c_9c_{10}."

Unfortunately, this procedure does not give the correct causal effect of X on Y.  Explain why.

c) Suggest a small change to the procedure in b) that solves the problem, so that only data for the variables LaTeX: (X_i, Y_i, W_i), i=1, \ldots, N are needed to estimate LaTeX: \frac{\partial }{\partial x}E[Y \mid \mathbf{do}(X:=x)] correctly (asymptotically for large N).

Hint: Improve step ii).

You do not have to hand in any code on Problem 4.

 


5 System Identification - Hands-on [12p]

The file sysid05.mat contains some data from a linear system with one input u and one output y sampled at the rate Ts=1.

The code sysid05.m contains an initial investigation of the data and some not so successful identification.

Identify a discrete time model of the system. Aim for using few model parameters. Be sure to describe your methodology, including outlier analysis and suitable preprocessing, choice of suitable model structure and model order, and include model validation with residual analysis etc.

(Hint: Useful commands might include help ident, systemIdentification, arx, oe, armax, bj, present, compare, resid, bodeplot, pzmap, pwelch, detrend, ...)

Hand in code.

 


6 Parameter Estimation Theory [6p]

The following distribution is sometimes used in economics as a model for a density function with a slowly decaying tail: LaTeX: p(x) = \theta x^{-\theta-1}, \quad  x> 1

pareto.png

The figure illustrates the pdf LaTeX: p(x) for some different values of the shape parameter LaTeX: \theta>0 (LaTeX: p(x) is zero for LaTeX: x\leq 1).

We are given N data points  LaTeX: x_1, \ldots, x_N independently drawn from this distribution. We want to estimate the parameter LaTeX: \theta using this data.

a) Find a formula for the maximum likelihood estimate  LaTeX: \widehat \theta_{N} := \mathrm{argmax}_\theta \; p(x_1,\ldots,x_N \mid \theta).

b) Determine the asymptotic distribution of the MLE LaTeX: \widehat \theta_{N} as LaTeX: N \to \infty (including information about the asymptotic bias and variance).
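A closed-form answer to a) and an analytic result for b) can be checked numerically. The sketch below is an illustration under stated assumptions: it samples from LaTeX: p(x) by inverting the CDF LaTeX: F(x) = 1 - x^{-\theta}, maximizes the log-likelihood with a golden-section search, and repeats the experiment to look at the spread of the estimates. The helper names, the search interval and the sample sizes are arbitrary choices, not part of the exam.

```python
import math
import random


def sample_pareto(theta, n, rng):
    """Inverse-CDF sampling: F(x) = 1 - x**(-theta) for x > 1."""
    return [(1.0 - rng.random()) ** (-1.0 / theta) for _ in range(n)]


def log_likelihood(theta, n, sum_log_x):
    """log p(x_1..x_N | theta) = N*log(theta) - (theta + 1) * sum(log x_i)."""
    return n * math.log(theta) - (theta + 1.0) * sum_log_x


def maximize_log_likelihood(xs, lo=1e-3, hi=50.0, tol=1e-6):
    """Golden-section search; applicable because the log-likelihood is concave in theta."""
    n = len(xs)
    sum_log_x = sum(math.log(x) for x in xs)
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if log_likelihood(c, n, sum_log_x) > log_likelihood(d, n, sum_log_x):
            b = d   # the maximum lies in [a, d]
        else:
            a = c   # the maximum lies in [c, b]
    return 0.5 * (a + b)


def estimate_spread(theta, n, n_trials=200, seed=0):
    """Empirical mean and variance of the numerical MLE over repeated data sets."""
    rng = random.Random(seed)
    estimates = [maximize_log_likelihood(sample_pareto(theta, n, rng))
                 for _ in range(n_trials)]
    mean = sum(estimates) / n_trials
    var = sum((e - mean) ** 2 for e in estimates) / (n_trials - 1)
    return mean, var


if __name__ == "__main__":
    mean, var = estimate_spread(theta=2.0, n=500)
    print(f"mean of estimates = {mean:.4f}, variance of estimates = {var:.6f}")
```

Running estimate_spread for increasing n shows how the empirical bias and variance shrink, which can be compared with the asymptotic distribution derived in b).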
