Exam Apr 2024
- Points 50
- Submitting a file upload
Solutions: exam2024Apr_solutions.pdf
Allowed aid: All material is allowed, including old exams, internet access and tools such as ChatGPT.
If ChatGPT or a similar tool is used, we ask you to briefly describe how it was used (on which problems, what kind of prompts, etc.). Note: this information is only used as feedback for future course development; it will not affect your score.
Instructions: Name the files you hand in on Canvas using your anonymization code, for example NR.zip or NR-problem1.pdf. We prefer that all solutions are handed in via Canvas (photos of handwritten solutions are fine). If you really need to hand in handwritten solutions on paper at the exam, that is acceptable, but these must be marked with both your anonymization code and your personal identifier.
All solutions must be well motivated.
Code that is relevant for your solutions should be submitted.
Preliminary limits for grades (out of 50p): 3: 25p, 4: 33p, 5: 42p.
Good luck!
1 Dimensional Analysis [6p] (figure is just for illustration)
When studying the interface between different fluids, such as water, oil or air, the strength of gravitational forces relative to surface tension forces matters. Assume the relevant variables are
Δρ, the difference in density between the two fluids (kg/m^3)
g, the constant of gravity (m/s^2)
L, a characteristic length, for example the size of droplets (m)
σ, the surface tension between the two fluid phases (Newton/m = kg/s^2)
a) Determine a dimensionless quantity Π = Δρ^a g^b L^c σ^d with integers a, b, c, d. Choose
.
b) (Make sure you used the choice specified in a).) The value of this dimensionless quantity can be used to predict the shape of bubbles or droplets. In some situations surface tension dominates over gravity, leading to nearly spherical shapes that minimize the surface area. In other situations gravitational forces dominate, causing deviations from the spherical shape, such as flattening of bubbles or droplets. Do very low or very high values of this quantity correspond to nearly spherical bubble shapes? Motivate your answer.
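One way to sanity-check the exponents in a) is to write the units of each variable as exponents of (kg, m, s) and look for integer combinations in which all units cancel. The sketch below does this by brute force. The dictionary keys and the search range are illustrative choices, not part of the problem statement, and the normalization asked for in a) still has to be applied by hand.

```python
from itertools import product

# Exponents of the base units (kg, m, s) for each variable, read off the units listed above.
UNITS = {
    "density_difference": (1, -3, 0),   # kg/m^3
    "gravity":            (0, 1, -2),   # m/s^2
    "length":             (0, 1, 0),    # m
    "surface_tension":    (1, 0, -2),   # Newton/m = kg/s^2
}

def combined_units(exponents):
    """(kg, m, s) exponents of the product of all variables raised to `exponents`."""
    return tuple(
        sum(e * unit[i] for e, unit in zip(exponents, UNITS.values()))
        for i in range(3)
    )

def dimensionless_combinations(max_abs=2):
    """All nonzero integer exponent tuples in [-max_abs, max_abs] whose units cancel."""
    candidates = product(range(-max_abs, max_abs + 1), repeat=len(UNITS))
    return [c for c in candidates if any(c) and combined_units(c) == (0, 0, 0)]

solutions = dimensionless_combinations()
for exponents in solutions:
    print(dict(zip(UNITS, exponents)))
```

Any integer multiple of a printed exponent tuple is also dimensionless, which is why the problem fixes one of the exponents.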
2 Supervised Learning [12pt]
This Google Colab code studies data from different countries concerning citizens' reported happiness and some other variables. The Happiness score is a value between 0 (bad) and 10 (good). (The data is described here: "World Happiness Report". You do not have to study that link to solve the problem.)
We want to make a predictor of average "Happiness" (column 2) given the other variables (columns 3-8).
a) For several countries some data seems to be missing, being replaced with zero values. Discuss some different ways to handle the problem with the missing data. For full points, you should also use one of these methods.
b) The code implements a simple KNN regressor. If one runs the training several times with different random train-test splits one finds that it gives an RMS error around 0.6-0.9, but the values vary significantly due to the randomness in the split. Suggest, and implement, a more accurate way to evaluate the RMS error performance which is less influenced by such randomness. (Note: Don't change the method or hyperparameters in this subproblem. Only improve the way to evaluate its performance).
c) Improve the RMS error performance of the KNN regressor.
d) Also implement another method of your choice (such as decision tree, random forests, linear regression, SVM, ...).
Do NOT spend too much time trying to optimize performance; correct methodology is the most important thing.
Hand in your code.
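As an illustration of the ideas in a) and b), the sketch below shows one possible way to treat zeros as missing values (median imputation) and one way to reduce the dependence on a single random split (k-fold cross-validation, where every sample is held out exactly once). It is plain Python on hypothetical toy data, not the Colab notebook or the exam data; the function names, the choice of 5 folds and 5 neighbours are all assumptions made for the example.

```python
import math
import random
from statistics import median

def impute_zeros_with_median(rows):
    """Copy of `rows` with zeros replaced by the median of the non-zero values
    in the same column (zeros are ASSUMED to mean 'missing')."""
    medians = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] != 0]
        medians.append(median(observed) if observed else 0.0)
    return [[medians[j] if v == 0 else v for j, v in enumerate(row)] for row in rows]

def knn_predict(train_x, train_y, query, k):
    """Mean target of the k training points closest to `query` (Euclidean distance)."""
    nearest = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def kfold_rmse(x, y, k_neighbors=5, n_folds=5, seed=0):
    """RMS error over all held-out predictions from n_folds-fold cross-validation."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::n_folds] for f in range(n_folds)]  # disjoint, cover all samples
    squared_errors = []
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in indices if i not in held_out]
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in fold:
            prediction = knn_predict(train_x, train_y, x[i], k_neighbors)
            squared_errors.append((prediction - y[i]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical toy rows with zeros standing in for missing values.
toy = [[1.2, 0.0], [0.0, 0.5], [1.0, 0.7], [1.4, 0.9]]
filled = impute_zeros_with_median(toy)

# Hypothetical synthetic regression data: y = 2*x0 - x1 + noise.
rng = random.Random(1)
features = [[rng.random(), rng.random()] for _ in range(200)]
targets = [2 * a - b + rng.gauss(0, 0.1) for a, b in features]
cv_rmse = kfold_rmse(features, targets)
print(f"5-fold cross-validated RMSE: {cv_rmse:.3f}")
```

When imputation is combined with a train/test evaluation, the medians should be computed from the training rows only, so that no information leaks from the held-out data. If scikit-learn is available, `sklearn.model_selection.cross_val_score` with a `KFold` splitter provides the same kind of evaluation.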
3 Evaluating classification performance [6pt]
Assume a certain medical test delivers a score that can be used to predict whether you do not have a disease ("negative" case) or do have it ("positive" case). Assume the score is random and follows an exponential distribution, where the parameter equals a small value in the negative case and a large value in the positive case. Assume we use the following classifier (with a certain threshold):
IF THEN "negative case" ELSE "positive case"
The figure below illustrates the situation when ,
and
. We would classify a case with
as "negative" and a value
as "positive".
a) Which of the ROC curves A,B and C in the diagram below correspond to the situations i) , ii)
, iii)
?
b) Do the following formulas correctly describe the ROC curve and the area under the curve (AUC) for this classification problem? Prove or disprove!
and
.
(Remember: TPR = "True Positive Rate" = TP/(TP+FN) and FPR = False Positive Rate = FP/(FP+TN))
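A candidate formula for the ROC curve or the AUC can be checked numerically before attempting a proof. The sketch below simulates exponentially distributed scores and computes empirical (FPR, TPR) points and an empirical AUC. It ASSUMES that the distribution parameter is the mean of the score and that scores below the threshold are classified as negative; the means 1.0 and 3.0, the thresholds and all function names are hypothetical and not taken from the problem.

```python
import math
import random

def sample_scores(mean, n, rng):
    """Draw n exponentially distributed scores with the given mean (assumed parametrization)."""
    return [rng.expovariate(1.0 / mean) for _ in range(n)]

def rates_at_threshold(neg_scores, pos_scores, threshold):
    """(FPR, TPR) for the assumed rule: score < threshold -> negative, else positive."""
    fpr = sum(s >= threshold for s in neg_scores) / len(neg_scores)
    tpr = sum(s >= threshold for s in pos_scores) / len(pos_scores)
    return fpr, tpr

def empirical_auc(neg_scores, pos_scores, n_pairs, rng):
    """AUC estimated as the fraction of random (negative, positive) pairs ranked correctly."""
    hits = sum(rng.choice(pos_scores) > rng.choice(neg_scores) for _ in range(n_pairs))
    return hits / n_pairs

# Hypothetical parameter values, chosen only for this illustration.
MEAN_NEG, MEAN_POS = 1.0, 3.0
THRESHOLDS = (0.5, 1.0, 2.0, 4.0)

rng = random.Random(0)
neg = sample_scores(MEAN_NEG, 20000, rng)
pos = sample_scores(MEAN_POS, 20000, rng)

roc_points = [rates_at_threshold(neg, pos, t) for t in THRESHOLDS]
auc = empirical_auc(neg, pos, 20000, rng)

for t, (fpr, tpr) in zip(THRESHOLDS, roc_points):
    # Exact tail probability of an exponential distribution: P(score >= t) = exp(-t / mean).
    print(t, fpr, math.exp(-t / MEAN_NEG), tpr, math.exp(-t / MEAN_POS))
print("empirical AUC:", auc)
```

Plugging the same parameter values into a proposed closed-form expression and comparing with these empirical numbers gives a quick indication of whether the expression can be right, but it does not replace the proof asked for in b).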
4 Causal Inference [8pt]
The following directed acyclic graph illustrates a linear structural causal model. You might find this Google Colab code helpful for generating data and evaluating your ideas.
We are interested in calculating the causal impact of X on Y, in the course denoted . All variables are real-valued numbers, and each node has an associated linear equation indicated by the graph, such as X = c5*B + c6*Z + noise. The model parameters c1, ..., c10 are considered unknown.
Obtaining X and Y is unproblematic: we assume we can obtain a large number N of such data pairs.
The other variables are however problematic to get access to. We therefore want to figure out a smart strategy that requires few of these variables.
a) Describe how the backdoor criterion can be used to find the causal effect of X on Y if we measure 2 of the other variables. Find all such adjustment sets, consisting of 2 of these variables, that work.
b) A friend of yours suggests the following strategy instead, which only requires the measurement of one variable (W):
i) "First do ordinary least squares (OLS) estimation of the form W ~ X. The coefficient in front of X will be a correct estimate of (when you use much data i.e.
)"
ii) "Then do OLS estimation Y ~ W. The coefficient in front of W will similarly be a correct estimate of ."
iii) "Multiply these numbers together, since the expression you are looking for is ."
Unfortunately, this procedure does not give the correct causal effect of X on Y. Explain why.
c) Suggest a small change to the procedure in b) that solves the problem, so that only data for the variables X, W and Y needs to be obtained to give the correct result (asymptotically for large N) for the causal effect of X on Y.
Hint: Improve step ii).
You do not have to hand in any code on Problem 4.
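Ideas for b) and c) can be tried out on simulated data. The sketch below uses a HYPOTHETICAL toy model, not the exam graph: an unobserved confounder U affects both X and Y, and X influences Y only through W. It compares the two-step procedure from b) with one possible reading of the hint, namely adding X as an extra regressor in step ii). The coefficients 0.8 and 1.5, the confounder strength and the small `ols` helper are assumptions made for the example.

```python
import random

def ols(regressors, y):
    """Least-squares coefficients of y on the given regressor columns (no intercept)."""
    k = len(regressors)
    # Normal equations (X^T X) beta = X^T y, stored as an augmented k x (k+1) matrix.
    a = [[sum(u * v for u, v in zip(regressors[i], regressors[j])) for j in range(k)]
         + [sum(u * v for u, v in zip(regressors[i], y))] for i in range(k)]
    # Gauss-Jordan elimination, adequate for the tiny well-conditioned systems used here.
    for i in range(k):
        pivot = a[i][i]
        a[i] = [v / pivot for v in a[i]]
        for r in range(k):
            if r != i:
                factor = a[r][i]
                a[r] = [vr - factor * vi for vr, vi in zip(a[r], a[i])]
    return [a[i][k] for i in range(k)]

# Hypothetical toy model with zero-mean variables:
#   X = U + noise,  W = A*X + noise,  Y = B*W + 2*U + noise
A, B, N = 0.8, 1.5, 20000
rng = random.Random(0)
u = [rng.gauss(0, 1) for _ in range(N)]
x = [ui + rng.gauss(0, 1) for ui in u]
w = [A * xi + rng.gauss(0, 1) for xi in x]
y = [B * wi + 2 * ui + rng.gauss(0, 1) for wi, ui in zip(w, u)]

x_to_w = ols([x], w)[0]               # step i): W ~ X
naive_w_to_y = ols([w], y)[0]         # step ii) as proposed: Y ~ W
adjusted_w_to_y = ols([w, x], y)[0]   # step ii) with X added: Y ~ W + X

print("true effect of X on Y:", A * B)
print("product with Y ~ W:", x_to_w * naive_w_to_y)
print("product with Y ~ W + X:", x_to_w * adjusted_w_to_y)
```

In this toy model, regressing Y on W alone picks up the association that runs from W back through X and U to Y, while including X in the regression blocks that path. Whether the same argument applies to the exam graph has to be checked against its own backdoor paths.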
5 System Identification - Hands-on [12p]
The file sysid05.mat contains some data from a linear system with one input u and one output y sampled at the rate Ts=1.
The code sysid05.m contains an initial investigation of the data and a not very successful identification attempt.
Identify a discrete time model of the system. Aim for using few model parameters. Be sure to describe your methodology, including outlier analysis and suitable preprocessing, choice of suitable model structure and model order, and include model validation with residual analysis etc.
(Hint: Useful commands might include help ident, systemIdentification, arx, oe, armax, bj, present, compare, resid, bodeplot, pzmap, pwelch, detrend, ...)
Hand in code.
6 Parameter Estimation Theory [6p]
The following distribution is sometimes used in economics as a model for a density function with a slowly decaying tail:
The figure illustrates the pdf for some different values of the shape parameter
(
is zero for
).
We are given N data points independently drawn from this distribution. We want to estimate the shape parameter using this data.
a) Find a formula for the maximum likelihood estimate of the shape parameter.
b) Determine the asymptotic distribution of the MLE as N → ∞ (including information about the asymptotic bias and variance).
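A derived estimator can be sanity-checked by simulation. The sketch below does this under the ASSUMPTION that the density is the Pareto form f(x) = a * x^(-(a+1)) for x >= 1 with shape parameter a; if the density in the problem differs, the estimator and the standard error below do not apply. The true shape 3.0, the sample size and the function name are hypothetical.

```python
import math
import random

def pareto_shape_mle(samples):
    """MLE of the shape a for the ASSUMED pdf f(x) = a * x**(-(a + 1)), x >= 1.

    Setting the derivative of the log-likelihood N*log(a) - (a + 1)*sum(log x_i)
    to zero gives a = N / sum(log x_i).
    """
    return len(samples) / sum(math.log(x) for x in samples)

# Hypothetical settings for the check.
TRUE_SHAPE, N = 3.0, 20000
rng = random.Random(0)

# random.paretovariate draws from the Pareto distribution with minimum value 1.
data = [rng.paretovariate(TRUE_SHAPE) for _ in range(N)]
shape_hat = pareto_shape_mle(data)

# For the assumed model the Fisher information per sample is 1 / a**2,
# so the asymptotic standard deviation of the MLE is a / sqrt(N).
approx_std_error = shape_hat / math.sqrt(N)

print("estimate:", shape_hat)
print("approximate standard error:", approx_std_error)
```

Repeating the simulation with many seeds and plotting a histogram of the estimates is one way to compare against the asymptotic normal distribution asked for in b).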