This assignment does not count toward the final grade.

Exam Apr 2023


Solutions to Exam Apr 2023

--------------------------------------------------------

Instructions: If you hand in any handwritten solutions, mark them with both

your anonymization code, and
a personal identifier chosen by you.

You also need to name the files you hand in on Canvas using your anonymization code, such as NR.zip or NR-problem1.pdf (this helps us match handwritten solutions to the correct persons in Canvas during the anonymized exam grading).

All solutions must be well motivated. Code should be handed in for problems 1, 3, 4 and 6. The code should be understandable; comments will help.

Max points on the exam is 50.
Grade limits: 3: 25, 4: 32, 5: 42.

1. Supervised ML (nonlinear regression) [10pt]

This Google Colab file contains data for a nonlinear regression problem where 3 input features should be used to predict a real-valued output y. You are also given some initial code for a KNN regressor and for a DecisionTree regressor.

Improve the solution, for instance by improving the data preprocessing, optimizing hyperparameters, and improving the evaluation of performance.

Hand in your code and a summary of your results.

Hint: Do not spend too much time on optimizing performance. The main thing to show is that you are handling the problem in an appropriate way. Also, do not spend time on trying other methods, study only the KNN and DT regressors.
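
One way to organize such a study is sketched below. It assumes the Colab notebook uses scikit-learn (the text does not say so); the function name tune_and_evaluate, the parameter grids and the synthetic stand-in data are illustrative assumptions, not the exam data or the intended solution. The sketch scales the features inside a pipeline (which matters for KNN, since distances depend on feature scale), selects hyperparameters by 5-fold cross-validation on the training part only, and reports the score on a held-out test set.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor


def tune_and_evaluate(X, y, seed=0):
    """Tune KNN and DT regressors by 5-fold CV and score them on held-out data."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    candidates = {
        # Scaling sits inside the pipeline so that it is refit on every CV fold.
        "knn": (
            Pipeline([("scale", StandardScaler()),
                      ("model", KNeighborsRegressor())]),
            {"model__n_neighbors": [1, 3, 5, 10, 20],
             "model__weights": ["uniform", "distance"]},
        ),
        # Trees are invariant to monotone feature scaling, so no scaler here.
        "dt": (
            Pipeline([("model", DecisionTreeRegressor(random_state=seed))]),
            {"model__max_depth": [2, 4, 8, None],
             "model__min_samples_leaf": [1, 5, 20]},
        ),
    }
    results = {}
    for name, (pipe, grid) in candidates.items():
        search = GridSearchCV(pipe, grid, cv=5, scoring="r2")
        search.fit(X_train, y_train)
        results[name] = {
            "best_params": search.best_params_,
            "cv_r2": search.best_score_,               # mean R^2 over CV folds
            "test_r2": search.score(X_test, y_test),   # R^2 on unseen data
        }
    return results


# Synthetic stand-in data with three features on very different scales
# (an assumption for illustration only; replace with the data from the notebook).
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(300, 3)) * np.array([1.0, 100.0, 0.01])
y_demo = (np.sin(3 * X_demo[:, 0]) + X_demo[:, 1] / 100.0
          + rng.normal(scale=0.05, size=300))
results = tune_and_evaluate(X_demo, y_demo)
for name, res in results.items():
    print(name, res)
```

Keeping the test set out of the hyperparameter search is the main design choice here: the cross-validated score is used for model selection, and the test score is only read once at the end.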

2. Problem Dimensionality Analysis [4pt]

The data in the previous problem concern a physical phenomenon that depends on the following physical variables:

$g$ - the gravitational acceleration (m/s^2)
$\mu$ - the dynamic viscosity of a fluid (Pa·s or kg/(m·s))
$\rho$ - the density of the fluid (kg/m^3)
$\sigma$ - the surface tension between the fluid and a bubble (N/m or kg/s^2)

a) Find integers $a, b, c, d$ (not all zero) so that $\Pi = g^a \mu^b \rho^c \sigma^d$ is a dimensionless variable.

b) The 3 input features in problem 1 were actually $\mu_i, \rho_i, \sigma_i$ for the data points $i = 1, \ldots, N$. Use this information to suggest an improved predictor for problem 1. You only have to describe your suggestion in text; you do not need to hand in any code.

Hint: Plot the outputs as a function of $\Pi_i$ for the different data points $i = 1, \ldots, N$.

Remark: Do not use the information in problem 2 when solving problem 1.
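
The condition in a) can be checked mechanically: write each variable's units as exponents of mass, length and time, and require that the weighted sum of exponents vanishes in every base dimension. The sketch below does this by brute force over small integers. The names DIMS, net_dimension and dimensionless_candidates, as well as the search range, are assumptions made for illustration; the unit exponents are read off the list of variables above.

```python
from itertools import product

# (mass, length, time) exponents of each variable, taken from its SI units.
DIMS = {
    "g":     (0, 1, -2),   # m / s^2
    "mu":    (1, -1, -1),  # kg / (m * s)
    "rho":   (1, -3, 0),   # kg / m^3
    "sigma": (1, 0, -2),   # kg / s^2
}
NAMES = ("g", "mu", "rho", "sigma")


def net_dimension(exponents):
    """Return the (M, L, T) exponents of g^a * mu^b * rho^c * sigma^d."""
    return tuple(
        sum(e * DIMS[name][k] for e, name in zip(exponents, NAMES))
        for k in range(3)
    )


def dimensionless_candidates(max_abs=4):
    """All nonzero integer tuples (a, b, c, d), |value| <= max_abs, with no net dimension."""
    values = range(-max_abs, max_abs + 1)
    return [
        e for e in product(values, repeat=4)
        if any(e) and net_dimension(e) == (0, 0, 0)
    ]


print(dimensionless_candidates())
```

With four variables and three base dimensions, the solutions form a one-dimensional family, so every tuple found is an integer multiple of the same basic solution.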

3. System Identification (ARX vs BoxJenkins) [10pt]

The file systemiddata.mat contains some data from a linear system with one input u and one output y, sampled with sampling period h=0.2.

The code sysidproblem.m contains an initial identification of an ARX model with 6 parameters, and a limited evaluation of this model. From the plot of y we can clearly see an oscillatory behavior in the signal, which the ARX model has captured in the system dynamics.

We suspect the observed oscillations can be due to structured noise, and have decided to identify and evaluate a Box-Jenkins model as an alternative.

Task: Find a Box-Jenkins model of the system and compare the performance with the ARX model. Your model should contain at most 6 parameters. Explain why your results indicate that the Box-Jenkins model is better (if you believe it is so...).

Note: You do not have to try any OE or ARMAX structures, or optimize the ARX model further. To save time you also do not have to split data into training and test sets (even if this is normally recommended).
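
For reference, the two model structures can be written as follows, in the standard notation with the shift operator $q$ and white noise $e(t)$ (the polynomial names follow the usual system identification convention and are an assumption about the course notation):

```latex
% ARX: the noise is filtered through 1/A(q), so dynamics and noise share poles
A(q)\, y(t) = B(q)\, u(t) + e(t)

% Box-Jenkins: the noise model C(q)/D(q) is parameterized independently
% of the system dynamics B(q)/F(q)
y(t) = \frac{B(q)}{F(q)}\, u(t) + \frac{C(q)}{D(q)}\, e(t)
```

Because the ARX structure forces the noise to share the poles of $A(q)$ with the input-output dynamics, an oscillation that really comes from the noise can end up in the system dynamics. The Box-Jenkins structure can place it in $C(q)/D(q)$ instead.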

4. SVD / PCA for Leukemia diagnosis [8pt]

Download the file cancer.mat, which contains a matrix of size 130 × 22282 with measurements for 130 leukemia patients. The patients have been classified into either "type T leukemia" (13 patients) or "type B leukemia" (117 patients). The file also contains similar measurement data (vectors of length 22282) for two patients that we want to classify as either "type T" or "type B".

You should use singular value decomposition (SVD) to project this high-dimensional data into a figure in 2D, using the two largest principal components:

a) The file leukemia.m loads the data and performs a random (bad) projection of the data into 2D. The data is not preprocessed. The task is to improve this code by choosing an appropriate preprocessing (some alternatives are given in the code) and to use the result from the SVD to project the data into 2D.

b) Plot the patient1 and patient2 data in the same figure, using the same transformation as in a). Use what you see in the figure to diagnose the two patients into either "type T" or "type B".
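
The mechanics of a) and b) can be sketched as below. This is a Python analogue for illustration only (the exam code leukemia.m is MATLAB), run on random stand-in data rather than cancer.mat. Centering the columns is just one of the possible preprocessing choices and is an assumption here, as are the names fit_pca_2d and project. The point the sketch illustrates is that new patients must be transformed with the same mean and the same two singular vectors as the training data.

```python
import numpy as np


def fit_pca_2d(X):
    """Center the columns of X and return (mean, V2), with V2 of shape (features, 2)."""
    mean = X.mean(axis=0)
    # Economy-size SVD: X - mean = U @ diag(s) @ Vt, singular values in
    # decreasing order, so the first two rows of Vt span the 2D PCA plane.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:2].T


def project(X, mean, V2):
    """Project one sample or a matrix of samples with a previously fitted transform."""
    return (np.atleast_2d(X) - mean) @ V2


# Random stand-in data (assumption): 130 samples, 500 features instead of 22282.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(130, 500))
new_patient = rng.normal(size=500)

mean, V2 = fit_pca_2d(X_demo)
Z = project(X_demo, mean, V2)          # 130 x 2 coordinates for the scatter plot
z_new = project(new_patient, mean, V2)  # same transformation for a new sample
print(Z.shape, z_new.shape)
```

Refitting the SVD with the new patients included, or centering them with their own mean, would place them in a different coordinate system than the labeled points.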

5. Causal Inference [8pt]

The DAG describes a linear structural causal model between a variable X and an outcome Y. There are also other variables A, B, C, D that influence the relation between X and Y. We are interested in estimating the causal effect that an intervention on X has on Y, i.e. what we in the course denoted $\frac{\partial}{\partial x} E[Y \mid \mathbf{do}(X := x)]$.

a) Draw an updated diagram illustrating what it means to do an intervention on X.

b) What is the causal effect of X on Y? Express the answer in terms of the coefficients c1, c2, ... , c9.

c) List all backdoor paths from X to Y (in the original DAG).

d) Assume we have historical data, obtained before any intervention. Could the results of regression Y ~ X + A + B + C + D - 1 (with notation as in the course) be used to find the causal effect of X on Y ? Explain how, or why it could not.

e) Assume historical data is costly to obtain. The costs of the different variables are A: 10; B: 20; C: 30; D: 40. Variables X and Y are available without any cost. Find the cheapest set of variables needed to find the causal effect of X on Y, i.e. $\frac{\partial}{\partial x} E[Y \mid \mathbf{do}(X := x)]$, and suggest a regression that can be used (give an expression of the form Y ~ ...).

Hint: There is no data or code for this problem. But if you really want to verify your conclusions, you can generate data yourself, if time permits. Code from previous exams can be used as a starting point for this.
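
A minimal sketch of such a self-generated check is given below. Since the exam's DAG is not reproduced here, it uses a hypothetical three-variable graph (A → X, A → Y, X → Y) with made-up coefficients; only the workflow carries over: simulate from the linear model, then compare regressions with and without the adjustment variables. The helper name ols is likewise an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
true_effect = 1.5  # hypothetical coefficient on the edge X -> Y

# Hypothetical linear SCM with a single confounder A: A -> X, A -> Y, X -> Y.
A = rng.normal(size=n)
X = 2.0 * A + rng.normal(size=n)
Y = true_effect * X + 3.0 * A + rng.normal(size=n)


def ols(y, *regressors):
    """Least-squares coefficients of y on the regressors, without intercept."""
    design = np.column_stack(regressors)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


naive = ols(Y, X)[0]        # Y ~ X - 1: the backdoor path X <- A -> Y is open
adjusted = ols(Y, X, A)[0]  # Y ~ X + A - 1: conditioning on A blocks that path
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}, true: {true_effect}")
```

In this toy graph the naive slope converges to $1.5 + 3 \cdot \mathrm{Cov}(A, X)/\mathrm{Var}(X) = 2.7$ rather than 1.5, while the adjusted regression recovers the coefficient on X → Y.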

6. Grey Box Identification (Thermal Modeling of Mobile Phone) [10pt]

The figure below illustrates a thermal model for a mobile phone. Two inputs are modeled: u1 = temperature of the main internal heat source, u2 = temperature of the environment. Four temperature sensors are available, measuring temperatures at the display, flash, battery and USB; in the MATLAB code these are available as y1, y2, y3 and y4.

a) The figure shows an electrical diagram. Explain how such a diagram can represent a model of the thermal behavior. Use the analogy between thermal and electrical domains.
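
As background for a), the standard thermal-electrical analogy can be summarized in one heat balance equation. The double-index notation $g_{ij}$ for the conductance between nodes $i$ and $j$ is an assumption made here for generality; it is not the parameter numbering used in the exam files.

```latex
% Analogy: temperature T  <-> voltage,    heat flow        <-> current,
%          thermal resistance R <-> resistor, heat capacity C <-> capacitor.
% Heat balance at node i (the counterpart of Kirchhoff's current law):
C_i \frac{dT_i}{dt} = \sum_{j} g_{ij}\,\bigl(T_j - T_i\bigr),
\qquad g_{ij} = \frac{1}{R_{ij}}
```

Prescribed temperatures, such as the heat source and the environment, then play the role of voltage sources in the circuit.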

Download greyboxproblem.m, phoneheat.m and greydata.mat.

greyboxproblem.m: Runs a greybox identification of parameters = [c1,c2,c3,c4,g1,g2,g3,g4,g5,g6,g7]. (Here g_i = 1/R_i are used as parameters instead of R_i, since this has been found to be more efficient.) You should improve this file.
phoneheat.m: Function calculating the A, B, C, D matrices of the system given a parameter vector. You need not change this file, but might want to study it.
greydata.mat: Experimental data used in the grey box identification (data.u has size 1501 × 2 and data.y has size 1501 × 4, corresponding to 1501 time instances with a sampling period of 0.2 s).

The initial results are not that good, as seen in the produced figures and printouts: the model fit to data is poor [in fact, the figures seem to depend on the MATLAB version; with MATLAB 2022 they seem to look fine] and the reported standard deviations of the parameters are huge. Note also that one singular value of P in particular is extremely large. [These issues typically remain a problem even when the figures look fine.]

b) Improve the performance of the greybox identification. In the code some hints are given that might be useful.

c) Also explain why the initial identification had problems. (Hint: Identifiability)

(Note: You do not have to split data into training and test data in this problem, even if this is good practice in general).

** Good luck ! **

https://imgs.xkcd.com/comics/data_trap.png
