January 2022 Take Home Exam

Solutions to this exam are available here: exam2022Jan_solutions.pdf

-----------------------

You are not allowed to discuss the exam with anyone other than bo.bernhardsson@control.lth.se

The maximum score on the exam is 60 points. The time limit is 48 hours.

The hand-in format is flexible: PDF report, scanned handwritten text, commented Colab notebooks, etc. Solutions should, however, be clearly readable and well motivated; handing in only code with no text is, for example, not OK.

Good luck.

Problem 1 [10 points] Dimensional analysis

When watching New Year's celebrations, LTH students Truls and Trula start discussing how the total energy $E$ [joule] of an explosion is related to how the shock wave radius $R$ [meter] expands with time $t$ [second]. They agree that properties of air must be relevant and therefore guess that the air density $\rho$ [$\mathrm{kg/m^3}$] should be involved.

a) Use dimensional analysis to find exponents $a,b,c,d$ such that a physical relation of the form $E^aR^bt^c\rho^d = \mathrm{Const}$ holds, where $\mathrm{Const}$ is a dimensionless constant.
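One way to set up the exponent search is as a null-space problem for the matrix of unit exponents. The following is only a minimal sketch, assuming SI base units kg, m, s and the normalization $a=1$; the names `D` and `exponents` are introduced here for illustration.

```python
import numpy as np

# Dimension matrix: rows are the base units (kg, m, s),
# columns are the quantities (E, R, t, rho).
#   E ~ kg m^2 s^-2,  R ~ m,  t ~ s,  rho ~ kg m^-3
D = np.array([
    [1.0, 0.0, 0.0, 1.0],    # kg
    [2.0, 1.0, 0.0, -3.0],   # m
    [-2.0, 0.0, 1.0, 0.0],   # s
])

# E^a R^b t^c rho^d is dimensionless exactly when D @ [a, b, c, d] = 0,
# so the exponent vector lies in the null space of D.
_, _, vt = np.linalg.svd(D)
null_vector = vt[-1]  # last right singular vector spans the null space (rank 3)

# The null space is one-dimensional; fix the scale by choosing a = 1
# (an assumed normalization, any nonzero multiple works equally well).
exponents = null_vector / null_vector[0]
a, b, c, d = exponents
print(f"a={a:.3f}, b={b:.3f}, c={c:.3f}, d={d:.3f}")
```

Any nonzero multiple of the resulting exponent vector describes the same relation, so the choice $a=1$ is only a convention.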

b) Verify that the following data roughly fit such a relation. The data are from an explosion with a certain unknown energy $E$.

time t [millisecond]	3	5	15	60
radius R [meter]	57	70	106	180
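A possible way to check such data against a power law is a straight-line fit in log-log coordinates. The sketch below assumes a relation of the form $R \propto t^p$ and estimates $p$ by ordinary least squares; the variable names are introduced here for illustration, and the fitted slope can then be compared with the ratio of exponents found in a).

```python
import math

# Data from the table above (time converted from milliseconds to seconds).
t_ms = [3.0, 5.0, 15.0, 60.0]
R_m = [57.0, 70.0, 106.0, 180.0]

log_t = [math.log(t * 1e-3) for t in t_ms]
log_R = [math.log(r) for r in R_m]

# Least-squares line log R = intercept + slope * log t.
n = len(log_t)
mean_t = sum(log_t) / n
mean_R = sum(log_R) / n
sxy = sum((x - mean_t) * (y - mean_R) for x, y in zip(log_t, log_R))
sxx = sum((x - mean_t) ** 2 for x in log_t)
slope = sxy / sxx
intercept = mean_R - slope * mean_t

print(f"fitted exponent p = {slope:.3f}, log-prefactor = {intercept:.3f}")
```

With only four data points the fit is rough, so agreement with the predicted exponent should be judged loosely.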

c) Assume that $\mathrm{Const}\approx 1$ holds and that $\rho \approx 1.0 \textrm{ kg}/{\mathrm{m}^3}$ for air. Estimate the energy $E$ using the data in the table above.

d) Also find exponents $f,g$ and $h$ in an expression of the form $P =R^fE^g\rho^h$ describing how the peak pressure $P$ [pascal] should depend on radial distance $R$, energy $E$ and air density $\rho$.

Problem 2 [12 points] Supervised Learning Hands-on

This Google Colab Python notebook loads data from a number of patients (save a copy to your own drive). The first 12 columns contain measured features such as age, sex, blood sample data, etc. The last column, "DEATH_EVENT", indicates whether the patient is alive (0) or dead (1) at a follow-up some time later. The time, in days, to this follow-up is in the 12th column; it is also illustrated in the code. There is also code that constructs and evaluates some simple KNN classifiers; this attempt is, however, not very professionally done.

a) Mention some obvious drawback(s) of this attempt at classifier design.

b) Design at least two better, reasonable classifiers and evaluate their performance. Make sure to document all steps and motivate your design choices, such as data treatment, algorithm choices, hyperparameter tuning and performance evaluation.

c) It is questionable whether the algorithm should be allowed to use the feature "time" in the classifier construction. Furthermore, using fewer features would be advantageous. Choose two of the feature columns from the first 11 columns (motivate your choice) and make a new classifier.

Problem 3 [10 points] System Identification Hands-on

The file problem3.mat contains input $u$ and output $y$ for a SISO system with sample rate $h=0.1$. Identify a discrete-time linear model of the system. Aim for using few model parameters. Be sure to describe your methodology, including outlier analysis, choice of suitable model structure and model order, and include model validation with residual analysis. Also hand in your MATLAB code.

(Hint: Useful commands might include help ident, systemIdentification, arx, oe, armax, bj, present, compare, resid, bodeplot, pzmap, ...)

Problem 4 [10 points] System Identification Theory

We want to estimate parameters $a$ and $b$ in a nonlinear input-output relation of the form

$y(t) = a u(t) + b \exp(u(t)) + e(t), \;\; t=1,\ldots, N$

where $u(t)$ are random independent samples drawn from a uniform distribution on the interval $[0,1]$ and $e(t) \in N(0,1)$ is white Gaussian noise, independent of $u$. The signals $y(t)$ and $u(t)$ are measured in an identification experiment for $t=1,\ldots, N$. Note that $u(t)$ is NOT Gaussian.

a) Describe how to use least squares linear regression to estimate the true parameter vector $\theta_0 = \begin{bmatrix} a\\ b\end{bmatrix}$.
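To experiment with this setup numerically, the data-generating model can be simulated. The following is only a sketch under stated assumptions: the true values `a_true` and `b_true`, the sample size and the seed are hypothetical choices made for illustration, and the regressor matrix `Phi` simply stacks the two known functions of $u(t)$ that multiply $a$ and $b$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true parameters and sample size, chosen only for illustration.
a_true, b_true = 2.0, -1.0
N = 20_000

# Simulate the model y(t) = a u(t) + b exp(u(t)) + e(t).
u = rng.uniform(0.0, 1.0, size=N)   # uniform input on [0, 1]
e = rng.standard_normal(N)          # white N(0, 1) noise, independent of u
y = a_true * u + b_true * np.exp(u) + e

# The model is linear in (a, b): one regressor row [u(t), exp(u(t))] per sample.
Phi = np.column_stack([u, np.exp(u)])

# Least-squares estimate: minimizes ||y - Phi @ theta||^2 over theta.
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("estimate of [a, b]:", np.round(theta_hat, 3))
```

Repeating the simulation with different seeds and values of `N` gives an empirical picture of how the estimation error behaves, which can be compared with the theoretical answer to b).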

b) Describe the asymptotic statistical properties of the estimation error $\hat \theta_N-\theta_0$ when $N\to\infty$: Will the estimate $\hat \theta_N$ be bias-free? Gaussian? Also describe the asymptotic error covariance matrix $E[(\hat \theta_N-\theta_0)(\hat \theta_N-\theta_0)^T]$ as $N\to \infty$.

c) Is the maximum likelihood (ML) estimate of $\theta_0$ given by the solution to the least squares problem in a)?

Problem 5 [10 points] Causal inference and DAGs

The following DAG describes the relations between variables x1, x2, x3 and y. The real number x1 is related to the age of individuals; x2 to their smoking habits; x3 to HIV infection; and y to stroke, with higher values indicating more of the corresponding quantity (the variables are assumed to have already been transformed to make the linear SCM below reasonable; we will not go into details about how this was done).

The following linear structural causal model (SCM) connected to this DAG is assumed to hold:

x1 = n1
x2 = c21*x1 + n2
x3 = c31*x1 + c32*x2 + n3
y = c41*x1 + c42*x2 + c43*x3 + bias + ny

where n1, n2, n3, ny are random Gaussian N(0,1) variables. The coefficients c21, c31, c32, c41, c42, c43 and bias are unknown real numbers. Measurements of (x1,x2,x3,y) from 10000 individuals are available.

The situation is modeled in the Google Colab code here.
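Independently of the linked notebook, which is not reproduced here, the structural equations above can be simulated directly. The sketch below is an illustration only: the numerical coefficient values, the seed and the helper `ols` are hypothetical, and plain least squares with an intercept stands in for the formula-style regressions referred to in the questions.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000

# Hypothetical coefficient values, chosen only for illustration.
c21, c31, c32 = 0.5, 0.3, 0.4
c41, c42, c43 = 0.2, 0.6, 0.8
bias = 1.0

# Independent N(0, 1) noise sources, one row per variable.
n1, n2, n3, ny = rng.standard_normal((4, N))

# Structural equations, evaluated in the causal order of the DAG.
x1 = n1
x2 = c21 * x1 + n2
x3 = c31 * x1 + c32 * x2 + n3
y = c41 * x1 + c42 * x2 + c43 * x3 + bias + ny


def ols(target, *regressors):
    """Least-squares fit of target on the regressors plus an intercept.

    Returns [intercept, coefficient of regressors[0], ...].
    """
    X = np.column_stack([np.ones(len(target)), *regressors])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef


# Counterpart of the formula "y ~ x1 + x2 + x3".
coef_full = ols(y, x1, x2, x3)
print("y ~ x1 + x2 + x3:", np.round(coef_full, 3))
```

Other regressions are obtained by changing the arguments, for example `ols(y, x3)` or `ols(y, x1, x2)`, which makes it easy to compare fitted coefficients against the values used in the simulation.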

a) The causal effect of x3=HIV on y=stroke is given by c43, i.e. $\frac{\partial }{\partial x}E[y \mid \mathbf{do}(x_3:=x)] = c_{43}$. Describe how one of the following linear regressions (in Python's ols notation) can be used to determine this c43 factor, and also describe why the other expressions will give wrong results:

y ~ x3
y ~ x2 + x3
y ~ x1 + x2 + x3

b) The causal effect of x1=age on y=stroke is described by a combination of several different paths (it is NOT only c41). Find an expression for it (in terms of the c-coefficients), i.e. calculate $\frac{\partial }{\partial x}E[y \mid \mathbf{do}(x_1:=x)]$. Also verify using a code example (hint: use the Google Colab code as a base) that this causal effect of age on stroke can be obtained, without knowledge of the $c$-coefficients, from the available data (x1, x2, x3 and y) using one of the linear regressions below:

y ~ x1
y ~ x1 + x2
y ~ x1 + x2 + x3

c) Does any of the following linear regressions

y ~ x2
y ~ x1 + x2
y ~ x1 + x2 + x3

correctly estimate the causal effect of x2=smoke on y=stroke, i.e. calculate $\frac{\partial }{\partial x}E[y \mid \mathbf{do}(x_2:=x)]$? Motivate your answer.

Problem 6 [8 points] Estimation theory

Assume that we want to estimate the parameter $\theta$, a real number, in the model

$y_n = \exp( \theta t_n) + e_n, \;\; n=1,\ldots,N$

We have data $y_n$ and $t_n$ available for $n=1,\ldots, N$, but $e_n\in N(0,1)$ is unknown Gaussian white noise.

Show that the variance of any bias-free estimator $\hat \theta_N$ of $\theta$ must satisfy

$E(\hat \theta_N -\theta)^2 \geq \left( \sum_{n=1}^N t_n^2 \exp{(2t_n\theta)}\right)^{-1}$
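A possible starting point is the Cramér-Rao bound for a scalar parameter that enters through a known mean function in independent unit-variance Gaussian noise. The outline below is a sketch under that assumption; the symbols $f_n$, $p$ and $I_N$ are introduced here for illustration and do not appear in the problem statement.

```latex
\begin{aligned}
y_n &= f_n(\theta) + e_n, \qquad e_n \in N(0,1) \text{ independent},\\
\log p(y_1,\ldots,y_N;\theta) &= -\frac{N}{2}\log(2\pi) - \frac{1}{2}\sum_{n=1}^N \bigl(y_n - f_n(\theta)\bigr)^2,\\
I_N(\theta) &= -E\!\left[\frac{\partial^2}{\partial\theta^2}\log p(y_1,\ldots,y_N;\theta)\right]
             = \sum_{n=1}^N \left(\frac{\partial f_n(\theta)}{\partial\theta}\right)^2,\\
E(\hat\theta_N-\theta)^2 &\geq I_N(\theta)^{-1}.
\end{aligned}
```

For the model above, $f_n(\theta)=\exp(\theta t_n)$, so the remaining step is to evaluate $\partial f_n/\partial\theta$ and insert it into $I_N(\theta)$.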