January 2022 Take Home Exam

Solutions to this exam are available here: exam2022Jan_solutions.pdf

-----------------------

You are not allowed to discuss the exam with anyone other than bo.bernhardsson@control.lth.se.

The maximum score on the exam is 60 points. The time limit is 48 hours.

The hand-in format is flexible: a PDF report, scanned handwritten text, commented Colab notebooks, etc. Solutions should, however, be clearly readable and well motivated; handing in only code with no text is, for example, not OK.

Good luck.

Problem 1 [10 points] Dimensional analysis

When watching New Year's celebrations, LTH students Truls and Trula start discussing how the total energy $E$ [Joule] of an explosion is related to how the shock wave radius $R$ [meter] expands with time $t$ [second]. They agree that properties of air must be relevant and therefore guess that the air density $\rho$ [$\mathrm{kg/m^3}$] should be involved.

a) Use dimensional analysis to find exponents $a, b, c, d$ such that a physical relation of the form $E^a R^b t^c \rho^d = \mathrm{Const}$ holds, where $\mathrm{Const}$ is a dimensionless constant.

b) Verify that the following data roughly fit such a relation. The data are from an explosion with a certain unknown energy $E$.

time t [millisecond]    3     5     15     60
radius R [meter]        57    70    106    180
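One possible way to check for a power law is a straight-line fit in log-log coordinates: if $R \propto t^k$, then $\log R$ is linear in $\log t$ with slope $k$, and the fitted slope can be compared with the ratio $-c/b$ of the exponents from a). The sketch below is only an illustration; the helper name `loglog_fit` is an assumption, not part of the exam material.

```python
import math

# Data from the table above; times converted from milliseconds to seconds.
times_s = [3e-3, 5e-3, 15e-3, 60e-3]
radii_m = [57.0, 70.0, 106.0, 180.0]


def loglog_fit(x, y):
    """Least-squares fit of log(y) = slope * log(x) + intercept."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    slope = sxy / sxx
    return slope, my - slope * mx


slope, intercept = loglog_fit(times_s, radii_m)
print(f"fitted exponent of t: {slope:.3f}")
```

With only four data points the fitted slope is a rough indication, so "roughly fit" should be read with that in mind.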

c) Assume that $\mathrm{Const} \approx 1$ holds and that $\rho \approx 1.0 \textrm{ kg}/\mathrm{m}^3$ for air. Estimate the energy $E$ using the data in the table above.

d) Also find exponents $f, g$ and $h$ in an expression of the form $P = R^f E^g \rho^h$ describing how the peak pressure $P$ [Pascal] should depend on the radial distance $R$, energy $E$ and air density $\rho$.
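The unit bookkeeping in parts a) and d) amounts to a small linear system: one equation per base unit (kg, m, s), with the exponents as unknowns. The following is a minimal sketch of that idea, assuming one exponent is normalized to 1 to remove the free scaling; the names `DIMENSIONS`, `solve_linear` and `dimensionless_exponents` are hypothetical helpers introduced here for illustration.

```python
from fractions import Fraction

# Exponents of the SI base units (kg, m, s) for each quantity in the problem.
DIMENSIONS = {
    "E": (1, 2, -2),     # joule  = kg m^2 s^-2
    "R": (0, 1, 0),      # meter
    "t": (0, 0, 1),      # second
    "rho": (1, -3, 0),   # kg m^-3
    "P": (1, -1, -2),    # pascal = kg m^-1 s^-2
}


def solve_linear(matrix, rhs):
    """Solve a square linear system exactly by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Pick a row with a non-zero entry in this column (fails if singular).
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def dimensionless_exponents(fixed, others):
    """Exponents x such that fixed^1 * prod(q^x[q]) carries no units."""
    n_units = len(DIMENSIONS[fixed])
    matrix = [[DIMENSIONS[q][u] for q in others] for u in range(n_units)]
    rhs = [-DIMENSIONS[fixed][u] for u in range(n_units)]
    return dict(zip(others, solve_linear(matrix, rhs)))


exponents_e = dimensionless_exponents("E", ["R", "t", "rho"])
print({name: str(value) for name, value in exponents_e.items()})
```

The same helper can be pointed at `"P"` with `["R", "E", "rho"]`; the exponents in the form $P = R^f E^g \rho^h$ are then the negatives of the returned values, since the helper makes the whole product dimensionless.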

 

Problem 2 [12 points] Supervised Learning Hands-on

This Google Colab Python notebook loads data from a number of patients (save a copy to your local drive). The first 12 columns contain measured features such as age, sex and blood sample data. The last column, "DEATH_EVENT", indicates whether the patient is alive (0) or dead (1) at a follow-up some time later. The time, in days, to this follow-up is in the 12th column; it is also illustrated in the code. There is also code that constructs and evaluates some simple KNN classifiers; this attempt is, however, not very professionally done.

a) Mention some obvious drawback(s) of this attempt at classifier design.

b) Design at least two reasonable, improved classifiers and evaluate their performance. Make sure to document all steps and motivate your design choices, such as data treatment, algorithm choice, hyperparameter tuning and performance evaluation.

c) It is questionable whether the algorithm should be allowed to use the feature "time" in the classifier construction. Furthermore, using fewer features would be advantageous. Choose two of the feature columns from the first 11 columns (motivate your choice) and make a new classifier.

 

Problem 3 [10 points] System Identification Hands-on

The file problem3.mat contains input $u$ and output $y$ for a SISO system with sample rate $h=0.1$. Identify a discrete-time linear model of the system. Aim to use few model parameters. Be sure to describe your methodology, including outlier analysis and the choice of a suitable model structure and model order, and include model validation with residual analysis. Also hand in your MATLAB code.

(Hint: Useful commands might include help ident, systemIdentification, arx, oe, armax, bj, present, compare, resid, bodeplot, pzmap, ...)

 

Problem 4 [10 points] System Identification Theory

We want to estimate parameters $a$ and $b$ in a nonlinear input-output relation of the form

$$y(t) = a u(t) + b \exp(u(t)) + e(t), \;\; t=1,\ldots,N$$

where $u(t)$ are random independent samples drawn from a uniform distribution on the interval $[0,1]$ and $e(t) \in N(0,1)$ is white Gaussian noise, independent of $u$. The signals $y(t)$ and $u(t)$ are measured in an identification experiment for $t=1,\ldots,N$. Note that $u$ is NOT Gaussian.

a) Describe how to use least squares linear regression to estimate the true parameter vector $\theta_0 = \begin{bmatrix} a\\ b\end{bmatrix}$.

b) Describe the asymptotic statistical properties of the estimation error $\hat \theta_N-\theta_0$ when $N\to\infty$: Will the estimate $\hat \theta_N$ be bias-free? Gaussian? Also describe the asymptotic error covariance matrix $E[(\hat \theta_N-\theta_0)(\hat \theta_N-\theta_0)^T]$ as $N\to \infty$.

c) Is the maximum likelihood (ML) estimate of $\theta_0$ given by the solution to the least squares problem in a)?
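Because the model is linear in the parameters $a$ and $b$ even though it is nonlinear in $u$, a numerical experiment along the lines of part a) might look like the sketch below: stack the regressors $\varphi(t) = [u(t), \exp(u(t))]^T$ and solve the 2x2 normal equations. The true values `a_true` and `b_true`, the sample size and the helper names are assumptions chosen only for illustration.

```python
import math
import random


def simulate(a, b, n, rng):
    """Draw n samples from y = a*u + b*exp(u) + e with u ~ U[0,1], e ~ N(0,1)."""
    u = [rng.random() for _ in range(n)]
    y = [a * ui + b * math.exp(ui) + rng.gauss(0.0, 1.0) for ui in u]
    return u, y


def least_squares(u, y):
    """Solve the 2x2 normal equations for regressors phi(t) = [u(t), exp(u(t))]."""
    s11 = s12 = s22 = r1 = r2 = 0.0
    for ui, yi in zip(u, y):
        p1, p2 = ui, math.exp(ui)
        s11 += p1 * p1
        s12 += p1 * p2
        s22 += p2 * p2
        r1 += p1 * yi
        r2 += p2 * yi
    det = s11 * s22 - s12 * s12
    a_hat = (s22 * r1 - s12 * r2) / det
    b_hat = (s11 * r2 - s12 * r1) / det
    return a_hat, b_hat


rng = random.Random(0)                 # fixed seed for reproducibility
a_true, b_true = 2.0, -1.0             # hypothetical values, not from the exam
u, y = simulate(a_true, b_true, 20000, rng)
a_hat, b_hat = least_squares(u, y)
print(f"a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}")
```

Repeating the experiment with different seeds and values of $N$ gives an empirical feel for the spread of the estimates asked about in b).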

 

Problem 5 [10 points] Causal inference and DAGs

The following DAG describes the relations between the variables x1, x2, x3 and y. The real number x1 is related to the age of individuals, x2 to their smoking habits, x3 to HIV infection, and y to stroke, with higher values indicating a higher level of the underlying quantity (the variables are assumed to have already been transformed to make the linear SCM below reasonable; we will not go into details about how this was done).

[Figure: causal01-1.png, the DAG relating x1, x2, x3 and y]

The following linear structural causal model (SCM) connected to this DAG is assumed to hold:

x1 = n1
x2 = c21*x1 + n2
x3 = c31*x1 + c32*x2 + n3
y = c41*x1 + c42*x2 + c43*x3 + bias + ny

where n1, n2, n3 and ny are Gaussian N(0,1) random variables. The coefficients c21, c31, c32, c41, c42, c43 and bias are unknown real numbers. Measurements of (x1, x2, x3, y) from 10000 individuals are available.

The situation is modeled in the Google Colab code here.
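As a rough, dependency-free illustration of this model (it is not the course's Colab notebook), the sketch below draws data from the SCM and fits the full regression of y on x1, x2 and x3 by ordinary least squares. The coefficient values in `COEF` and the helpers `sample_scm` and `ols` are assumptions made up for this example, and the noise terms are assumed to be drawn independently; in practice a formula-based OLS routine would normally be used instead of hand-written normal equations.

```python
import random

# Hypothetical coefficient values chosen for illustration only.
COEF = {"c21": 0.5, "c31": 0.3, "c32": -0.4,
        "c41": 0.2, "c42": 0.6, "c43": 0.8, "bias": 1.0}


def sample_scm(n, rng, c=COEF):
    """Draw n individuals (x1, x2, x3, y) from the linear SCM."""
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x2 = c["c21"] * x1 + rng.gauss(0, 1)
        x3 = c["c31"] * x1 + c["c32"] * x2 + rng.gauss(0, 1)
        y = (c["c41"] * x1 + c["c42"] * x2 + c["c43"] * x3
             + c["bias"] + rng.gauss(0, 1))
        rows.append((x1, x2, x3, y))
    return rows


def ols(regressors, target):
    """Least-squares coefficients [intercept, b1, b2, ...] via normal equations."""
    X = [[1.0, *row] for row in regressors]
    k = len(X[0])
    # Augmented normal equations [X^T X | X^T y].
    aug = [[sum(r[i] * r[j] for r in X) for j in range(k)]
           + [sum(r[i] * t for r, t in zip(X, target))] for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


data = sample_scm(10000, random.Random(1))
beta = ols([(x1, x2, x3) for x1, x2, x3, _ in data], [y for *_, y in data])
print("y ~ x1 + x2 + x3:", [round(b, 2) for b in beta])
```

Passing a subset of the columns to `ols` gives the smaller regressions, so the fitted coefficients can be compared across the candidate formulas.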

a) The causal effect of x3=HIV on y=stroke is given by c43, i.e. $\frac{\partial}{\partial x}E[y \mid \mathbf{do}(x_3:=x)] = c_{43}$. Describe how one of the following linear regressions (in Python's ols notation) can be used to determine this factor c43, and also describe why the other expressions give wrong results:

y ~ x3
y ~ x2 + x3
y ~ x1 + x2 + x3

b) The causal effect of x1=age on y=stroke is described by a combination of several different paths (it is NOT only c41). Find an expression for it in terms of the c-coefficients, i.e. calculate $\frac{\partial}{\partial x}E[y \mid \mathbf{do}(x_1:=x)]$. Also verify using a code example (hint: use the Google Colab code as a base) that this causal effect of age on stroke can be obtained, without knowledge of the $c$-coefficients, from the available data (x1, x2, x3 and y) using one of the linear regressions below:

y ~ x1
y ~ x1 + x2
y ~ x1 + x2 + x3

c) Do any of the following linear regressions

y ~ x2
y ~ x1 + x2
y ~ x1 + x2 + x3

correctly estimate the causal effect of x2=smoke on y=stroke, i.e. $\frac{\partial}{\partial x}E[y \mid \mathbf{do}(x_2:=x)]$? Motivate your answer.

Problem 6 [8 points] Estimation theory

Assume that we want to estimate the parameter $\theta$, a real number, in the model

$$y_n = \exp(\theta t_n) + e_n, \;\; n=1,\ldots,N$$

We have data $y_n$ and $t_n$ available for $n=1,\ldots,N$, but $e_n \in N(0,1)$ is unknown Gaussian white noise.

Show that the variance of any bias-free estimator $\hat \theta_N$ of $\theta$ must satisfy

$$E(\hat \theta_N -\theta)^2 \geq \left( \sum_{n=1}^N t_n^2 \exp(2 t_n \theta)\right)^{-1}$$
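To get a feel for the size of this lower bound, its right-hand side can simply be evaluated numerically. The sketch below does that for a made-up set of sampling instants and a few values of $\theta$; the function name `variance_lower_bound` and all numerical values are assumptions for illustration, not part of the exam.

```python
import math


def variance_lower_bound(theta, times):
    """Right-hand side of the bound: 1 / sum_n t_n^2 * exp(2 * t_n * theta)."""
    information = sum(t * t * math.exp(2.0 * t * theta) for t in times)
    return 1.0 / information


times = [0.1 * n for n in range(1, 11)]   # hypothetical sampling instants
for theta in (-1.0, 0.0, 1.0):            # hypothetical parameter values
    bound = variance_lower_bound(theta, times)
    print(f"theta = {theta:+.1f}: bound = {bound:.4f}")
```

For positive sampling instants the sum grows with $\theta$, so the bound tightens as $\theta$ increases, and it also decreases as more samples are added.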